Thread: including PID or backend ID in relpath of temp rels
Time for a new thread specific to this subject. For previous discussion, see here: http://archives.postgresql.org/pgsql-hackers/2010-04/msg01140.php http://archives.postgresql.org/pgsql-hackers/2010-04/msg01152.php I attempted to implement this by adding an isTemp argument to relpath, but ran into problems. It turns out that when we create a temporary relation and then exit the backend, the relation is merely truncated it, and it's the background writer which actually removes the file following the next checkpoint. Therefore, relpath() for the temprel must return the same answer in the background writer as it does in the original backend, so passing isTemp isn't enough - we actually need to pass whatever identifier we're including in the file name. As far as I can see, though I'm not 100% sure of this, it looks like we never actually ask the background writer to fsync any of these files because we never fsync them at all; but we do ask it to remove them, which is enough to create a problem. So, what to do about this? Ideas: 1. We could move the responsibility for removing the files associated with temp rels from the background writer to the owning backend. I think the reason why we initially truncate the files and only later remove them is because somebody else might have 'em open, so it mightn't be necessary for temp rels. 2. Instead of embedding a PID or backend ID in the filename, we could just embed a boolean: isTemp or not? This seems like cutting ourselves off from quite a bit of useful information but maybe it would be OK. We could nuke all the temp stuff on cluster startup, but we'd have to rely on catalog entries to identify orphaned files that accumulated during normal running, which isn't ideal since one of our long-term goals is to eliminate the need for those catalog entries. 3. We could change RelFileNode.relNode from an OID to an unsigned 32-bit integer drive off of a separate counter, and reserve some portion of the 4 billion available values for temp relations. I doubt we'd have enough bits to embed something like a PID though, so this would end up being basically an embedded boolean, along the lines of #2. 4. We could add an additional 32-bit value to RelFileNode to identify the backend (or a sentinel value when not temp) and create a separate structure XLogRelFileNode or PermRelFileNode or somesuch for use in contexts where no temp rels are allowed. Either #3 or #4 has some possible advantages for Hot Standby in terms of perhaps making it feasible to assign relfilenodes on a standby server without danger of conflicting with one already assigned on the master. 5. ??? Thoughts? ...Robert
On Sun, Apr 25, 2010 at 8:07 PM, Robert Haas <robertmhaas@gmail.com> wrote: > > 1. We could move the responsibility for removing the files associated > with temp rels from the background writer to the owning backend. I > think the reason why we initially truncate the files and only later > remove them is because somebody else might have 'em open, so it > mightn't be necessary for temp rels. > what happens if the backend crash and obviously doesn't remove the file associated with temp rels? -- Atentamente, Jaime Casanova Soporte y capacitación de PostgreSQL Asesoría y desarrollo de sistemas Guayaquil - Ecuador Cel. +59387171157
On Sun, Apr 25, 2010 at 10:19 PM, Jaime Casanova <jcasanov@systemguards.com.ec> wrote: > On Sun, Apr 25, 2010 at 8:07 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> >> 1. We could move the responsibility for removing the files associated >> with temp rels from the background writer to the owning backend. I >> think the reason why we initially truncate the files and only later >> remove them is because somebody else might have 'em open, so it >> mightn't be necessary for temp rels. >> > > what happens if the backend crash and obviously doesn't remove the > file associated with temp rels? Currently, they just get orphaned. As I understand it, if the catalog entry survives the crash, autovacuum will remove them 2 BILLION transactions later (and emit warning messages in the meantime); otherwise we won't even know they're there. As I further understand it, the main point of this change is that if temporary tables have a distinctive name of some kind, then when we can run through the directory and blow away files with those names without fearing that it's *permanent* table data that somehow got orphaned. ...Robert
On Sun, Apr 25, 2010 at 9:07 PM, Robert Haas <robertmhaas@gmail.com> wrote: > 4. We could add an additional 32-bit value to RelFileNode to identify > the backend (or a sentinel value when not temp) and create a separate > structure XLogRelFileNode or PermRelFileNode or somesuch for use in > contexts where no temp rels are allowed. I experimented with this approach and created LocalRelFileNode and GlobalRelFileNode and, for use in the buffer headers, BufferRelFileNode (same as GlobalRelFileNode, but named differently for clarity). LocallRelFileNode = GlobalRelFileNode + the ID of the owning backend for temp rels; or InvalidBackendId if referencing a non-temporary rel. These might not be the greatest names, but I think the concept is good, because it really breaks the things that need to be adjusted quite thoroughly. In the course of repairing the damage I came across a couple of things I wasn't sure about: [relcache.c] RelationInitPhysicalAddr can't initialize relation->rd_node.backend properly for a non-local temporary relation, because that information isn't available. But I'm not clear on why we would need to create a relcache entry for a non-local temporary relation. If we do need to, then we'll probably need to store the backend ID in pg_class. That seems like something that would be best avoided, all things being equal, especially since I can't see how to generalize it to global temporary tables. [smgr.c,inval.c] Do we need to call CacheInvalidSmgr for temporary relations? I think the only backend that can have an smgr reference to a temprel other than the owning backend is bgwriter, and AFAICS bgwriter will only have such a reference if it's responding to a request by the owning backend to unlink the associated files, in which case (I think) the owning backend will have no reference. [dbsize.c] As with relcache.c, there's a problem if we're asked for the size of a temporary relation that is not our own: we can't call relpath() without knowing the ID of the owning backend, and there's no way to acquire that information for pg_class. I guess we could just refuse to answer the question in that case, but that doesn't seem real cool. Or we could physically scan the directory for files that match a suitably constructed wildcard, I suppose. [storage.c,xact.c,twophase.c] smgrGetPendingDeletes returns via an out parameter (its second argument) a list of RelFileNodes pending delete, which we then write to WAL or to the two-phase state file. Of course, if the backend ID (or pid, but I picked backend ID somewhat arbitrarily) is part of the filename, then we need to write that to WAL, too. It seems somewhat unfortunate to have to WAL-log temprels here; as best I can tell, this is the only case where it's necessary.But if we implement a more general mechanism for cleaningup temp files, then might the need to do this go away? Not sure. [syncscan.c] It seems we pursue this optimization even for temprels; I can't think of why that would be useful in practice. If it's useless overhead, should we skip it? This is really independent of this project; just a side thought. ...Robert
Robert Haas escribió: > [smgr.c,inval.c] Do we need to call CacheInvalidSmgr for temporary > relations? I think the only backend that can have an smgr reference > to a temprel other than the owning backend is bgwriter, and AFAICS > bgwriter will only have such a reference if it's responding to a > request by the owning backend to unlink the associated files, in which > case (I think) the owning backend will have no reference. Hmm, wasn't there a proposal to have the owning backend delete the files instead of asking the bgwriter to? > [dbsize.c] As with relcache.c, there's a problem if we're asked for > the size of a temporary relation that is not our own: we can't call > relpath() without knowing the ID of the owning backend, and there's no > way to acquire that information for pg_class. I guess we could just > refuse to answer the question in that case, but that doesn't seem real > cool. Or we could physically scan the directory for files that match > a suitably constructed wildcard, I suppose. I don't very much like the wildcard idea; but I don't think it's unreasonable to refuse to provide a file size. If the owning backend has still got part of the table in local buffers, you'll get a misleading answer, so perhaps it's best to not give an answer at all. Maybe this problem could be solved if we could somehow force that backend to write down its local buffers, in which case it'd be nice to have a solution to the dbsize problem. > [syncscan.c] It seems we pursue this optimization even for temprels; I > can't think of why that would be useful in practice. If it's useless > overhead, should we skip it? This is really independent of this > project; just a side thought. Maybe recently used buffers are more likely to be in the OS page cache, so perhaps it's not good to disable it. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
On Tue, May 4, 2010 at 2:06 PM, Alvaro Herrera <alvherre@commandprompt.com> wrote: > Robert Haas escribió: Hey, thanks for writing back! I just spent the last few hours thinking about this and beating my head against the wall. >> [smgr.c,inval.c] Do we need to call CacheInvalidSmgr for temporary >> relations? I think the only backend that can have an smgr reference >> to a temprel other than the owning backend is bgwriter, and AFAICS >> bgwriter will only have such a reference if it's responding to a >> request by the owning backend to unlink the associated files, in which >> case (I think) the owning backend will have no reference. > > Hmm, wasn't there a proposal to have the owning backend delete the files > instead of asking the bgwriter to? I did propose that upthread; it may have been proposed previously also. This might be worth doing independently of the rest of the patch (which I'm starting to fear is doomed, cue ominous soundtrack) since it would reduce the chance of orphaning data files and possibly simplify the logic also. >> [dbsize.c] As with relcache.c, there's a problem if we're asked for >> the size of a temporary relation that is not our own: we can't call >> relpath() without knowing the ID of the owning backend, and there's no >> way to acquire that information for pg_class. I guess we could just >> refuse to answer the question in that case, but that doesn't seem real >> cool. Or we could physically scan the directory for files that match >> a suitably constructed wildcard, I suppose. > > I don't very much like the wildcard idea; but I don't think it's > unreasonable to refuse to provide a file size. If the owning backend > has still got part of the table in local buffers, you'll get a > misleading answer, so perhaps it's best to not give an answer at all. > > Maybe this problem could be solved if we could somehow force that > backend to write down its local buffers, in which case it'd be nice to > have a solution to the dbsize problem. I'm sure we could add some kind of signaling mechanism that would tell all backends to flush their local buffers, but I'm not too sure it would help this case very much, because you likely wouldn't want to wait for all the backends to complete that process before reporting results. >> [syncscan.c] It seems we pursue this optimization even for temprels; I >> can't think of why that would be useful in practice. If it's useless >> overhead, should we skip it? This is really independent of this >> project; just a side thought. > > Maybe recently used buffers are more likely to be in the OS page cache, > so perhaps it's not good to disable it. I don't get it. If the whole relation fits in the page cache, it doesn't much matter where you start a seqscan. If it doesn't, starting where the last one ended is anti-optimal. ...Robert
Robert Haas escribió: > On Tue, May 4, 2010 at 2:06 PM, Alvaro Herrera > <alvherre@commandprompt.com> wrote: > > Robert Haas escribió: > > Hey, thanks for writing back! I just spent the last few hours > thinking about this and beating my head against the wall. :-) > >> [smgr.c,inval.c] Do we need to call CacheInvalidSmgr for temporary > >> relations? I think the only backend that can have an smgr reference > >> to a temprel other than the owning backend is bgwriter, and AFAICS > >> bgwriter will only have such a reference if it's responding to a > >> request by the owning backend to unlink the associated files, in which > >> case (I think) the owning backend will have no reference. > > > > Hmm, wasn't there a proposal to have the owning backend delete the files > > instead of asking the bgwriter to? > > I did propose that upthread; it may have been proposed previously > also. This might be worth doing independently of the rest of the patch > (which I'm starting to fear is doomed, cue ominous soundtrack) since > it would reduce the chance of orphaning data files and possibly > simplify the logic also. +1 for doing it separately, but hopefully that doesn't mean the rest of this patch is doomed ... > >> [dbsize.c] As with relcache.c, there's a problem if we're asked for > >> the size of a temporary relation that is not our own: we can't call > >> relpath() without knowing the ID of the owning backend, and there's no > >> way to acquire that information for pg_class. I guess we could just > >> refuse to answer the question in that case, but that doesn't seem real > >> cool. Or we could physically scan the directory for files that match > >> a suitably constructed wildcard, I suppose. > > > > I don't very much like the wildcard idea; but I don't think it's > > unreasonable to refuse to provide a file size. If the owning backend > > has still got part of the table in local buffers, you'll get a > > misleading answer, so perhaps it's best to not give an answer at all. > > > > Maybe this problem could be solved if we could somehow force that > > backend to write down its local buffers, in which case it'd be nice to > > have a solution to the dbsize problem. > > I'm sure we could add some kind of signaling mechanism that would tell > all backends to flush their local buffers, but I'm not too sure it > would help this case very much, because you likely wouldn't want to > wait for all the backends to complete that process before reporting > results. Hmm, I was thinking in the pg_relation_size function -- given this new mechanism you could get an accurate size of temp tables for other backends. I wasn't thinking in the pg_database_size function, and perhaps it's better to *not* include temp tables in that report at all. > >> [syncscan.c] It seems we pursue this optimization even for temprels; I > >> can't think of why that would be useful in practice. If it's useless > >> overhead, should we skip it? This is really independent of this > >> project; just a side thought. > > > > Maybe recently used buffers are more likely to be in the OS page cache, > > so perhaps it's not good to disable it. > > I don't get it. If the whole relation fits in the page cache, it > doesn't much matter where you start a seqscan. If it doesn't, > starting where the last one ended is anti-optimal. Err, I was thinking that a syncscan started a bunch of pages earlier than the point where the previous scan ended, but yeah, that's a bit silly. Maybe we should just ignore syncscan in temp tables altogether, as you propose. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
On Tue, May 4, 2010 at 3:03 PM, Alvaro Herrera <alvherre@commandprompt.com> wrote: >> > Hmm, wasn't there a proposal to have the owning backend delete the files >> > instead of asking the bgwriter to? >> >> I did propose that upthread; it may have been proposed previously >> also. This might be worth doing independently of the rest of the patch >> (which I'm starting to fear is doomed, cue ominous soundtrack) since >> it would reduce the chance of orphaning data files and possibly >> simplify the logic also. > > +1 for doing it separately, but hopefully that doesn't mean the rest of > this patch is doomed ... I wonder if it would be possible to reject access to temporary relations at a higher level. Right now, if you create a temporary relation in one session, you can issue a SELECT statement against it in another relation, and get back 0 rows. If you then insert data into it and select against it again, you'll get an error saying that you can't access temporary tables of other sessions. If you try to truncate somebody else's temporary relation, it fails; but if you try to drop it, it works. In fact, you can even run ALTER TABLE ... ADD COLUMN on somebody else's temp table, as long as you don't do anything that requires a rewrite. CLUSTER fails; VACUUM and VACUUM FULL both appear to work but apparently actually don't do anything under the hood, so that database-wide vacuums don't barf. The whole thing seems pretty leaky. It would be nice if we could find a small set of control points where we basically reject ALL access to somebody else's temp relations, period. One possible thing we might do (bearing in mind that we might need to wall off access at multiple levels) would be to forbid creating a relcache entry for a non-local temprel. That would, in turn, forbid doing pretty much anything to such a relation, although I'm not sure what else would get broken in the process. But it would eliminate, for example, all the checks for RELATION_IS_OTHER_TEMP, since that Just Couldn't Happen. It would would eliminate the need to install specific handling for this case in dbsize.c - we'd just automatically croak. And it's also probably necessary to do this anyhow if we want to ever eliminate those CacheInvalidSmgr() calls for temp rels, because if I can drop your temprel, that implies I can smgropen() it. ...Robert
Robert Haas <robertmhaas@gmail.com> writes: > One possible thing we might do (bearing in mind that we might need to > wall off access at multiple levels) would be to forbid creating a > relcache entry for a non-local temprel. That would, in turn, forbid > doing pretty much anything to such a relation, although I'm not sure > what else would get broken in the process. Dropping temprels left behind by a crashed backend would get broken by that; which is a deal-breaker, because we have to be able to clean those up. regards, tom lane
Alvaro Herrera <alvherre@commandprompt.com> writes: > I don't very much like the wildcard idea; but I don't think it's > unreasonable to refuse to provide a file size. If the owning backend > has still got part of the table in local buffers, you'll get a > misleading answer, so perhaps it's best to not give an answer at all. FWIW, that's not the case, anymore than it is for blocks in shared buffer cache for regular rels. smgrextend() results in an observable extension of the file EOF immediately, whether or not you can see up-to-date data for those pages. Now people have often complained about the extra I/O involved in that, and it'd be nice to have a solution, but it's not clear to me that fixing it would be harder for temprels than regular rels. regards, tom lane
On Tue, May 4, 2010 at 5:12 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> One possible thing we might do (bearing in mind that we might need to >> wall off access at multiple levels) would be to forbid creating a >> relcache entry for a non-local temprel. That would, in turn, forbid >> doing pretty much anything to such a relation, although I'm not sure >> what else would get broken in the process. > > Dropping temprels left behind by a crashed backend would get broken by > that; which is a deal-breaker, because we have to be able to clean those > up. Phooey. It was such a good idea in my head. ...Robert
On Tue, Apr 27, 2010 at 9:59 PM, Robert Haas <robertmhaas@gmail.com> wrote: > [storage.c,xact.c,twophase.c] smgrGetPendingDeletes returns via an out > parameter (its second argument) a list of RelFileNodes pending delete, > which we then write to WAL or to the two-phase state file. It appears that we are playing a little bit fast and loose with this. I think that the two-phase code path is solid because we prohibit PREPARE TRANSACTION if the transaction has referenced any temporary tables, so when we read the two-phase state file it's safe to assume that all the tables mentioned are non-temporary. But the ordinary one-phase commit writes permanent and temporary relfilenodes to WAL without distinction, and then, in xl_redo_commit() and xl_redo_abort(), does this: XLogDropRelation(xlrec->xnodes[i], fork); smgrdounlink(srel, fork, false, true); The third argument to smgrdounlink() is "isTemp", which we're here passing as false, but might really be true. I don't think it technically matters at present because the only effect of that parameter right now is that we pass it through to DropRelFileNodeBuffers(), which will drop shared buffers rather than local buffers as a result of the incorrect setting. But that won't matter because the WAL replay process shouldn't have any local buffers anyway, since temp relations are not otherwise WAL-logged. For the same reason, I don't think the call to XLogDropRelation() is an issue because its only purpose is to remove entries from invalid_page_tab, and there won't be any temporary pages in there anyway. Of course if we're going to do $SUBJECT this will need to be changed anyway, but assuming the above analysis is correct I think the existing coding at least deserves a comment... then again, maybe I'm all mixed up? ...Robert
On Tue, May 4, 2010 at 3:03 PM, Alvaro Herrera <alvherre@commandprompt.com> wrote: >> >> [smgr.c,inval.c] Do we need to call CacheInvalidSmgr for temporary >> >> relations? I think the only backend that can have an smgr reference >> >> to a temprel other than the owning backend is bgwriter, and AFAICS >> >> bgwriter will only have such a reference if it's responding to a >> >> request by the owning backend to unlink the associated files, in which >> >> case (I think) the owning backend will have no reference. This turns out to be wrong, I think. It seems that what we do is prevent backends other than the opening backend from reading pages from non-local temp rels into private or shared buffers, but we don't actually prevent them from having smgr references. This allows autovacuum to drop them, for example, in an anti-wraparound situation. (Thanks to Tom for helping me get my head around this better.) >> > Hmm, wasn't there a proposal to have the owning backend delete the files >> > instead of asking the bgwriter to? >> >> I did propose that upthread; it may have been proposed previously >> also. This might be worth doing independently of the rest of the patch >> (which I'm starting to fear is doomed, cue ominous soundtrack) since >> it would reduce the chance of orphaning data files and possibly >> simplify the logic also. > > +1 for doing it separately, but hopefully that doesn't mean the rest of > this patch is doomed ... Doom has been averted. Proposed patch attached. It passes regression tests and seems to work, but could use additional testing and, of course, some code-reading also. Some notes on this patch: It seems prett clear that it isn't desirable to simply add backend ID to RelFileNode, because there are too many places using RelFileNode already for purposes where the backend ID can be inferred from context, such as buffer headers and most of xlog. Instead, I introduced BackendRelFileNode, which consists of an ordinary RelFileNode augmented with a backend ID, and use that only where needed. In particular, the smgr layer must use BackendRelFileNode throughout, since it operates on both permanent and temporary relations. smgr invalidations must also include the backend ID. xlog generally happens only for non-temporary relations and can thus continue to use an ordinary RelFileNode; however, commit/abort records must use BackendRelFileNode as there may be physical storage associated with temporary relations that must be unlinked. Communication with the bgwriter must use BackendRelFileNode for similar reasons. The relcache now stores rd_backend rather than rd_islocaltemp so that it remains straightforward to call smgropen() based on a relcache entry. Some smgr functions no longer require an isTemp argument, because they can infer the necessary information from their BackendRelFileNode. smgrwrite() and smgrextend() now take a skipFsync argument rather than an isTemp argument. I'm not totally sure whether it makes sense to do what we were talking about above, viz, transfer unlink responsibility for temp rels from the bgwriter to the owning backend. I haven't done that here. Nor have I implemented any kind of improved temporary file cleanup strategy, though I hope such a thing is possible. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Attachment
On May 6, 2010, at 10:24 PM, Robert Haas wrote: > On Tue, May 4, 2010 at 3:03 PM, Alvaro Herrera > <alvherre@commandprompt.com> wrote: >>>>> [smgr.c,inval.c] Do we need to call CacheInvalidSmgr for temporary >>>>> relations? I think the only backend that can have an smgr reference >>>>> to a temprel other than the owning backend is bgwriter, and AFAICS >>>>> bgwriter will only have such a reference if it's responding to a >>>>> request by the owning backend to unlink the associated files, in which >>>>> case (I think) the owning backend will have no reference. > > This turns out to be wrong, I think. It seems that what we do is > prevent backends other than the opening backend from reading pages > from non-local temp rels into private or shared buffers, but we don't > actually prevent them from having smgr references. This allows > autovacuum to drop them, for example, in an anti-wraparound situation. > (Thanks to Tom for helping me get my head around this better.) > >>>> Hmm, wasn't there a proposal to have the owning backend delete the files >>>> instead of asking the bgwriter to? >>> >>> I did propose that upthread; it may have been proposed previously >>> also. This might be worth doing independently of the rest of the patch >>> (which I'm starting to fear is doomed, cue ominous soundtrack) since >>> it would reduce the chance of orphaning data files and possibly >>> simplify the logic also. >> >> +1 for doing it separately, but hopefully that doesn't mean the rest of >> this patch is doomed ... > > Doom has been averted. Proposed patch attached. It passes regression > tests and seems to work, but could use additional testing and, of > course, some code-reading also. > > Some notes on this patch: > > It seems prett clear that it isn't desirable to simply add backend ID > to RelFileNode, because there are too many places using RelFileNode > already for purposes where the backend ID can be inferred from > context, such as buffer headers and most of xlog. Instead, I > introduced BackendRelFileNode, which consists of an ordinary > RelFileNode augmented with a backend ID, and use that only where > needed. In particular, the smgr layer must use BackendRelFileNode > throughout, since it operates on both permanent and temporary > relations. smgr invalidations must also include the backend ID. xlog > generally happens only for non-temporary relations and can thus > continue to use an ordinary RelFileNode; however, commit/abort records > must use BackendRelFileNode as there may be physical storage > associated with temporary relations that must be unlinked. > Communication with the bgwriter must use BackendRelFileNode for > similar reasons. The relcache now stores rd_backend rather than > rd_islocaltemp so that it remains straightforward to call smgropen() > based on a relcache entry. Some smgr functions no longer require an > isTemp argument, because they can infer the necessary information from > their BackendRelFileNode. smgrwrite() and smgrextend() now take a > skipFsync argument rather than an isTemp argument. > > I'm not totally sure whether it makes sense to do what we were talking > about above, viz, transfer unlink responsibility for temp rels from > the bgwriter to the owning backend. I haven't done that here. Nor > have I implemented any kind of improved temporary file cleanup > strategy, though I hope such a thing is possible. Any particular reason not to use directories to help organize things? IE: base/database_oid/pg_temp_rels/backend_pid/relfilenode Perhaps relfilenode should be something else. This seems to have several advantages: 1: It's more organized. If you want to see all the files for a single backend you have just one place to look. Finding everythingis still easy via filesystem find. 2: Cleanup becomes easier. When a backend exits, it's entire directory goes away. On server start, everything under pg_temp_relsgoes away. Unfortunately we still have a race condition with cleaning up if a backend dies and can't run it'sown cleanup, though I think that anytime that happens we're going to restart everything anyway. 3: It separates all the temporary stuff away from real files. The only downside I see is some extra code to create the backend_pid directory. -- Jim C. Nasby, Database Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net
On Mon, May 17, 2010 at 2:10 PM, Jim Nasby <decibel@decibel.org> wrote: >> It seems prett clear that it isn't desirable to simply add backend ID >> to RelFileNode, because there are too many places using RelFileNode >> already for purposes where the backend ID can be inferred from >> context, such as buffer headers and most of xlog. Instead, I >> introduced BackendRelFileNode, which consists of an ordinary >> RelFileNode augmented with a backend ID, and use that only where >> needed. In particular, the smgr layer must use BackendRelFileNode >> throughout, since it operates on both permanent and temporary >> relations. smgr invalidations must also include the backend ID. xlog >> generally happens only for non-temporary relations and can thus >> continue to use an ordinary RelFileNode; however, commit/abort records >> must use BackendRelFileNode as there may be physical storage >> associated with temporary relations that must be unlinked. >> Communication with the bgwriter must use BackendRelFileNode for >> similar reasons. The relcache now stores rd_backend rather than >> rd_islocaltemp so that it remains straightforward to call smgropen() >> based on a relcache entry. Some smgr functions no longer require an >> isTemp argument, because they can infer the necessary information from >> their BackendRelFileNode. smgrwrite() and smgrextend() now take a >> skipFsync argument rather than an isTemp argument. >> >> I'm not totally sure whether it makes sense to do what we were talking >> about above, viz, transfer unlink responsibility for temp rels from >> the bgwriter to the owning backend. I haven't done that here. Nor >> have I implemented any kind of improved temporary file cleanup >> strategy, though I hope such a thing is possible. > > Any particular reason not to use directories to help organize things? IE: > > base/database_oid/pg_temp_rels/backend_pid/relfilenode > > Perhaps relfilenode should be something else. > > This seems to have several advantages: > > 1: It's more organized. If you want to see all the files for a single backend you have just one place to look. Findingeverything is still easy via filesystem find. > 2: Cleanup becomes easier. When a backend exits, it's entire directory goes away. On server start, everything under pg_temp_relsgoes away. Unfortunately we still have a race condition with cleaning up if a backend dies and can't run it'sown cleanup, though I think that anytime that happens we're going to restart everything anyway. > 3: It separates all the temporary stuff away from real files. > > The only downside I see is some extra code to create the backend_pid directory. I like the idea of using directories to organize things better and I completely agree with points #1 and #3. Point #2 is a little more complicated, I think, and something I've been struggling with. We need to make sure that we clean up not only the temporary files but also the catalog entries that point to them, if any. The current code only blows away temporary tables that are "old" in terms of transaction IDs, is driven off the catalog entries, and essentially does DROP TABLE <whatever>. So it can't clean up orphaned temporary files that don't have any catalog entries associated with them (which is what we want to fix) but on the other hand whatever it does clean up is cleaned up completely. We might be able to do something like this: 1. Scan pg_temp_rels. For each subdirectory found (whose name looks like a backend ID), if the corresponding backend is not currently running, add the backend ID to a list of backend IDs needing cleanup. 2. For each backend ID derived in step 1: 2A. Scan the subdirectory and add all the files you find (whose names are in the right format) to a list of files to be deleted. 2B. Check again whether the backend in question is running. If it is, don't do anything further for this backend and go on to the next backend ID (i.e. continue with step 2). 2C. For each file found in step 2A, look for a pg_class entry in the temp tablespace for that backend ID with a matching relfilenode number. If one is found, drop the rel. If not, unlink the file if it still exists. 2D. Attempt to remove the directory, ignoring failures. I think step 2B is sufficient to prevent a race condition where we end up mistaking a newly created file for an orphaned one. Am I right? One possible problem with this is that it would need to be repeated for every database/tablespace combination, but maybe that wouldn't be too expensive. autovacuum already needs to process every database, but I don't know how you'd decide how often to check for stray temp files. Certainly you'd want to check after a restart... after that I get fuzzy. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On Tue, May 4, 2010 at 5:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > FWIW, that's not the case, anymore than it is for blocks in shared > buffer cache for regular rels. smgrextend() results in an observable > extension of the file EOF immediately, whether or not you can see > up-to-date data for those pages. > > Now people have often complained about the extra I/O involved in that, > and it'd be nice to have a solution, but it's not clear to me that > fixing it would be harder for temprels than regular rels. For systems that have it and filesystems that optimize it I think posix_fallocate() handles this case. We can extend files without actually doing any i/o but we get the guarantee from the filesystem that it has the space available and writing to those blocks won't fail due to lack of space. Only meta-data i/o is triggered allocating the blocks and marking them as virtually filled with nulls and it's not synced unless there's an fsync so there's no extra physical i/o. This should be the case for ext4 but I'm not sure what other filesystems implement this. -- greg
On Mon, May 17, 2010 at 2:10 PM, Jim Nasby <decibel@decibel.org> wrote: > Any particular reason not to use directories to help organize things? IE: > > base/database_oid/pg_temp_rels/backend_pid/relfilenode > > Perhaps relfilenode should be something else. > > This seems to have several advantages: > > 1: It's more organized. If you want to see all the files for a single backend you have just one place to look. Findingeverything is still easy via filesystem find. > 2: Cleanup becomes easier. When a backend exits, it's entire directory goes away. On server start, everything under pg_temp_relsgoes away. Unfortunately we still have a race condition with cleaning up if a backend dies and can't run it'sown cleanup, though I think that anytime that happens we're going to restart everything anyway. > 3: It separates all the temporary stuff away from real files. > > The only downside I see is some extra code to create the backend_pid directory. I thought this was a good idea when you first proposed it, but on further review I've changed my mind. There are several places in the code that rely on checking whether the database directory within any given tablespace is empty to determine whether that database is using that tablespace. While I could rewrite all of that logic to do the right thing, I think it's unnecessary pain. I talked with Tom Lane about this a little bit at PGcon and opined that we probably only need to clean out stray temporary files at startup. So what I'm tempted to do is just write a function that goes through all tablespace/database combinations and scans each directory for files with a name like t<digits>_<digits> and blows them away. This will leave the catalog entries pointing at nothing, but we already have working code in autovacuum.c to clean up the catalog entries, and I believe that will work just fine even if the underlying files have been removed earlier. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company