Thread: including backend ID in relpath of temp rels - updated patch
For previous discussion of this topic, see:

http://archives.postgresql.org/pgsql-hackers/2010-04/msg01181.php
http://archives.postgresql.org/pgsql-hackers/2010-05/msg00352.php
http://archives.postgresql.org/pgsql-hackers/2010-06/msg00302.php

As in the original version of the patch, I have not simply added backend ID to RelFileNode, because there are too many places using RelFileNode in contexts where the backend ID can be determined from context, such as the shared and local buffer managers and the xlog code. Instead, I have introduced BackendRelFileNode for contexts where we need both the RelFileNode and the backend ID. The smgr layer has to use BackendRelFileNode across the board, since it operates on both permanent and temporary relations, including - potentially - temporary relations of other backends. smgr invalidations must also include the backend ID, as must communication between regular backends and the bgwriter. The relcache now stores rd_backend instead of rd_islocaltemp so that it remains straightforward to call smgropen() based on a relcache entry. Some smgr functions no longer require an isTemp argument, because they can infer the necessary information from their BackendRelFileNode. smgrwrite() and smgrextend() now take a skipFsync argument rather than an isTemp argument.

In this version of the patch, I've improved the temporary file cleanup mechanism. In CVS HEAD, when a transaction commits or aborts, we write an XLOG record with all relations that must be unlinked, including temporary relations. With this patch, we no longer include temporary relations in the XLOG records (which may be a tiny performance benefit for some people); instead, on every startup of the database cluster, we just nuke all temporary relation files (which is now easy to do, since they are named differently than files for non-temporary relations) at the same time that we nuke other temp files. This produces slightly different behavior. In CVS HEAD, temporary files get removed whenever an xlog redo happens, so either at cluster start or after a backend crash, but only to the extent that they appear in WAL. With this patch, we can be sure of removing ALL stray files, which is maybe ever-so-slightly leaky in CVS HEAD. But on the other hand, because it hooks into the existing temporary file cleanup code, it only happens at cluster startup, NOT after a backend crash. The existing coding leaves temporary sort files and similar around after a backend crash for forensic purposes. That might or might not be worth rethinking for non-debug builds, but I don't think there's any very good reason to think that temporary relation files need to be handled differently than temporary work files.

I believe that this patch will clear away one major obstacle to implementing global temporary and unlogged tables: it enables us to be sure of cleaning up properly after a crash without relying on catalog entries or XLOG. Based on previous discussions, however, I believe there is support for making this change independently of what becomes of that project, just for the benefit of having a more robust cleanup mechanism.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Attachment
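To make the preceding description concrete, here is a rough sketch of the two identifiers involved. The RelFileNode fields are the existing ones; the wrapper's field names are inferred from the description (and from the SmgrIsTemp macro quoted later in the thread) rather than copied from the patch, so treat this as an approximation:

typedef int BackendId;              /* backend process sequence number */
#define InvalidBackendId    (-1)

typedef struct RelFileNode
{
    Oid         spcNode;            /* tablespace OID */
    Oid         dbNode;             /* database OID */
    Oid         relNode;            /* relation file node number */
} RelFileNode;

/*
 * Sketch of the wrapper described above: a RelFileNode plus the owning
 * backend.  For permanent relations, backend is InvalidBackendId; for a
 * temporary relation it is the ID of the owning session, which is what
 * lets smgr and the path-construction code treat temp files specially.
 */
typedef struct BackendRelFileNode
{
    RelFileNode node;
    BackendId   backend;
} BackendRelFileNode;

With the backend ID carried alongside the RelFileNode, a purely file-level test for "is this a temp relation" becomes possible, which is what the startup-time cleanup relies on.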
On Thu, Jun 10, 2010 at 4:39 PM, Robert Haas <robertmhaas@gmail.com> wrote: > For previous discussion of this topic, see: > > http://archives.postgresql.org/pgsql-hackers/2010-04/msg01181.php > http://archives.postgresql.org/pgsql-hackers/2010-05/msg00352.php > http://archives.postgresql.org/pgsql-hackers/2010-06/msg00302.php > > As in the original version of the patch, I have not simply added > backend ID to RelFileNode, because there are too many places using > RelFileNode in contexts where the backend ID can be determined from > context, such as the shared and local buffer managers and the xlog > code. Instead, I have introduced BackendRelFileNode for contexts > where we need both the RelFileNode and the backend ID. The smgr layer > has to use BackendRelFileNode across the board, since it operates on > both permanent and temporary relation, including - potentially - > temporary relations of other backends. smgr invalidations must also > include the backend ID, as must communication between regular backends > and the bgwriter. The relcache now stores rd_backend instead of > rd_islocaltemp so that it remains straightforward to call smgropen() > based on a relcache entry. Some smgr functions no longer require an > isTemp argument, because they can infer the necessary information from > their BackendRelFileNode. smgrwrite() and smgrextend() now take a > skipFsync argument rather than an isTemp argument. > > In this version of the patch, I've improved the temporary file cleanup > mechanism. In CVS HEAD, when a transaction commits or aborts, we > write an XLOG record with all relations that must be unlinked, > including temporary relations. With this patch, we no longer include > temporary relations in the XLOG records (which may be a tiny > performance benefit for some people); instead, on every startup of the > database cluster, we just nuke all temporary relation files (which is > now easy to do, since they are named differently than files for > non-temporary relations) at the same time that we nuke other temp > files. This produces slightly different behavior. In CVS HEAD, > temporary files get removed whenever an xlog redo happens, so either > at cluster start or after a backend crash, but only to the extent that > they appear in WAL. With this patch, we can be sure of removing ALL > stray files, which is maybe ever-so-slightly leaky in CVS HEAD. But > on the other hand, because it hooks into the existing temporary file > cleanup code, it only happens at cluster startup, NOT after a backend > crash. The existing coding leaves temporary sort files and similar > around after a backend crash for forensics purposes. That might or > might not be worth rethinking for non-debug builds, but I don't think > there's any very good reason to think that temporary relation files > need to be handled differently than temporary work files. > > I believe that this patch will clear away one major obstacle to > implementing global temporary and unlogged tables: it enables us to be > sure of cleaning up properly after a crash without relying on catalog > entries or XLOG. Based on previous discussions, however, I believe > there is support for making this change independently of what becomes > of that project, just for the benefit of having a more robust cleanup > mechanism. Updated patch to remove minor bitrot. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Attachment
On Thu, Jun 10, 2010 at 3:39 PM, Robert Haas <robertmhaas@gmail.com> wrote: > > > I believe that this patch will clear away one major obstacle to > implementing global temporary and unlogged tables: it enables us to be > sure of cleaning up properly after a crash without relying on catalog > entries or XLOG. Based on previous discussions, however, I believe > there is support for making this change independently of what becomes > of that project, just for the benefit of having a more robust cleanup > mechanism. > Hi, i was looking at v3 of this patch... it looks good, works as advertised, passes regression tests, and all the tests i could think of... i haven't looked at the code in much detail but it seems ok, to me at least... -- Jaime Casanova www.2ndQuadrant.com Soporte y capacitación de PostgreSQL
On Fri, Jul 23, 2010 at 10:05 AM, Jaime Casanova <jaime@2ndquadrant.com> wrote: > On Thu, Jun 10, 2010 at 3:39 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> >> >> I believe that this patch will clear away one major obstacle to >> implementing global temporary and unlogged tables: it enables us to be >> sure of cleaning up properly after a crash without relying on catalog >> entries or XLOG. Based on previous discussions, however, I believe >> there is support for making this change independently of what becomes >> of that project, just for the benefit of having a more robust cleanup >> mechanism. >> > > Hi, > > i was looking at v3 of this patch... >

Ok, i like what you did in smgrextend, smgrwrite, and others... changing isTemp to skipFsync is more descriptive, but i have a few questions; maybe what you did is right, i only want to understand it:

- you added this in include/storage/smgr.h, so why is it safe to assume that if the backend != InvalidBackendId it must be a temp relation?

+#define SmgrIsTemp(smgr) \
+    ((smgr)->smgr_rnode.backend != InvalidBackendId)

- you added a check like "if (rel->rd_backend == MyBackendId)" in a few places... why are you assuming that? couldn't that be a newly created relation (in the current session, of course)? is that safe?

-- Jaime Casanova www.2ndQuadrant.com Soporte y capacitación de PostgreSQL
On Sun, Jul 25, 2010 at 2:37 AM, Jaime Casanova <jaime@2ndquadrant.com> wrote: > but i have a few questions, maybe is right what you did i only want to > understand it: > - you added this in include/storage/smgr.h, so why is safe to assume > that if the backend != InvalidBackendId it must be a temp relation? > > +#define SmgrIsTemp(smgr) \ > + ((smgr)->smgr_rnode.backend != InvalidBackendId) That's pretty much the whole point of the patch. Instead of identifying relations as simply "temporary" or "not temporary", we identify as "a temporary relation owned by backend X" or as "not temporary". > - you added a question like this "if (rel->rd_backend == MyBackendId)" > in a few places... why are you assuming that? that couldn't be a new > created relation (in current session of course)? is that safe? Again, rd_backend is not the creating backend ID unless the relation is a temprel. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
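To spell out the rd_backend convention being described here, a tiny illustrative helper (hypothetical; it is not part of the patch, just the rule above restated in code form, assuming the usual PostgreSQL headers are in scope):

/*
 * rd_backend is InvalidBackendId for permanent relations and the owning
 * backend's ID for temporary relations, so this test is true only for a
 * temp relation created by the current session.
 */
static bool
RelationIsMyTempRel(Relation rel)
{
    return rel->rd_backend == MyBackendId;
}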
On Sun, Jul 25, 2010 at 4:32 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Sun, Jul 25, 2010 at 2:37 AM, Jaime Casanova <jaime@2ndquadrant.com> wrote: >> but i have a few questions, maybe is right what you did i only want to >> understand it: >> - you added this in include/storage/smgr.h, so why is safe to assume >> that if the backend != InvalidBackendId it must be a temp relation? >> >> +#define SmgrIsTemp(smgr) \ >> + ((smgr)->smgr_rnode.backend != InvalidBackendId) > > That's pretty much the whole point of the patch. Instead of > identifying relations as simply "temporary" or "not temporary", we > identify as "a temporary relation owned by backend X" or as "not > temporary". > Ok, this one seems good enough... i'm marking it as "ready for committer" -- Jaime Casanova www.2ndQuadrant.com Soporte y capacitación de PostgreSQL
Robert Haas <robertmhaas@gmail.com> writes: > [ BackendRelFileNode patch ] One thing that I find rather distressing about this is the 25% bloat in sizeof(SharedInvalidationMessage). Couldn't we avoid that? Is it really necessary to *ever* send an SI message for a backend-local rel? I agree that one needs to send relcache inval sometimes for temp rels, but I don't see why each backend couldn't interpret that as a flush on its own local version. regards, tom lane
On Thu, Aug 5, 2010 at 7:05 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> [ BackendRelFileNode patch ] > > One thing that I find rather distressing about this is the 25% bloat > in sizeof(SharedInvalidationMessage). Couldn't we avoid that? Is it > really necessary to *ever* send an SI message for a backend-local rel? It can be dropped or unlinked by another backend, so, yes. > I agree that one needs to send relcache inval sometimes for temp rels, > but I don't see why each backend couldn't interpret that as a flush > on its own local version. The problem is that receipt of the inval message results in a call to smgrclosenode(), which has to look up the (Backend)RelFileNode in SMgrRelationHash. If the backend ID isn't present, we can't search the hash table efficiently. This is not only a problem for smgrclosenode(), either; that hash is used in a bunch of places, so changing the hash key isn't really an option. One possibility would be to have two separate shared-invalidation message types for smgr. One would indicate that a non-temporary relation was flushed, so the backend ID field is known to be -1 and we can search the hash quickly. The other would indicate that a temporary relation was flushed and you'd have to scan the whole hash table to find and flush all matching entries. That still doesn't sound great, though. Another thought I had was that if we could count on the backend ID to fit into an int16 we could fit it in to what's currently padding space. That would require rather dramatically lowering the maximum number of backends (currently INT_MAX/4), but it's a little hard to imagine that we can really support more than 32,767 simultaneous backends anyway. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Robert Haas <robertmhaas@gmail.com> writes: > On Thu, Aug 5, 2010 at 7:05 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> One thing that I find rather distressing about this is the 25% bloat >> in sizeof(SharedInvalidationMessage). Couldn't we avoid that? Is it >> really necessary to *ever* send an SI message for a backend-local rel? > It can be dropped or unlinked by another backend, so, yes. Really? Surely that should be illegal during normal operation. We might be doing such during crash recovery, but we don't need to broadcast sinval messages then. It might be sufficient to consider that there are "local" and "global" smgr inval messages, where the former never get out of the generating backend, so a bool is enough in the message struct. > had was that if we could count on the backend ID to fit into an int16 > we could fit it in to what's currently padding space. That would > require rather dramatically lowering the maximum number of backends > (currently INT_MAX/4), but it's a little hard to imagine that we can > really support more than 32,767 simultaneous backends anyway. Yeah, that occurred to me too. A further thought is that the id field could probably be reduced to 1 byte, leaving 3 for backendid, which would certainly be plenty. However representing that in a portable struct declaration would be a bit painful I fear. regards, tom lane
On Fri, Aug 6, 2010 at 1:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Thu, Aug 5, 2010 at 7:05 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> One thing that I find rather distressing about this is the 25% bloat >>> in sizeof(SharedInvalidationMessage). Couldn't we avoid that? Is it >>> really necessary to *ever* send an SI message for a backend-local rel? > >> It can be dropped or unlinked by another backend, so, yes. > > Really? Surely that should be illegal during normal operation. We > might be doing such during crash recovery, but we don't need to > broadcast sinval messages then. autovacuum.c does it when we start to worry about XID wraparound, but you can also do it from any normal backend. Just "DROP TABLE pg_temp_2.foo" or whatever and away you go. > It might be sufficient to consider that there are "local" and "global" > smgr inval messages, where the former never get out of the generating > backend, so a bool is enough in the message struct. It would be nice to be able to do it that way, but I don't believe it's the case, per the above. >> had was that if we could count on the backend ID to fit into an int16 >> we could fit it in to what's currently padding space. That would >> require rather dramatically lowering the maximum number of backends >> (currently INT_MAX/4), but it's a little hard to imagine that we can >> really support more than 32,767 simultaneous backends anyway. > > Yeah, that occurred to me too. A further thought is that the id field > could probably be reduced to 1 byte, leaving 3 for backendid, which > would certainly be plenty. However representing that in a portable > struct declaration would be a bit painful I fear. Well, presumably we'd just represent it as a 1-byte field followed by a 2-byte field, and do a bit of math. But I don't really see the point. The whole architecture of a shared invalidation queue is fundamentally non-scalable because it's a broadcast medium. If we wanted to efficiently support even thousands of backends (let alone tens or hundreds of thousands) I assume we would need to rearchitect this completely with more fine-grained queues, and have backends subscribe to the queues pertaining to the objects they want to access before touching them. Or maybe something else entirely. But I don't think broadcasting to 30,000 backends is going to work for the same reason that plugging 30,000 machines into an Ethernet *hub* doesn't work. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Robert Haas <robertmhaas@gmail.com> writes: > On Fri, Aug 6, 2010 at 1:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Really? Surely that should be illegal during normal operation. We >> might be doing such during crash recovery, but we don't need to >> broadcast sinval messages then. > autovacuum.c does it when we start to worry about XID wraparound, but > you can also do it from any normal backend. Just "DROP TABLE > pg_temp_2.foo" or whatever and away you go. Mph. I'm still not convinced that the sinval message needs to carry backendid. But maybe the first-cut solution should just be to squeeze the id into the padding area. You should be able to get up to 65535 allowed backends, not 32k --- or perhaps use different message type IDs for local and global backendid, so that all 65536 bitpatterns are allowed for a non-global backendid. > Well, presumably we'd just represent it as a 1-byte field followed by > a 2-byte field, and do a bit of math. But I don't really see the > point. The whole architecture of a shared invalidation queue is > fundamentally non-scalable because it's a broadcast medium. Sure, it tops out somewhere, but 32K is way too close to configurations we know work well enough in the field (I've seen multiple reports of people using a couple thousand backends). In any case, sinval readers don't block each other in the current implementation, so I'm really dubious that there's any inherent scalability limitation there. I'll hold still for 64K but I think it might be better to go for 2^24. regards, tom lane
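To illustrate the packing idea under discussion (a sketch only: the struct and field names are invented to show how a 1-byte id plus a 3-byte backend ID could keep the message at its present size, not necessarily what the patch ends up using):

/*
 * Pre-patch, an smgr inval message is 16 bytes: an id field, padding, and
 * a RelFileNode (3 OIDs).  Shrinking the id to one byte frees three bytes
 * for the backend ID without growing sizeof(SharedInvalidationMessage).
 */
typedef struct SmgrInvalMsgSketch
{
    int8        id;             /* message type, reduced to 1 byte */
    int8        backend_hi;     /* high-order bits of backend ID */
    uint16      backend_lo;     /* low-order bits of backend ID */
    RelFileNode rnode;          /* spcNode, dbNode, relNode: 12 bytes */
} SmgrInvalMsgSketch;

/* the receiver would reassemble the ID roughly as:
 *   backend = ((BackendId) msg->backend_hi << 16) | msg->backend_lo;
 */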
On Fri, Aug 6, 2010 at 12:50 PM, Robert Haas <robertmhaas@gmail.com> wrote: > > Just "DROP TABLE > pg_temp_2.foo" or whatever and away you go. > wow! that's true... and certainly a bug... we shouldn't allow any session to drop other session's temp tables, or is there a reason for this misbehavior? -- Jaime Casanova www.2ndQuadrant.com Soporte y capacitación de PostgreSQL
Jaime Casanova <jaime@2ndquadrant.com> writes: > On Fri, Aug 6, 2010 at 12:50 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> Just "DROP TABLE pg_temp_2.foo" or whatever and away you go. > wow! that's true... and certainly a bug... No, it's not a bug. You'll find only superusers can do it. > we shouldn't allow any session to drop other session's temp tables, or > is there a reason for this misbehavior? It is intentional, though I'd be willing to give it up if we had more bulletproof crash-cleanup of temp tables --- which is one of the things this patch is supposed to result in. regards, tom lane
On Fri, Aug 6, 2010 at 2:07 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Fri, Aug 6, 2010 at 1:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> Really? Surely that should be illegal during normal operation. We >>> might be doing such during crash recovery, but we don't need to >>> broadcast sinval messages then. > >> autovacuum.c does it when we start to worry about XID wraparound, but >> you can also do it from any normal backend. Just "DROP TABLE >> pg_temp_2.foo" or whatever and away you go. > > Mph. I'm still not convinced that the sinval message needs to carry > backendid. Hey, if you have an idea... I'd love to get rid of it, too, but I don't see how to do it. > But maybe the first-cut solution should just be to squeeze > the id into the padding area. You should be able to get up to 65535 > allowed backends, not 32k --- or perhaps use different message type IDs > for local and global backendid, so that all 65536 bitpatterns are > allowed for a non-global backendid. > >> Well, presumably we'd just represent it as a 1-byte field followed by >> a 2-byte field, and do a bit of math. But I don't really see the >> point. The whole architecture of a shared invalidation queue is >> fundamentally non-scalable because it's a broadcast medium. > > Sure, it tops out somewhere, but 32K is way too close to configurations > we know work well enough in the field (I've seen multiple reports of > people using a couple thousand backends). In any case, sinval readers > don't block each other in the current implementation, so I'm really > dubious that there's any inherent scalability limitation there. I'll > hold still for 64K but I think it might be better to go for 2^24. Well, I wouldn't expect anyone to use an exclusive lock for readers without a good reason, but you still have n backends that each have to read, presumably, about O(n) messages, so eventually that's going to start to pinch. Do you think it's worth worrying about the reduction in the number of possible SI message types? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On Fri, Aug 6, 2010 at 2:16 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Jaime Casanova <jaime@2ndquadrant.com> writes: >> On Fri, Aug 6, 2010 at 12:50 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>> Just "DROP TABLE pg_temp_2.foo" or whatever and away you go. > >> wow! that's true... and certainly a bug... > > No, it's not a bug. You'll find only superusers can do it. > >> we shouldn't allow any session to drop other session's temp tables, or >> is there a reason for this misbehavior? > > It is intentional, though I'd be willing to give it up if we had more > bulletproof crash-cleanup of temp tables --- which is one of the things > this patch is supposed to result in. This patch only directly addresses the issue of cleaning up the storage, so there are still the catalog entries to worry about. But it doesn't seem impossible to think about building on this approach to eventually handle that part of the problem better, too. I haven't thought too much about what that would look like, but I think it could be done. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Robert Haas <robertmhaas@gmail.com> writes: > On Fri, Aug 6, 2010 at 2:07 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Sure, it tops out somewhere, but 32K is way too close to configurations >> we know work well enough in the field (I've seen multiple reports of >> people using a couple thousand backends). > Well, I wouldn't expect anyone to use an exclusive lock for readers > without a good reason, but you still have n backends that each have to > read, presumably, about O(n) messages, so eventually that's going to > start to pinch. Sure, but I don't see much to be gained from multiple queues either. There are few (order of zero, in fact) cases where sinval messages are transmitted that aren't of potential interest to all backends. Maybe you could do something useful with a very large number of dynamically-defined queues (like one per relation) ... but managing that would probably swamp any savings. > Do you think it's worth worrying about the reduction in the number of > possible SI message types? IIRC the number of message types is the number of catalog caches plus half a dozen or so. We're a long way from exhausting even a 1-byte ID field; and we could play more games if we had to, since there would be a padding byte free in the message types that refer to a catalog cache. IOW, 1-byte id doesn't bother me. regards, tom lane
Robert Haas <robertmhaas@gmail.com> writes: > This patch only directly addresses the issue of cleaning up the > storage, so there are still the catalog entries to worry about. But > it doesn't seem impossible to think about building on this approach to > eventually handle that part of the problem better, too. I haven't > thought too much about what that would look like, but I think it could > be done. Perhaps run through pg_class after restart and flush anything marked relistemp? Although the ideal solution, probably, would be for temp tables to not have persistent catalog entries in the first place. regards, tom lane
On Fri, Aug 06, 2010 at 03:00:35PM -0400, Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > This patch only directly addresses the issue of cleaning up the > > storage, so there are still the catalog entries to worry about. > > But it doesn't seem impossible to think about building on this > > approach to eventually handle that part of the problem better, > > too. I haven't thought too much about what that would look like, > > but I think it could be done. > > Perhaps run through pg_class after restart and flush anything marked > relistemp? Although the ideal solution, probably, would be for temp > tables to not have persistent catalog entries in the first place. For the upcoming global temp tables, which are visible in other sessions, how would this work? Cheers, David. -- David Fetter <david@fetter.org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fetter@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
David Fetter <david@fetter.org> writes: > On Fri, Aug 06, 2010 at 03:00:35PM -0400, Tom Lane wrote: >> Perhaps run through pg_class after restart and flush anything marked >> relistemp? Although the ideal solution, probably, would be for temp >> tables to not have persistent catalog entries in the first place. > For the upcoming global temp tables, which are visible in other > sessions, how would this work? Well, that's a totally different matter. Those would certainly have persistent entries, at least for the "master" version. I don't think anyone's really figured out what the best implementation looks like; but maybe there would be per-backend "child" versions that could act just like the current kind of temp table (and again would ideally not have any persistent catalog entries). regards, tom lane
On Fri, Aug 6, 2010 at 3:00 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> This patch only directly addresses the issue of cleaning up the >> storage, so there are still the catalog entries to worry about. But >> it doesn't seem impossible to think about building on this approach to >> eventually handle that part of the problem better, too. I haven't >> thought too much about what that would look like, but I think it could >> be done. > > Perhaps run through pg_class after restart and flush anything marked > relistemp? The trouble is that you have to bind to a database before you can run through pg_class, and the postmaster doesn't. Of course, if it could attach to a database and then detach again, this might be feasible, although perhaps still a bit overly complex for the postmaster, but in any event it doesn't. We seem to already have a mechanism in place where a backend that creates a temprel nukes any pre-existing temprel schema for its own backend, but of course that's not fool-proof. > Although the ideal solution, probably, would be for temp > tables to not have persistent catalog entries in the first place. I've been thinking about that, but it's a bit challenging to imagine how it could work. It's not just the pg_class entries you have to think about, but also pg_attrdef, pg_attribute, pg_constraint, pg_description, pg_index, pg_rewrite, and pg_trigger. An even stickier problem is that we have lots of places in the backend code that refer to objects by OID. We'd either need to change all of that code (which seems like a non-starter) or somehow guarantee that the OIDs assigned to any given backend's private objects would be different from those assigned to any public object (which I also don't see how to do). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On Fri, Aug 6, 2010 at 2:43 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Fri, Aug 6, 2010 at 2:07 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> Sure, it tops out somewhere, but 32K is way too close to configurations >>> we know work well enough in the field (I've seen multiple reports of >>> people using a couple thousand backends). > >> Well, I wouldn't expect anyone to use an exclusive lock for readers >> without a good reason, but you still have n backends that each have to >> read, presumably, about O(n) messages, so eventually that's going to >> start to pinch. > > Sure, but I don't see much to be gained from multiple queues either. > There are few (order of zero, in fact) cases where sinval messages > are transmitted that aren't of potential interest to all backends. > Maybe you could do something useful with a very large number of > dynamically-defined queues (like one per relation) ... but managing that > would probably swamp any savings. Well, what I was thinking is that if you could guarantee that a certain backend COULDN'T have a particular relfilenode open at the smgr level, for example, then it needn't read the invalidation messages for that relfilenode. Precisely how to slice that up is another matter. For the present case, for instance, you could create one queue per backend. In the normal course of events, each backend would subscribe only to its own queue, but if one backend wanted to drop a temporary relation belonging to some other backend, it would temporarily subscribe to that backend's queue, do whatever it needed to do, and then flush all the smgr references before unsubscribing from the queue. That's a bit silly because we doubtless wouldn't invent such a complicated mechanism just for this case, but I think it illustrates the kind of thing that one could do. Or if you wanted to optimize for the case of a large number of databases running in a single cluster, perhaps you'd want one queue per database plus a shared queue for the shared catalogs. I don't know. This is just pie in the sky. >> Do you think it's worth worrying about the reduction in the number of >> possible SI message types? > > IIRC the number of message types is the number of catalog caches plus > half a dozen or so. We're a long way from exhausting even a 1-byte > ID field; and we could play more games if we had to, since there would > be a padding byte free in the message types that refer to a catalog > cache. IOW, 1-byte id doesn't bother me. OK. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Excerpts from Robert Haas's message of vie ago 06 15:32:21 -0400 2010: > > Perhaps run through pg_class after restart and flush anything marked > > relistemp? > > The trouble is that you have to bind to a database before you can run > through pg_class, and the postmaster doesn't. Of course, if it could > attach to a database and then detach again, this might be feasible, > although perhaps still a bit overly complex for the postmaster, but in > any event it doesn't. A simpler idea would be to run a process specifically to connect to the database, scan pg_class there, and then die. It sounds a tad expensive though. > I've been thinking about that, but it's a bit challenging to imagine > how it could work. It's not just the pg_class entries you have to > think about, but also pg_attrdef, pg_attribute, pg_constraint, > pg_description, pg_index, pg_rewrite, and pg_trigger. An even > stickier problem is that we have lots of places in the backend code > that refer to objects by OID. We'd either need to change all of that > code (which seems like a non-starter) or somehow guarantee that the > OIDs assigned to any given backend's private objects would be > different from those assigned to any public object (which I also don't > see how to do). Maybe we could reserve one of the 32 bits of OID to indicate private-ness. -- Álvaro Herrera <alvherre@commandprompt.com> The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Fri, Aug 6, 2010 at 3:47 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>> Do you think it's worth worrying about the reduction in the number of >>> possible SI message types? >> >> IIRC the number of message types is the number of catalog caches plus >> half a dozen or so. We're a long way from exhausting even a 1-byte >> ID field; and we could play more games if we had to, since there would >> be a padding byte free in the message types that refer to a catalog >> cache. IOW, 1-byte id doesn't bother me. > > OK. Here's an updated patch, with the invalidation changes merged in and hopefully-suitable adjustments elsewhere. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Attachment
Robert Haas <robertmhaas@gmail.com> writes:
> Here's an updated patch, with the invalidation changes merged in and
> hopefully-suitable adjustments elsewhere.

I haven't tested this patch, but I read through it (and have I mentioned how unbelievably illegible -u format patches are?). It seems to be close to committable, but I found a few issues. In no particular order:

storage.sgml needs to be updated to describe this file-naming scheme.

BackendRelFileNode isn't a particularly nice struct name; it seems like it puts the most important aspect of the struct's purpose somewhere in the middle of the name. Maybe RelFileNodeBackend would be better, or RelFileNodeFull, or something like that.

In GetNewRelFileNode, you've changed some code that is commented /* This logic should match RelationInitPhysicalAddr */, but there is not any corresponding change in RelationInitPhysicalAddr. I find this bothersome. I think the problem is that you've made it look like the assignment of rnode.backend is part of the logic that ought to match RelationInitPhysicalAddr. Perhaps set that off, at least by a blank line, if not its own comment.

The relpath() and relpathperm() macros are a tad underdocumented for my taste; at least put comments noting that one takes a RelFileNode and the other a BackendRelFileNode. And I wonder if you couldn't find better names for relpathperm and relpathbackend.

assign_maxconnections and assign_autovacuum_max_workers need to be fixed for MAX_BACKENDS; in fact I think all the occurrences of INT_MAX / 4 in guc.c need to be replaced. (And if I were you I'd grep to see if anyplace outside guc.c knows that value...) Also, as a matter of style, the comment currently attached to max_connections needs to be attached to the definition of the constant, not just modified where it is.

As an old bit-shaver this sorta bothers me:

+++ b/src/include/utils/rel.h
@@ -127,7 +127,7 @@ typedef struct RelationData
     struct SMgrRelationData *rd_smgr;  /* cached file handle, or NULL */
     int         rd_refcnt;      /* reference count */
     bool        rd_istemp;      /* rel is a temporary relation */
-    bool        rd_islocaltemp; /* rel is a temp rel of this session */
+    BackendId   rd_backend;     /* owning backend id, if temporary relation */
     bool        rd_isnailed;    /* rel is nailed in cache */
     bool        rd_isvalid;     /* relcache entry is valid */
     char        rd_indexvalid;  /* state of rd_indexlist: 0 = not valid, 1 =

The struct is going to be less efficiently packed with that field order. Maybe swap rd_istemp and rd_backend?

+       Assert(relation->rd_backend != InvalidOid);

ought to be InvalidBackendId, no? Both new calls of GetTempNamespaceBackendId have this wrong. You might also want to change the comment of GetTempNamespaceBackendId to note that its failure result is not just -1 but InvalidBackendId.

Lastly, it bothers me that you've put in code to delete files belonging to temp rels during crash restart, without any code to clean up their catalog entries. This will therefore lead to dangling pg_class references, with uncertain but probably not very nice consequences. I think that probably shouldn't go in until there's code to drop the catalog entries too.

regards, tom lane
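For readers without the patch at hand, the relationship between those path macros is roughly as follows; the signatures here are inferred from the review comments above, so treat them as an approximation rather than the patch's exact text:

/* relpathbackend(rnode, backend, forknum) builds the on-disk path string;
 * the two convenience macros differ only in what they accept: */
#define relpathperm(rnode, forknum) \
    relpathbackend((rnode), InvalidBackendId, (forknum))        /* plain RelFileNode */
#define relpath(brnode, forknum) \
    relpathbackend((brnode).node, (brnode).backend, (forknum))  /* BackendRelFileNode */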
On Thu, Aug 12, 2010 at 12:27:45PM -0400, Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > Here's an updated patch, with the invalidation changes merged in and > > hopefully-suitable adjustments elsewhere. > > I haven't tested this patch, but I read through it (and have I mentioned > how unbelievably illegible -u format patches are?). I have every confidence that you, of all people, can arrange to use 'filterdiff --format=context' on attached patches automatically, saving you some time and us some boredom :) Cheers, David. -- David Fetter <david@fetter.org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fetter@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
On Thu, 2010-08-12 at 09:44 -0700, David Fetter wrote: > On Thu, Aug 12, 2010 at 12:27:45PM -0400, Tom Lane wrote: > > Robert Haas <robertmhaas@gmail.com> writes: > > > Here's an updated patch, with the invalidation changes merged in and > > > hopefully-suitable adjustments elsewhere. > > > > I haven't tested this patch, but I read through it (and have I mentioned > > how unbelievably illegible -u format patches are?). > > I have every confidence that you, of all people, can arrange to use > 'filterdiff --format=context' on attached patches automatically, > saving you some time and us some boredom :) I was under the impression that the project guideline was that we only accepted context diffs? Joshua D. Drake -- PostgreSQL.org Major Contributor Command Prompt, Inc: http://www.commandprompt.com/ - 509.416.6579 Consulting, Training, Support, Custom Development, Engineering http://twitter.com/cmdpromptinc | http://identi.ca/commandprompt
On Thu, Aug 12, 2010 at 09:53:54AM -0700, Joshua D. Drake wrote: > On Thu, 2010-08-12 at 09:44 -0700, David Fetter wrote: > > On Thu, Aug 12, 2010 at 12:27:45PM -0400, Tom Lane wrote: > > > Robert Haas <robertmhaas@gmail.com> writes: > > > > Here's an updated patch, with the invalidation changes merged > > > > in and hopefully-suitable adjustments elsewhere. > > > > > > I haven't tested this patch, but I read through it (and have I > > > mentioned how unbelievably illegible -u format patches are?). > > > > I have every confidence that you, of all people, can arrange to > > use 'filterdiff --format=context' on attached patches > > automatically, saving you some time and us some boredom :) > > I was under the impression that the project guideline was that we > only accepted context diffs? Since they're trivially producible from unified diffs, this is a pretty silly reason to bounce--or even comment on--patches. It's less a guideline than a personal preference, namely Tom's. Cheers, David. -- David Fetter <david@fetter.org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fetter@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
Excerpts from David Fetter's message of jue ago 12 12:59:33 -0400 2010: > On Thu, Aug 12, 2010 at 09:53:54AM -0700, Joshua D. Drake wrote: > > I was under the impression that the project guideline was that we > > only accepted context diffs? > > Since they're trivially producible from unified diffs, this is a > pretty silly reason to bounce--or even comment on--patches. It's less > a guideline than a personal preference, namely Tom's. Not that I review that many patches, but I do dislike unified diffs too. -- Álvaro Herrera <alvherre@commandprompt.com> The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Thu, Aug 12, 2010 at 12:27 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> Here's an updated patch, with the invalidation changes merged in and >> hopefully-suitable adjustments elsewhere. > > I haven't tested this patch, but I read through it (and have I mentioned > how unbelievably illegible -u format patches are?). Sorry, I'll run it through filterdiff for you next time. As you know, I have the opposite preference, but since I was producing this primarily for you to read.... I can clean up the rest of this stuff, but not this: > Lastly, it bothers me that you've put in code to delete files belonging > to temp rels during crash restart, without any code to clean up their > catalog entries. This will therefore lead to dangling pg_class > references, with uncertain but probably not very nice consequences. > I think that probably shouldn't go in until there's code to drop the > catalog entries too. I thought about this pretty carefully, and I don't believe that there are any unpleasant consequences. The code that assigns relfilenode numbers is pretty careful to check that the newly assigned value is unused BOTH in pg_class and in the directory where the file will be created, so there should be no danger of a number getting used over again while the catalog entries remain. Also, the drop-object code doesn't mind that the physical storage doesn't exist; it's perfectly happy with that situation. It is true that you will get an ERROR if you try to SELECT from a table whose physical storage has been removed, but that seems OK. We have two existing mechanisms for removing the catalog entries: when a backend is first asked to access a temporary file, it does a DROP SCHEMA ... CASCADE on any pre-existing temp schema. And if a table is in wraparound trouble and the owning backend is no longer running, autovacuum will drop it. Improving on this seems difficult: if you wanted to *guarantee* that the catalog entries were removed before we started letting in connections, you'd need to fork a backend per database and have each one iterate through all the temp schemas and drop them. Considering that the existing code seems to have been pretty careful about how this stuff gets handled, I don't think it's worth making the whole startup sequence slower for it. What might be worth considering is changing the autovacuum policy to eliminate the wraparound check, and just have it drop temp table catalog entries for any backend not currently running, period. Thoughts? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Robert Haas <robertmhaas@gmail.com> writes: > On Thu, Aug 12, 2010 at 12:27 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Lastly, it bothers me that you've put in code to delete files belonging >> to temp rels during crash restart, without any code to clean up their >> catalog entries. This will therefore lead to dangling pg_class >> references, with uncertain but probably not very nice consequences. > I thought about this pretty carefully, and I don't believe that there > are any unpleasant consequences. The code that assigns relfilenode > numbers is pretty careful to check that the newly assigned value is > unused BOTH in pg_class and in the directory where the file will be > created, so there should be no danger of a number getting used over > again while the catalog entries remain. Also, the drop-object code > doesn't mind that the physical storage doesn't exist; it's perfectly > happy with that situation. Well, okay, but I'd suggest adding comments to the drop-table code pointing out that it is now NECESSARY for it to not complain if the file isn't there. This was never a design goal before, AFAIR --- the fact that it works like that is kind of accidental. I am also pretty sure that there used to be at least warning messages for that case, which we would now not want. regards, tom lane
On Thu, Aug 12, 2010 at 2:06 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Thu, Aug 12, 2010 at 12:27 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> Lastly, it bothers me that you've put in code to delete files belonging >>> to temp rels during crash restart, without any code to clean up their >>> catalog entries. This will therefore lead to dangling pg_class >>> references, with uncertain but probably not very nice consequences. > >> I thought about this pretty carefully, and I don't believe that there >> are any unpleasant consequences. The code that assigns relfilenode >> numbers is pretty careful to check that the newly assigned value is >> unused BOTH in pg_class and in the directory where the file will be >> created, so there should be no danger of a number getting used over >> again while the catalog entries remain. Also, the drop-object code >> doesn't mind that the physical storage doesn't exist; it's perfectly >> happy with that situation. > > Well, okay, but I'd suggest adding comments to the drop-table code > pointing out that it is now NECESSARY for it to not complain if the file > isn't there. This was never a design goal before, AFAIR --- the fact > that it works like that is kind of accidental. I am also pretty sure > that there used to be at least warning messages for that case, which we > would now not want. That seems like a good idea. I'll post an updated patch. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Excerpts from Robert Haas's message of jue ago 12 13:29:57 -0400 2010: > We have two existing mechanisms for removing the catalog entries: when > a backend is first asked to access a temporary file, it does a DROP > SCHEMA ... CASCADE on any pre-existing temp schema. And a table is in > wraparound trouble and the owning backend is no longer running, > autovacuum will drop it. Improving on this seems difficult: if you > wanted to *guarantee* that the catalog entries were removed before we > started letting in connections, you'd need to fork a backend per > database and have each one iterate through all the temp schemas and > drop them. Considering that the existing code seems to have been > pretty careful about how this stuff gets handled, I don't think it's > worth making the whole startup sequence slower for it. What might be > worth considering is changing the autovacuum policy to eliminate the > wraparound check, and just have it drop temp table catalog entries for > any backend not currently running, period. What about having autovacuum silently drop the catalog entry if it's a temp entry for which the underlying file does not exist? -- Álvaro Herrera <alvherre@commandprompt.com> The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Thu, Aug 12, 2010 at 5:29 PM, Alvaro Herrera <alvherre@commandprompt.com> wrote: > Excerpts from Robert Haas's message of jue ago 12 13:29:57 -0400 2010: > >> We have two existing mechanisms for removing the catalog entries: when >> a backend is first asked to access a temporary file, it does a DROP >> SCHEMA ... CASCADE on any pre-existing temp schema. And a table is in >> wraparound trouble and the owning backend is no longer running, >> autovacuum will drop it. Improving on this seems difficult: if you >> wanted to *guarantee* that the catalog entries were removed before we >> started letting in connections, you'd need to fork a backend per >> database and have each one iterate through all the temp schemas and >> drop them. Considering that the existing code seems to have been >> pretty careful about how this stuff gets handled, I don't think it's >> worth making the whole startup sequence slower for it. What might be >> worth considering is changing the autovacuum policy to eliminate the >> wraparound check, and just have it drop temp table catalog entries for >> any backend not currently running, period. > > What about having autovacuum silenty drop the catalog entry if it's a > temp entry for which the underlying file does not exist? I think that would be subject to race conditions. The current mechanism is actually pretty good, and I think we can build on it if we want to do more, rather than inventing something new. We just need to be specific about what problem we're trying to solve. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Robert Haas <robertmhaas@gmail.com> writes: > On Thu, Aug 12, 2010 at 5:29 PM, Alvaro Herrera > <alvherre@commandprompt.com> wrote: >> What about having autovacuum silenty drop the catalog entry if it's a >> temp entry for which the underlying file does not exist? > I think that would be subject to race conditions. Well, autovacuum's willingness to drop sufficiently old temp tables would already risk any such race conditions. However ... > The current > mechanism is actually pretty good, and I think we can build on it if > we want to do more, rather than inventing something new. We just need > to be specific about what problem we're trying to solve. ... I agree with this point. This idea wouldn't fix the concern I had about the existence of pg_class entries with no underlying file: if that causes any issues we'd have to fix them anyway. So I'm not sure what the gain is. regards, tom lane
One other thought about all this: in the past, one objection that's been raised to deleting files during crash restart is the possible loss of forensic evidence. It might be a good idea to provide some fairly simple way for developers to disable that deletion subroutine. I'm not sure that it rises to the level of needing a GUC, though. regards, tom lane
On Thu, Aug 12, 2010 at 6:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Thu, Aug 12, 2010 at 5:29 PM, Alvaro Herrera >> <alvherre@commandprompt.com> wrote: >>> What about having autovacuum silenty drop the catalog entry if it's a >>> temp entry for which the underlying file does not exist? > >> I think that would be subject to race conditions. > > Well, autovacuum's willingness to drop sufficiently old temp tables > would already risk any such race conditions. However ... I don't think we do. It only drops them if the backend isn't running. And even if the backend starts running after we check and before we drop it, that's OK. We're only dropping the OLD table, which by definition isn't relevant to the new session. Perhaps we should be taking a lock on the relation first, but I think that can only really bite us if the relation were dropped and a new relation were created with the same OID - in that case, we might drop an unrelated table, though it's vanishingly unlikely in practice. >> The current >> mechanism is actually pretty good, and I think we can build on it if >> we want to do more, rather than inventing something new. We just need >> to be specific about what problem we're trying to solve. > > ... I agree with this point. This idea wouldn't fix the concern I had > about the existence of pg_class entries with no underlying file: if that > causes any issues we'd have to fix them anyway. So I'm not sure what > the gain is. Right. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On Thu, Aug 12, 2010 at 7:04 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > One other thought about all this: in the past, one objection that's been > raised to deleting files during crash restart is the possible loss of > forensic evidence. It might be a good idea to provide some fairly > simple way for developers to disable that deletion subroutine. I'm not > sure that it rises to the level of needing a GUC, though. See http://archives.postgresql.org/pgsql-hackers/2010-06/msg00622.php : In this version of the patch, I've improved the temporary file cleanup mechanism. In CVS HEAD, when a transaction commits or aborts, we write an XLOG record with all relations that must be unlinked, including temporary relations. With this patch, we no longer include temporary relations in the XLOG records (which may be a tiny performance benefit for some people); instead, on every startup of the database cluster, we just nuke all temporary relation files (which is now easy to do, since they are named differently than files for non-temporary relations) at the same time that we nuke other temp files. This produces slightly different behavior. In CVS HEAD, temporary files get removed whenever an xlog redo happens, so either at cluster start or after a backend crash, but only to the extent that they appear in WAL. With this patch, we can be sure of removing ALL stray files, which is maybe ever-so-slightly leaky in CVS HEAD. But on the other hand, because it hooks into the existing temporary file cleanup code, it only happens at cluster startup, NOT after a backend crash. The existing coding leaves temporary sort files and similar around after a backend crash for forensics purposes. That might or might not be worth rethinking for non-debug builds, but I don't think there's any very good reason to think that temporary relation files need to be handled differently than temporary work files. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Robert Haas <robertmhaas@gmail.com> writes: > On Thu, Aug 12, 2010 at 7:04 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> One other thought about all this: in the past, one objection that's been >> raised to deleting files during crash restart is the possible loss of >> forensic evidence. > ... With this patch, we can be sure of removing ALL > stray files, which is maybe ever-so-slightly leaky in CVS HEAD. But > on the other hand, because it hooks into the existing temporary file > cleanup code, it only happens at cluster startup, NOT after a backend > crash. Doh. Thanks. regards, tom lane
On Thu, Aug 12, 2010 at 2:12 PM, Robert Haas <robertmhaas@gmail.com> wrote: > That seems like a good idea. I'll post an updated patch. Here is an updated patch. It's in context-diff format this time, but that made it go over 100K, so I gzip'd it because I can't remember what the attachment size limit is these days. I'm not sure whether that works out to a win or not. I think this addresses all previous review comments with the exception that I've been unable to devise a more clever name for relpathperm() and relpathbackend(). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Attachment
Robert Haas <robertmhaas@gmail.com> writes: > Here is an updated patch. It's in context-diff format this time, Thanks, I appreciate that ;-) This looks committable to me, with a couple of tiny suggestions: In the text added to storage.sgml, s/temporary relation/temporary relations/. Also, it'd be better if BBB and FFF were marked up as <replaceable> rather than <literal>, see examples elsewhere in that file. The comment for local_buffer_write_error_callback() probably meant to say "during local buffer writes". regards, tom lane
On Fri, Aug 13, 2010 at 12:49 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> Here is an updated patch. It's in context-diff format this time, > > Thanks, I appreciate that ;-) > > This looks committable to me, with a couple of tiny suggestions: Woo hoo! > In the text added to storage.sgml, s/temporary relation/temporary relations/. > Also, it'd be better if BBB and FFF were marked up as <replaceable> > rather than <literal>, see examples elsewhere in that file. I see. How should I mark tBBB_FFF? I actually didn't like that way of explaining it very much, but I couldn't think of anything clearer. Saying "the name will consist of a lowercase t, followed by the backend ID, followed by an underscore, followed by the filenode" did not seem better. > The comment for local_buffer_write_error_callback() probably meant to > say "during local buffer writes". No doubt. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Robert Haas <robertmhaas@gmail.com> writes: > On Fri, Aug 13, 2010 at 12:49 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Also, it'd be better if BBB and FFF were marked up as <replaceable> >> rather than <literal>, see examples elsewhere in that file. > I see. How should I mark tBBB_FFF? I'd do <literal>t<replaceable>BBB</replaceable>_<replaceable>FFF</replaceable></literal> regards, tom lane
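To make that naming concrete, here is a minimal sketch of the permanent-versus-temporary path split. The function names echo relpathperm() and relpathbackend() from the patch, but the signatures and bodies below are simplified illustrations, not patch code.

#include <stdio.h>

typedef int BackendId;
#define InvalidBackendId (-1)

/* permanent relation: base/<dboid>/<filenode> (sketch only) */
static char *
relpathperm_sketch(unsigned dboid, unsigned filenode)
{
    static char path[64];

    snprintf(path, sizeof(path), "base/%u/%u", dboid, filenode);
    return path;
}

/* temp relation: base/<dboid>/t<backendID>_<filenode> (sketch only) */
static char *
relpathbackend_sketch(unsigned dboid, BackendId backend, unsigned filenode)
{
    static char path[64];

    if (backend == InvalidBackendId)
        return relpathperm_sketch(dboid, filenode);
    snprintf(path, sizeof(path), "base/%u/t%d_%u", dboid, backend, filenode);
    return path;
}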
On Thu, Aug 12, 2010 at 1:29 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> Lastly, it bothers me that you've put in code to delete files belonging >> to temp rels during crash restart, without any code to clean up their >> catalog entries. This will therefore lead to dangling pg_class >> references, with uncertain but probably not very nice consequences. >> I think that probably shouldn't go in until there's code to drop the >> catalog entries too. > > I thought about this pretty carefully, and I don't believe that there > are any unpleasant consequences. The code that assigns relfilenode > numbers is pretty careful to check that the newly assigned value is > unused BOTH in pg_class and in the directory where the file will be > created, so there should be no danger of a number getting used over > again while the catalog entries remain. Also, the drop-object code > doesn't mind that the physical storage doesn't exist; it's perfectly > happy with that situation. It is true that you will get an ERROR if > you try to SELECT from a table whose physical storage has been > removed, but that seems OK. After further reflection, I think that the above reasoning is wrong. GetNewRelFileNode() guarantees that the OID doesn't collide with the OID column of pg_class, not the relfilenode column of pg_class; and, per the comments to that function, it only guarantees this when creating a new relation (to which we're therefore assigning an OID) and not when rewriting an existing relation. So it provides no protection against the scenario where backend #1 drops a table that lacks physical storage just as backend #2 assigns that OID to some other relation and begins creating files, which backend #1 then blows away. However, I think we're safe for a different reason. The only time we unlink the physical storage of a relation without nuking its catalog entries is during the startup sequence, when we wipe out the physical storage for any leftover temp tables. So if there's a race condition, it can only apply to temp tables. But temp tables for a particular backend ID can only be created by that backend, and before a backend will create any temp tables it will drop any previously existing temp schema. Thus, by the time a temp table can get created, there CAN'T be any leftover catalog entries from previous sessions for it to potentially collide with. I think the reasoning above probably should be added to the comment at the beginning of GetNewRelFileNode(), and I'll go do that unless someone thinks it's incorrect. The first sentence of that comment is also now technically incorrect: what it now does is generate a relfilenode such that <tablespace, relfilenode, backendid> is unique. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
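To make the two-sided check concrete, here is a toy illustration of what Robert describes GetNewRelFileNode() as actually doing: testing each candidate against the OID column of pg_class and against the filesystem. Every name and signature below is a hypothetical stand-in; the real code uses the OID generator and a catalog index scan rather than the simplistic substitutes here.

#include <stdio.h>
#include <unistd.h>

typedef unsigned int Oid;

/*
 * Hypothetical stand-in for the pg_class OID lookup; hard-wired to
 * "not found" only to keep the sketch self-contained.
 */
static int
oid_in_use_in_pg_class(Oid oid)
{
    (void) oid;
    return 0;
}

/* Does a file for this filenode already exist in the database directory? */
static int
filenode_file_exists(const char *dbdir, Oid filenode)
{
    char path[1024];

    snprintf(path, sizeof(path), "%s/%u", dbdir, filenode);
    return access(path, F_OK) == 0;
}

/*
 * Keep trying candidates until one is unused both in pg_class and on
 * disk.  (A plain increment stands in for the real OID generator.)
 */
static Oid
choose_new_filenode(const char *dbdir, Oid candidate)
{
    while (oid_in_use_in_pg_class(candidate) ||
           filenode_file_exists(dbdir, candidate))
        candidate++;

    return candidate;
}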
Robert Haas <robertmhaas@gmail.com> writes: > After further reflection, I think that the above reasoning is wrong. > GetNewRelFileNode() guarantees that the OID doesn't collide with the > OID column of pg_class, not the relfilenode column of pg_class; and, > per the comments to that function, it only guarantees this when > creating a new relation (to which we're therefore assigning an OID) > and not when rewriting an existing relation. So it provides no > protection against the scenario where backend #1 drops a table that > lacks physical storage just as backend #2 assigns that OID to some > other relation and begins creating files, which backend #1 then blows > away. > However, I think we're safe for a different reason [ omitted ] The above scenario is only a risk if you suppose that dropping a relation that lacks physical storage will nonetheless result in attempted unlink()s. I think that that's probably not the case; but it seems related to a question that was bothering me recently. Namely: why do we assign relfilenode values to relations that have no physical storage? If we did not do so, then relation drop would see a zero in relfilenode and would certainly not attempt an incorrect unlink(). So I'd like to look into setting relfilenode to zero for relation relkinds that lack storage. I'm not exactly sure why the code doesn't do that already, though. This came to mind after reading a commit from Bruce in which he had to hack up pg_upgrade to preserve relfilenode numbers for storage-less relations. Presumably that hack could get undone. regards, tom lane
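As a back-of-the-envelope illustration of that idea, with entirely hypothetical names and types rather than proposed code: if storage-less relkinds carried a zero relfilenode, the drop path could decide from that field alone.

#define InvalidFileNode 0

/* Hypothetical, pared-down relation descriptor for illustration only. */
struct rel_sketch
{
    char     relkind;       /* 'r' = table, 'v' = view, 'c' = composite type, ... */
    unsigned relfilenode;   /* would be 0 for relations with no storage */
};

/*
 * With a zero relfilenode for storage-less relations, the drop path
 * never has to consult relkind to decide whether an unlink() is due.
 */
static void
drop_relation_storage(const struct rel_sketch *rel)
{
    if (rel->relfilenode == InvalidFileNode)
        return;             /* nothing on disk: views, composite types, ... */

    /* ... otherwise schedule unlink of the relation's files ... */
}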
On Wed, Sep 15, 2010 at 12:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> After further reflection, I think that the above reasoning is wrong. >> GetNewRelFileNode() guarantees that the OID doesn't collide with the >> OID column of pg_class, not the relfilenode column of pg_class; and, >> per the comments to that function, it only guarantees this when >> creating a new relation (to which we're therefore assigning an OID) >> and not when rewriting an existing relation. So it provides no >> protection against the scenario where backend #1 drops a table that >> lacks physical storage just as backend #2 assigns that OID to some >> other relation and begins creating files, which backend #1 then blows >> away. > >> However, I think we're safe for a different reason [ omitted ] > > The above scenario is only a risk if you suppose that dropping a > relation that lacks physical storage will nonetheless result in > attempted unlink()s. I think that that's probably not the case; Why? How would we know that it didn't have physical storage prior to attempting the unlinks? > but it seems related to a question that was bothering me recently. > Namely: why do we assign relfilenode values to relations that have > no physical storage? If we did not do so, then relation drop would > see a zero in relfilenode and would certainly not attempt an incorrect > unlink(). I agree that setting relfilenode to 0 for relkinds that lack physical storage is a good idea, but that's obviously only going to handle that specific case. > This came to mind after reading a commit from Bruce in which he had to > hack up pg_upgrade to preserve relfilenode numbers for storage-less > relations. Presumably that hack could get undone. Seems like a good thing. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Robert Haas <robertmhaas@gmail.com> writes: > On Wed, Sep 15, 2010 at 12:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> The above scenario is only a risk if you suppose that dropping a >> relation that lacks physical storage will nonetheless result in >> attempted unlink()s. I think that that's probably not the case; > Why? How would we know that it didn't have physical storage prior to > attempting the unlinks? From the relkind. regards, tom lane
On Wed, Sep 15, 2010 at 12:33 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Wed, Sep 15, 2010 at 12:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> The above scenario is only a risk if you suppose that dropping a >>> relation that lacks physical storage will nonetheless result in >>> attempted unlink()s. I think that that's probably not the case; > >> Why? How would we know that it didn't have physical storage prior to >> attempting the unlinks? > > From the relkind. Oh, sure, I agree with you in that specific case. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company