Thread: Drop type "smgr"?
Hello hackers, The type smgr has only one value 'magnetic disk'. ~15 years ago it also had a value 'main memory', and in Berkeley POSTGRES 4.2 there was a third value 'sony jukebox'. Back then, all tables had an associated block storage manager, and it was recorded as an attribute relsmgr of pg_class (or pg_relation as it was known further back). This was the type of that attribute, removed by Bruce in 3fa2bb31 (1997). Nothing seems to break if you remove it (except for some tests using it in an incidental way). See attached. Motivation: A couple of projects propose to add new smgr implementations alongside md.c in order to use bufmgr.c for more kinds of files, but it seems entirely bogus to extend the unused smgr type to cover those. -- Thomas Munro https://enterprisedb.com
Thomas Munro <thomas.munro@gmail.com> writes: > Motivation: A couple of projects propose to add new smgr > implementations alongside md.c in order to use bufmgr.c for more kinds > of files, but it seems entirely bogus to extend the unused smgr type > to cover those. I agree that smgrtype as it stands is pretty pointless, but what will we be using instead to get to those other implementations? regards, tom lane
On Thu, Feb 28, 2019 at 7:08 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Thomas Munro <thomas.munro@gmail.com> writes: > > Motivation: A couple of projects propose to add new smgr > > implementations alongside md.c in order to use bufmgr.c for more kinds > > of files, but it seems entirely bogus to extend the unused smgr type > > to cover those. > > I agree that smgrtype as it stands is pretty pointless, but what > will we be using instead to get to those other implementations? Our current thinking is that smgropen() should know how to map a small number of special database OIDs to different smgr implementations (where currently it hard-codes smgr_which = 0). For example there would be a pseudo-database for undo logs, and another for the SLRUs that Shawn Debnath and others have been proposing to move into shared buffers. Another idea would be to widen the buffer tag to include the selector. Unlike the ancestral code, it wouldn't need to appear in catalogs or ever be seen or typed in by users so there still wouldn't be a use for this type. -- Thomas Munro https://enterprisedb.com
Thomas Munro <thomas.munro@gmail.com> writes: > On Thu, Feb 28, 2019 at 7:08 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I agree that smgrtype as it stands is pretty pointless, but what >> will we be using instead to get to those other implementations? > Our current thinking is that smgropen() should know how to map a small > number of special database OIDs to different smgr implementations Hmm. Maybe mapping based on tablespaces would be a better idea? > Unlike the ancestral code, it wouldn't need to appear in > catalogs or ever be seen or typed in by users so there still wouldn't > be a use for this type. Yeah, the $64 problem at this level is that you don't really want to depend on catalog contents, because you cannot do a lookup to find out what to do. So I agree that we're pretty unlikely to resurrect an smgr type per se. But I'd been expecting an answer mentioning pg_am OIDs, and was wondering how that'd work exactly. Probably, it would still be down to some C code having hard-wired knowledge about some OIDs ... regards, tom lane
On Thu, Feb 28, 2019 at 7:37 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Thomas Munro <thomas.munro@gmail.com> writes: > > On Thu, Feb 28, 2019 at 7:08 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> I agree that smgrtype as it stands is pretty pointless, but what > >> will we be using instead to get to those other implementations? > > > Our current thinking is that smgropen() should know how to map a small > > number of special database OIDs to different smgr implementations > > Hmm. Maybe mapping based on tablespaces would be a better idea? In the undo log proposal (about which more soon) we are using tablespaces for their real purpose, so we need that OID. If you SET undo_tablespaces = foo then future undo data created by your session will be written there, which might be useful for putting that IO on different storage. We also use the relation OID to chop up the undo address space into separate numbered undo logs, so that different sessions get their own space to insert data without trampling on each other's buffer locks. That leaves only the database OID to mess with (unless we widen buffer tags and the smgropen() interface). > > Unlike the ancestral code, it wouldn't need to appear in > > catalogs or ever be seen or typed in by users so there still wouldn't > > be a use for this type. > > Yeah, the $64 problem at this level is that you don't really want > to depend on catalog contents, because you cannot do a lookup to > find out what to do. Right. It would be a circular dependency if you had to read a catalog before you could consult low level SLRUs like (say) the clog* via shared buffers, since you need the clog to read the catalog, or (to peer a long way down the road and around several corners) if the catalogs themselves could optionally be stored in an undo-aware table access manager like zheap, which might require reading undo. 
I think this block storage stuff exists at a lower level than relations and transactions and therefore also catalogs, and there will be a small fixed number of them and it makes sense to hard-code the knowledge of them.

*I'm at least a little bit aware of the history here: your 2001 commit 2589735d moved clog out of shared buffers! That enabled you to develop the system of segment files truncated from the front. That's sort of what this new smgr work is about; putting things in shared buffers, but just mapping the blocks to paths differently than md.c, as appropriate.

-- Thomas Munro https://enterprisedb.com
On Thu, Feb 28, 2019 at 5:37 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Thomas Munro <thomas.munro@gmail.com> writes:
> On Thu, Feb 28, 2019 at 7:08 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I agree that smgrtype as it stands is pretty pointless, but what
>> will we be using instead to get to those other implementations?
> Our current thinking is that smgropen() should know how to map a small
> number of special database OIDs to different smgr implementations
Hmm. Maybe mapping based on tablespaces would be a better idea?
Thanks for bringing up this idea of multiple smgr implementations. I have
also thought of implementing our own smgr implementation to support
transparent data encryption on disk based on tablespace mapping.
Regards,
Haribabu Kommi
Fujitsu Australia
On Thu, Feb 28, 2019 at 1:03 AM Thomas Munro <thomas.munro@gmail.com> wrote: > The type smgr has only one value 'magnetic disk'. ~15 years ago it > also had a value 'main memory', and in Berkeley POSTGRES 4.2 there was > a third value 'sony jukebox'. Back then, all tables had an associated > block storage manager, and it was recorded as an attribute relsmgr of > pg_class (or pg_relation as it was known further back). This was the > type of that attribute, removed by Bruce in 3fa2bb31 (1997). > > Nothing seems to break if you remove it (except for some tests using > it in an incidental way). See attached. FWIW, +1 from me. I thought about arguing to remove this a number of years ago when I was poking around in this area for some reason, but it didn't seem important enough to be worth arguing about then. Now, because we're actually going to maybe-hopefully get some more smgrs that do interesting things, it seems worth the arguing... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Thomas Munro <thomas.munro@gmail.com> writes: > On Thu, Feb 28, 2019 at 7:37 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Thomas Munro <thomas.munro@gmail.com> writes: >>> Our current thinking is that smgropen() should know how to map a small >>> number of special database OIDs to different smgr implementations >> Hmm. Maybe mapping based on tablespaces would be a better idea? > In the undo log proposal (about which more soon) we are using > tablespaces for their real purpose, so we need that OID. If you SET > undo_tablespaces = foo then future undo data created by your session > will be written there, which might be useful for putting that IO on > different storage. Meh. That's a point, but it doesn't exactly seem like a killer argument. Just in the abstract, it seems much more likely to me that people would want per-database special rels than per-tablespace special rels. And I think your notion of a GUC that can control this is probably pie in the sky anyway: if we can't afford to look into the catalogs to resolve names at this code level, how are we going to handle a GUC? The real reason I'm concerned about this, though, is that for either a database or a tablespace, you can *not* get away with having a magic OID just hanging in space with no actual catalog row matching it. If nothing else, you need an entry there to prevent someone from reusing the OID for another purpose. And a pg_database row that doesn't correspond to a real database is going to break all kinds of code, starting with pg_upgrade and the autovacuum launcher. Special rows in pg_tablespace are much less likely to cause issues, because of the precedent of pg_global and pg_default. In short, I think you're better off equating smgrs to magic tablespaces, and if you need some way of letting users control where those map to storage-wise, control it in some other way. regards, tom lane
On Thu, Feb 28, 2019 at 10:09 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > The real reason I'm concerned about this, though, is that for either > a database or a tablespace, you can *not* get away with having a magic > OID just hanging in space with no actual catalog row matching it. > If nothing else, you need an entry there to prevent someone from > reusing the OID for another purpose. And a pg_database row that > doesn't correspond to a real database is going to break all kinds of > code, starting with pg_upgrade and the autovacuum launcher. Special > rows in pg_tablespace are much less likely to cause issues, because > of the precedent of pg_global and pg_default. My first intuition was the same as yours -- that we should use the tablespace to decide which smgr is relevant -- but I now think that intuition was wrong. Even if you use the tablespace OID to select the smgr, it doesn't completely solve the problem you're worried about here. You still have to put SOMETHING in the database OID field, and that's going to be just as fake as it was before. I guess you could use the OID for pg_global for things like undo and SLRU data, but then how is GetRelationPath() going to work? You don't have enough bits left anywhere to specify both a tablespace and a location within that tablespace in a reasonable way, and I think it's far from obvious how you would build a side channel that could carry that information. Also, I don't see why we'd need a fake pg_database row in the first place. IIUC, the OID counter wraps around to FirstNormalObjectId, so nobody should have a database with an OID less than that value. If we were using smgrs to represent other kinds of ways of storing user tables, e.g. network.c, cloud.c, magtape.c, punchcard.c, etc. 
-- then I think we'd have to find some way to let each user-defined tablespace pick an smgr, and I don't really know how that would work given that we can't use catalog lookups to figure out which one to use. But the actual direction here is to store files internal to the system using special-purpose smgrs so that we can pull more things into shared_buffers, and for that purpose the database OID seems cleaner. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > My first intuition was the same as yours -- that we should use the > tablespace to decide which smgr is relevant -- but I now think that > intuition was wrong. Even if you use the tablespace OID to select the > smgr, it doesn't completely solve the problem you're worried about > here. You still have to put SOMETHING in the database OID field, and > that's going to be just as fake as it was before. My thought was that that could be zero if not relevant ... isn't that what we do now for buffer tags for shared rels? > I guess you could > use the OID for pg_global for things like undo and SLRU data, but then > how is GetRelationPath() going to work? You don't have enough bits > left anywhere to specify both a tablespace and a location within that > tablespace in a reasonable way, and I think it's far from obvious how > you would build a side channel that could carry that information. It's certainly possible/likely that we're going to end up needing to widen buffer tags to represent the smgr explicitly, because some use cases are going to need a real database spec, some are going to need a real tablespace spec, and some might need both. Maybe we should just bite that bullet. regards, tom lane
On Thu, Feb 28, 2019 at 11:06 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > It's certainly possible/likely that we're going to end up needing to > widen buffer tags to represent the smgr explicitly, because some use > cases are going to need a real database spec, some are going to need > a real tablespace spec, and some might need both. Maybe we should > just bite that bullet. Well, Andres will probably complain about that. He thinks, IIUC, that the buffer tags are too wide already and that it's significantly hurting performance on very very common operations - like buffer lookups. I haven't verified that myself, but I tend to think he knows what he's talking about. Anyway, given that your argument started off from the premise that we had to have a pg_database row, I think we'd better look a little harder at whether that premise is correct before getting too excited here. As I said in my earlier reply, I think that we probably don't need to have a pg_database row given that we wrap around to FirstNormalObjectId; any value we hard-code would be less than that. If there are other reasons why we'd need that, it might be useful to hear about them. However, all we really need to decide on this thread is whether we need 'smgr' exposed as an SQL type. And I can't see why we need that no matter how all of the rest of this turns out. Nobody is currently proposing to give users a choice of smgrs, just to use them for internal stuff. Even if that changed later, it doesn't necessarily mean we'd add back an SQL type, or that if we did it would look like the one we have today. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Thu, Feb 28, 2019 at 1:03 AM Thomas Munro <thomas.munro@gmail.com> wrote: >> Nothing seems to break if you remove it (except for some tests using >> it in an incidental way). See attached. > FWIW, +1 from me. To be clear, I'm not objecting to the proposed patch either. I was just wondering where we plan to go from here, given that smgr.c wasn't getting removed. BTW, there is stuff in src/backend/storage/smgr/README that is already obsoleted by this patch, and more that might be obsoleted if development proceeds as discussed here. So that needs a look. regards, tom lane
On Thu, Feb 28, 2019 at 10:35:50AM -0500, Robert Haas wrote: > On Thu, Feb 28, 2019 at 10:09 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > The real reason I'm concerned about this, though, is that for either > > a database or a tablespace, you can *not* get away with having a magic > > OID just hanging in space with no actual catalog row matching it. > > If nothing else, you need an entry there to prevent someone from > > reusing the OID for another purpose. And a pg_database row that > > doesn't correspond to a real database is going to break all kinds of > > code, starting with pg_upgrade and the autovacuum launcher. Special > > rows in pg_tablespace are much less likely to cause issues, because > > of the precedent of pg_global and pg_default. > > Also, I don't see why we'd need a fake pg_database row in the first > place. IIUC, the OID counter wraps around to FirstNormalObjectId, so > nobody should have a database with an OID less than that value. We have scripts under the catalog directory that can check to ensure OIDs aren't re-used accidentally. However, we still have to define an entry in a catalog somewhere, and I was proposing creating a new one, pg_storage_managers?, to track these entries. See [1] for previous discussion on this topic. We wouldn't need to do catalog lookups to be able to use the smgrs, as the OIDs will be hardcoded in C, but the data will be available for posterity and OID reservation. > It's certainly possible/likely that we're going to end up needing to > widen buffer tags to represent the smgr explicitly, because some use > cases are going to need a real database spec, some are going to need > a real tablespace spec, and some might need both. Maybe we should > just bite that bullet. For now, the two projects that require the new smgrs, undo and slru, can get away with using the database OID as the smgr differentiator. We have enough leftover bits to get our work done (Thomas, please correct me if I am mistaken).
The only issue is pg_buffercache would present DB OIDs that wouldn't map correctly (along with the rest of the relfilenode). We can modify the output to include more columns to give caching-specific information when these special buffers are encountered. > Well, Andres will probably complain about that. He thinks, IIUC, that > the buffer tags are too wide already and that it's significantly > hurting performance on very very common operations - like buffer > lookups. I haven't verified that myself, but I tend to think he knows > what he's talking about. I can imagine it would. I am curious about the perf impact; going to at least create a patch and do some testing. Another thought: my colleague Anton Shyrabokau suggested potentially re-using forknumber to differentiate smgrs. We are using 32 bits to map 5 entries today. This approach would be similar to how we split up the segment numbers and use the higher bits to identify forget or unlink requests for checkpointer. > Anyway, given that your argument started off from the premise that we > had to have a pg_database row, I think we'd better look a little > harder at whether that premise is correct before getting too excited > here. As I said in my earlier reply, I think that we probably don't > need to have a pg_database row given that we wrap around to > FirstNormalObjectId; any value we hard-code would be less than that. > If there are other reasons why we'd need that, it might be useful to > hear about them. See above: I think a new catalog, instead of pg_database, can resolve the issue while providing the ability to reserve the OIDs. > However, all we really need to decide on this thread is whether we > need 'smgr' exposed as an SQL type. And I can't see why we need that > no matter how all of the rest of this turns out. Nobody is currently > proposing to give users a choice of smgrs, just to use them for > internal stuff.
Even if that changed later, it doesn't necessarily > mean we'd add back an SQL type, or that if we did it would look like > the one we have today. +1, the smgr type can be removed. -- Shawn Debnath Amazon Web Services (AWS)
On Thu, Feb 28, 2019 at 10:02:46AM -0800, Shawn Debnath wrote: > in a catalog somewhere and I was proposing creating a new one, > pg_storage_managers?, to track these entries. See [1] for previous > discussion on this topic. We wouldn't need to do catalog lookups for *ahem*, here's the footnote: [1] https://www.postgresql.org/message-id/20180821184835.GA1032%4060f81dc409fc.ant.amazon.com -- Shawn Debnath Amazon Web Services (AWS)
Hi, On 2019-02-28 12:36:50 -0500, Robert Haas wrote: > Well, Andres will probably complain about that. He thinks, IIUC, that > the buffer tags are too wide already and that it's significantly > hurting performance on very very common operations - like buffer > lookups. Correct. Turns out especially comparing the keys after the hash match is pretty expensive. It also is a significant factor influencing the size of the hashtable, which influences how much of it can be in cache. My plan is still to move to a two tiered system, where we have one unordered datastructure to map from (db, tablespace, oid) to a secondary ordered datastructure that then maps from (block number) to an actual offset. With the first being cached somewhere in RelationData, therefore not being performance critical. But while I hope to work for that in 13, I don't think making other large projects depend on it would be smart. Greetings, Andres Freund
Hi, On 2019-02-28 10:02:46 -0800, Shawn Debnath wrote: > We have scripts under catalog directory that can check to ensure OIDs > aren't re-used accidentally. However, we still have to define an entry > in a catalog somewhere and I was proposing creating a new one, > pg_storage_managers?, to track these entries. See [1] for previous > discussion on this topic. We wouldn't need to do catalog lookups for > being able to use the smgrs as the OIDs will be hardcoded in C, but the > data will be available for posterity and OID reservation. I'm inclined to just put them in pg_am, with a new type 's' (we already have amtype = i for indexes, I'm planning to add 't' for tables soon). While not a perfect fit for storage managers, it seems to fit well enough. > Another thought: my colleague Anton Shyrabokau suggested potentially > re-using forknumber to differentiate smgrs. We are using 32 bits to > map 5 entries today. This approach would be similar to how we split up > the segment numbers and use the higher bits to identify forget or > unlink requests for checkpointer. That could probably be done, without incurring too much overhead here. I'm not sure that the added complexity around the tree is worth it however. I personally would just go with the DB oid for the near future, where we don't need per-database storage for those. And then plan for buffer manager improvements. Greetings, Andres Freund
Shawn Debnath <sdn@amazon.com> writes: > On Thu, Feb 28, 2019 at 10:35:50AM -0500, Robert Haas wrote: >> Also, I don't see why we'd need a fake pg_database row in the first >> place. IIUC, the OID counter wraps around to FirstNormalObjectId, so >> nobody should have a database with an OID less than that value. > We have scripts under catalog directory that can check to ensure OIDs > aren't re-used accidentally. However, we still have to define an entry > in a catalog somewhere and I was proposing creating a new one, > pg_storage_managers?, to track these entries. That would fail to capture the property that the smgr OIDs mustn't conflict with database OIDs, so the whole thing still seems like an ugly kluge from here. > Another thought: my colleague Anton Shyrabokau suggested potentially > re-using forknumber to differentiate smgrs. We are using 32 bits to > map 5 entries today. Yeah, that seems like it might be a workable answer. I really do think that relying on magic database OIDs is a bad idea; if you think you aren't going to need a real database ID in there in the future, you're being short-sighted. And I guess the same argument establishes that relying on magic tablespace OIDs would also be a bad idea, even if there weren't a near-term proposal on the table for needing real tablespace IDs in an alternate smgr. So we need to find some bits somewhere else, and the fork number field is the obvious candidate. Since, per this discussion, the smgr IDs need not really be OIDs at all, we just need a few bits for them. regards, tom lane
On 2019-02-28 13:16:02 -0500, Tom Lane wrote: > Shawn Debnath <sdn@amazon.com> writes: > > On Thu, Feb 28, 2019 at 10:35:50AM -0500, Robert Haas wrote: > >> Also, I don't see why we'd need a fake pg_database row in the first > >> place. IIUC, the OID counter wraps around to FirstNormalObjectId, so > >> nobody should have a database with an OID less than that value. > > > We have scripts under catalog directory that can check to ensure OIDs > > aren't re-used accidentally. However, we still have to define an entry > > in a catalog somewhere and I was proposing creating a new one, > > pg_storage_managers?, to track these entries. > > That would fail to capture the property that the smgr OIDs mustn't > conflict with database OIDs, so the whole thing still seems like an > ugly kluge from here. It's definitely a kludge, but it doesn't seem that bad for now. Delaying a nice answer till we have a more efficient bufmgr representation doesn't seem crazy to me. I don't think there's a real conflict risk given that we don't allow for duplicated oids across catalogs at bootstrap time, and this is definitely a bootstrap time issue. > > Another thought: my colleague Anton Shyrabokau suggested potentially > > re-using forknumber to differentiate smgrs. We are using 32 bits to > > map 5 entries today. > > Yeah, that seems like it might be a workable answer. Yea, if we just split that into two 16bit entries, there'd not be much lost. Some mild performance regression due to more granular memory->register reads/writes, but I can't even remotely see that matter. > Since, per this discussion, the smgr IDs need not really be OIDs at > all, we just need a few bits for them. Personally I find having them as oids more elegant than the distaste from misusing the database oid for a while, but I think it's fair to disagree here. Greetings, Andres Freund
On Fri, Mar 1, 2019 at 7:24 AM Andres Freund <andres@anarazel.de> wrote: > On 2019-02-28 13:16:02 -0500, Tom Lane wrote: > > Shawn Debnath <sdn@amazon.com> writes: > > > Another thought: my colleague Anton Shyrabokau suggested potentially > > > re-using forknumber to differentiate smgrs. We are using 32 bits to > > > map 5 entries today. > > > > Yeah, that seems like it might be a workable answer. > > Yea, if we just split that into two 16bit entries, there'd not be much > lost. Some mild performance regression due to more granular > memory->register reads/writes, but I can't even remotely see that > matter. Ok, that's an interesting way to include it in BufferTag without making it wider. But then how about the SMGR interface? I think that value would need to be added to the smgropen() interface, and all existing callers would pass in (say) SMGR_RELATION (meaning "use md.c"), and new code would use SMGR_UNDO, SMGR_SLRU etc. That seems OK to me. Ancient POSTGRES had an extra argument like that and would say eg smgropen(rd->rd_rel->relsmgr, rd), but in this new idea I think it'd always be a constant or a value from a BufferTag, and the BufferTag would have been set with a constant, since the code reading these buffers would always be code that knows which it wants. We'd be using this new argument not as a modifier to control storage location as they did back then, but rather as a whole new namespace for RelFileNode values that also happens to be stored differently; that might be a hint that it could also go into RelFileNode (but I'm not suggesting that).
It sounds like your buffer mapping redesign would completely change the economics and we could reconsider much of this later without too much drama; that was one of the things that made me feel that the magic database OID approach was acceptable at least in the short term. -- Thomas Munro https://enterprisedb.com
On Fri, Mar 1, 2019 at 4:09 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Thomas Munro <thomas.munro@gmail.com> writes: > > On Thu, Feb 28, 2019 at 7:37 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> Thomas Munro <thomas.munro@gmail.com> writes: > >>> Our current thinking is that smgropen() should know how to map a small > >>> number of special database OIDs to different smgr implementations > > >> Hmm. Maybe mapping based on tablespaces would be a better idea? > > > In the undo log proposal (about which more soon) we are using > > tablespaces for their real purpose, so we need that OID. If you SET > > undo_tablespaces = foo then future undo data created by your session > > will be written there, which might be useful for putting that IO on > > different storage. > > Meh. That's a point, but it doesn't exactly seem like a killer argument. > Just in the abstract, it seems much more likely to me that people would > want per-database special rels than per-tablespace special rels. And > I think your notion of a GUC that can control this is probably pie in > the sky anyway: if we can't afford to look into the catalogs to resolve > names at this code level, how are we going to handle a GUC? 
I have this working like so:

* undo logs have a small amount of meta-data in shared memory, stored in a file at checkpoint time, with all changes WAL logged, visible to users in the pg_stat_undo_logs view
* one of the properties of an undo log is its tablespace (the point here being that it's not in a catalog)
* you don't need access to any catalogs to find the backing files for a RelFileNode (the path via tablespace symlinks is derivable from spcNode)
* therefore you can find your way from an UndoLogRecPtr in (say) a zheap page to the relevant blocks on disk without any catalog access; this should work even in the apparently (but not actually) circular case of a pg_tablespace catalog that is stored in zheap (not something we can do right now, but hypothetically speaking), and has undo data that is stored in some non-default tablespace that must be consulted while scanning the catalog (not that I'm suggesting it would necessarily be a good idea to support catalogs in non-default tablespaces; I'm just addressing your theoretical point)
* the GUC is used to resolve tablespace names to OIDs only by sessions that are writing, when selecting (or creating) an undo log to attach to and begin writing into; those sessions have no trouble reading the catalog to do so without problematic circularities, as above

Seems to work; the main complications so far were coming up with reasonable behaviour and interlocking when you drop tablespaces that contain undo logs (short version: if they're not needed for snapshots or rollback, they are dropped, wasting the rest of their undo address space; otherwise they prevent the tablespace from being dropped, with a clear message to that effect). It doesn't make any sense to put things like clog or any other SLRU in a non-default tablespace though. It's perfectly OK if not all smgr implementations know how to deal with tablespaces, and the SLRU support should just not support that.
> The real reason I'm concerned about this, though, is that for either > a database or a tablespace, you can *not* get away with having a magic > OID just hanging in space with no actual catalog row matching it. > If nothing else, you need an entry there to prevent someone from > reusing the OID for another purpose. And a pg_database row that > doesn't correspond to a real database is going to break all kinds of > code, starting with pg_upgrade and the autovacuum launcher. Special > rows in pg_tablespace are much less likely to cause issues, because > of the precedent of pg_global and pg_default. GetNewObjectId() never returns values < FirstNormalObjectId. I don't think it's impossible for someone to want to put SMGRs in a catalog of some kind some day. Even though the ones for clog, undo etc would still probably need special hard-coded treatment as discussed, I suppose it's remotely possible that someone might some day figure out a useful way to allow extensions that provide different block storage (nvram? zfs zvols? encryption? (see Haribabu's reply)) but I don't have any specific ideas about that or feel inclined to design something for unknown future use. -- Thomas Munro https://enterprisedb.com
On Fri, Mar 01, 2019 at 10:33:06AM +1300, Thomas Munro wrote:
> It doesn't make any sense to put things like clog or any other SLRU in
> a non-default tablespace though. It's perfectly OK if not all smgr
> implementations know how to deal with tablespaces, and the SLRU
> support should just not support that.

If the generic storage manager, or whatever we end up calling it, ends up being generic enough, it's possible that the tablespace value would have to be respected.

--
Shawn Debnath
Amazon Web Services (AWS)
On 2019-03-01 09:48:33 +1300, Thomas Munro wrote:
> On Fri, Mar 1, 2019 at 7:24 AM Andres Freund <andres@anarazel.de> wrote:
> > On 2019-02-28 13:16:02 -0500, Tom Lane wrote:
> > > Shawn Debnath <sdn@amazon.com> writes:
> > > > Another thought: my colleague Anton Shyrabokau suggested potentially
> > > > re-using forknumber to differentiate smgrs. We are using 32 bits to
> > > > map 5 entries today.
> > >
> > > Yeah, that seems like it might be a workable answer.
> >
> > Yea, if we just split that into two 16bit entries, there'd not be much
> > lost. Some mild performance regression due to more granular
> > memory->register reads/writes, but I can't even remotely see that
> > matter.
>
> Ok, that's an interesting way to include it in BufferTag without making
> it wider. But then how about the SMGR interface? I think that value
> would need to be added to the smgropen() interface, and all existing
> callers would pass in (say) SMGR_RELATION (meaning "use md.c"), and
> new code would use SMGR_UNDO, SMGR_SLRU etc. That seems OK to me.

Right, seems like we should do that independent of whether we end up reusing the dboid or not.

> > > Since, per this discussion, the smgr IDs need not really be OIDs at
> > > all, we just need a few bits for them.
> >
> > Personally I find having them as oids more elegant than the distaste
> > from misusing the database oid for a while, but I think it's fair to
> > disagree here.
>
> It sounds like your buffer mapping redesign would completely change
> the economics and we could reconsider much of this later without too
> much drama; that was one of the things that made me feel that the
> magic database OID approach was acceptable at least in the short term.

Right. FWIW, I think while distasteful, I could see us actually using oids, just ones that are small enough to fit into 16bit...

Greetings,

Andres Freund
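A minimal sketch of the 16+16-bit split being discussed: pack a smgr ID and a fork number into the single 32-bit slot that BufferTag currently spends on ForkNumber alone, so the tag does not get any wider. All names here (SmgrId, SMGR_RELATION, ...) are invented placeholders, not committed identifiers:

```c
#include <stdint.h>

/* Hypothetical smgr selectors; SMGR_RELATION means "use md.c". */
typedef enum SmgrId
{
    SMGR_RELATION = 0,   /* plain relations, backed by md.c */
    SMGR_UNDO,           /* undo logs */
    SMGR_SLRU            /* SLRUs in shared buffers */
} SmgrId;

/* high 16 bits: smgr ID; low 16 bits: fork number */
static inline uint32_t
pack_smgr_fork(SmgrId smgrid, uint16_t forknum)
{
    return ((uint32_t) smgrid << 16) | forknum;
}

static inline SmgrId
unpack_smgrid(uint32_t packed)
{
    return (SmgrId) (packed >> 16);
}

static inline uint16_t
unpack_forknum(uint32_t packed)
{
    return (uint16_t) (packed & 0xffff);
}
```

Existing code would always pack SMGR_RELATION, so hash and comparison of unmodified tags behave exactly as before.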
Andres Freund <andres@anarazel.de> writes:
> FWIW, I think while distasteful, I could see us actually using oids,
> just ones that are small enough to fit into 16bit...

If we suppose that all smgrs must be built-in, that's not even much of a restriction...

			regards, tom lane
On Thu, Feb 28, 2019 at 02:08:49PM -0800, Andres Freund wrote:
> On 2019-03-01 09:48:33 +1300, Thomas Munro wrote:
> > On Fri, Mar 1, 2019 at 7:24 AM Andres Freund <andres@anarazel.de> wrote:
> > > On 2019-02-28 13:16:02 -0500, Tom Lane wrote:
> > > > Shawn Debnath <sdn@amazon.com> writes:
> > > > > Another thought: my colleague Anton Shyrabokau suggested potentially
> > > > > re-using forknumber to differentiate smgrs. We are using 32 bits to
> > > > > map 5 entries today.
> > > >
> > > > Yeah, that seems like it might be a workable answer.
> > >
> > > Yea, if we just split that into two 16bit entries, there'd not be much
> > > lost. Some mild performance regression due to more granular
> > > memory->register reads/writes, but I can't even remotely see that
> > > matter.
> >
> > Ok, that's an interesting way to include it in BufferTag without making
> > it wider. But then how about the SMGR interface? I think that value
> > would need to be added to the smgropen() interface, and all existing
> > callers would pass in (say) SMGR_RELATION (meaning "use md.c"), and
> > new code would use SMGR_UNDO, SMGR_SLRU etc. That seems OK to me.
>
> Right, seems like we should do that independent of whether we end up
> reusing the dboid or not.

Food for thought: if we are going to muck with the smgr APIs, would it make sense to move away from SMgrRelation to something a bit more generic like, say, SMgrHandle, to better organize the internal contents of this structure? Internally, we could move things into a union and, based on the type of handle (relation, undo, slru/generic), translate the contents correctly. As you can guess, SMgrRelationData today is very specific to relations and holds md-specific information whose memory would be better re-used by the other storage managers.

Thoughts?

--
Shawn Debnath
Amazon Web Services (AWS)
On Fri, Mar 1, 2019 at 10:41 AM Shawn Debnath <sdn@amazon.com> wrote:
> On Fri, Mar 01, 2019 at 10:33:06AM +1300, Thomas Munro wrote:
> > It doesn't make any sense to put things like clog or any other SLRU in
> > a non-default tablespace though. It's perfectly OK if not all smgr
> > implementations know how to deal with tablespaces, and the SLRU
> > support should just not support that.
>
> If the generic storage manager, or whatever we end up calling it, ends
> up being generic enough - it's possible that the tablespace value would
> have to be respected.

Right, you and I have discussed this a bit off-list, but for the benefit of others, I think what you're getting at with "generic storage manager" here is something like this: on the one hand, our proposed revival of SMGR as a configuration point is about supporting alternative file layouts for bufmgr data, but at the same time there is some background noise about direct IO, block encryption, ... and who knows what alternative block storage someone might come up with ... at the block level. So although it sounds a bit contradictory to be saying "let's make all these different SMGRs!" at the same time as saying "but we'll eventually need a single generic SMGR that is smart enough to be parameterised for all of these layouts!", I see why you say it. In fact, the prime motivation for putting SLRUs into shared buffers is to get better buffering, because (anecdotally) slru.c's mini-buffer scheme performs abysmally without the benefit of an OS page cache. If we add optional direct IO support (something I really want), we need it to apply to SLRUs, undo and relations, ideally without duplicating code, so we'd probably want to chop things up differently. At some point I think we'll need to separate the questions "how to map blocks to filenames and offsets" and "how to actually perform IO".
I think the first question would be controlled by the SMGR IDs as discussed, but the second question probably needs to be controlled by GUCs that control all IO, and/or special per relation settings (supposing you can encrypt just one table, as a random example I know nothing about); but that seems way out of scope for the present projects. IMHO the best path from here is to leave md.c totally untouched for now as the SMGR for plain old relations, while we work on getting these new kinds of bufmgr data into the tree as a first step, and a later hypothetical direct IO or whatever project can pay for the refactoring to separate IO from layout. -- Thomas Munro https://enterprisedb.com
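One way to picture the separation of those two questions, with entirely invented names (the segment arithmetic mirrors md.c's 1GB segments of 8KB blocks, but nothing here is a real PostgreSQL interface):

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical sketch: one table answers "where does this block live?"
 * (supplied by an smgr implementation, selected by SMGR ID) and a second,
 * orthogonal table performs the IO itself (selected by GUCs or
 * per-relation settings: buffered, direct, encrypted, ...).
 */
typedef struct BlockLayoutOps
{
    /* map a block number to a segment number and byte offset */
    void (*blocklocation)(uint32_t blocknum, uint32_t *segno,
                          uint64_t *offset);
} BlockLayoutOps;

typedef struct BlockIOOps
{
    /* perform the IO against an already-resolved file and offset */
    int (*readblock)(int fd, uint64_t offset, void *buf, size_t len);
    int (*writeblock)(int fd, uint64_t offset, const void *buf, size_t len);
} BlockIOOps;

/* example of the layout half: md.c-style 1GB segments of 8KB blocks */
#define BLCKSZ 8192
#define RELSEG_SIZE 131072      /* blocks per segment, as in md.c */

static void
md_style_blocklocation(uint32_t blocknum, uint32_t *segno, uint64_t *offset)
{
    *segno = blocknum / RELSEG_SIZE;
    *offset = (uint64_t) (blocknum % RELSEG_SIZE) * BLCKSZ;
}
```

Under this split, a direct IO or encryption project would only replace the BlockIOOps table, leaving every layout implementation untouched.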
On Fri, Mar 1, 2019 at 11:31 AM Shawn Debnath <sdn@amazon.com> wrote:
> On Thu, Feb 28, 2019 at 02:08:49PM -0800, Andres Freund wrote:
> > On 2019-03-01 09:48:33 +1300, Thomas Munro wrote:
> > > On Fri, Mar 1, 2019 at 7:24 AM Andres Freund <andres@anarazel.de> wrote:
> > > > On 2019-02-28 13:16:02 -0500, Tom Lane wrote:
> > > > > Shawn Debnath <sdn@amazon.com> writes:
> > > > > > Another thought: my colleague Anton Shyrabokau suggested potentially
> > > > > > re-using forknumber to differentiate smgrs. We are using 32 bits to
> > > > > > map 5 entries today.
> > > > >
> > > > > Yeah, that seems like it might be a workable answer.
> > > >
> > > > Yea, if we just split that into two 16bit entries, there'd not be much
> > > > lost. Some mild performance regression due to more granular
> > > > memory->register reads/writes, but I can't even remotely see that
> > > > matter.
> > >
> > > Ok, that's an interesting way to include it in BufferTag without making
> > > it wider. But then how about the SMGR interface? I think that value
> > > would need to be added to the smgropen() interface, and all existing
> > > callers would pass in (say) SMGR_RELATION (meaning "use md.c"), and
> > > new code would use SMGR_UNDO, SMGR_SLRU etc. That seems OK to me.
> >
> > Right, seems like we should do that independent of whether we end up
> > reusing the dboid or not.
>
> Food for thought: if we are going to muck with the smgr APIs, would it
> make sense to move away from SMgrRelation to something a bit more
> generic like, say, SMgrHandle to better organize the internal contents
> of this structure? Internally, we could move things into a union and,
> based on the type of handle (relation, undo, slru/generic), translate
> the contents correctly. As you can guess, SMgrRelationData today is
> very specific to relations and holds md-specific information whose
> memory would be better re-used by the other storage managers.
>
> Thoughts?
Right, it does contain some md-specific stuff without even apologising. Also smgropen() was rendered non-virtual at some point (I mean that implementations don't even get a chance to initialise anything, which works out only because md-specific code has leaked into smgr.c). In one of my undo patches (which I'll post an updated version of on the appropriate thread soon) I added a void * called private_data so that undo_file.c can keep track of some stuff, but yeah I agree that more tidying up could be done. -- Thomas Munro https://enterprisedb.com
On Fri, Mar 01, 2019 at 11:38:49AM +1300, Thomas Munro wrote:
> > Food for thought: if we are going to muck with the smgr APIs, would it
> > make sense to move away from SMgrRelation to something a bit more
> > generic like, say, SMgrHandle to better organize the internal contents
> > of this structure? Internally, we could move things into a union and
> > based on type of handle: relation, undo, slru/generic, translate the
> > contents correctly. As you can guess, SMgrRelationData today is very
> > specific to relations and holding md specific information whose memory
> > would be better served re-used for the other storage managers.
> >
> > Thoughts?
>
> Right, it does contain some md-specific stuff without even
> apologising. Also smgropen() was rendered non-virtual at some point
> (I mean that implementations don't even get a chance to initialise
> anything, which works out only because md-specific code has leaked
> into smgr.c). In one of my undo patches (which I'll post an updated
> version of on the appropriate thread soon) I added a void * called
> private_data so that undo_file.c can keep track of some stuff, but
> yeah I agree that more tidying up could be done.

I can send out a patch for this (on a separate thread!) to unblock us both, unless you are closer to completion on this. I prefer the union approach to make it more readable. I was considering re-factoring the structure to SMgrHandle and having the relation-specific structure retain SMgrRelationData. For undo we could have SMgrUndoData, and similarly for SLRU (I will come up with a better name than generic). Then have these be in the union instead of the individual members of the struct.

--
Shawn Debnath
Amazon Web Services (AWS)
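A rough sketch of what such a union-based handle might look like. Every name and field below is invented for illustration, not a proposal for the actual layout:

```c
#include <stdint.h>

/* Hypothetical discriminated handle replacing SMgrRelationData. */
#define MAX_FORKNUM 3

typedef enum SMgrHandleType
{
    SMGR_HANDLE_RELATION,   /* md.c */
    SMGR_HANDLE_UNDO,       /* undo_file.c */
    SMGR_HANDLE_SLRU        /* slru-in-shared-buffers */
} SMgrHandleType;

typedef struct SMgrHandle
{
    SMgrHandleType type;    /* discriminates the union below */
    uint32_t       spcNode;
    uint32_t       dbNode;
    uint32_t       relNode;
    union
    {
        struct                  /* relation state: per-fork segment counts */
        {
            int nsegs[MAX_FORKNUM + 1];
        } rel;
        struct                  /* undo state: opaque, like private_data */
        {
            void *private_data;
        } undo;
        struct                  /* SLRU state: nothing extra (yet) */
        {
            int unused;
        } slru;
    } u;
} SMgrHandle;
```

The union means an undo or SLRU handle no longer pays for md.c's per-fork state, and each implementation gets a place to initialise its own fields at open time.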
On 01.03.2019 1:32, Thomas Munro wrote:
> On Fri, Mar 1, 2019 at 10:41 AM Shawn Debnath <sdn@amazon.com> wrote:
>> On Fri, Mar 01, 2019 at 10:33:06AM +1300, Thomas Munro wrote:
>>> It doesn't make any sense to put things like clog or any other SLRU in
>>> a non-default tablespace though. It's perfectly OK if not all smgr
>>> implementations know how to deal with tablespaces, and the SLRU
>>> support should just not support that.
>> If the generic storage manager, or whatever we end up calling it, ends
>> up being generic enough - it's possible that the tablespace value would
>> have to be respected.
> Right, you and I have discussed this a bit off-list, but for the
> benefit of others, I think what you're getting at with "generic
> storage manager" here is something like this: on the one hand, our
> proposed revival of SMGR as a configuration point is about
> supporting alternative file layouts for bufmgr data, but at the same
> time there is some background noise about direct IO, block encryption,
> ... and who knows what alternative block storage someone might come up
> with ... at the block level. So although it sounds a bit
> contradictory to be saying "let's make all these different SMGRs!" at
> the same time as saying "but we'll eventually need a single generic
> SMGR that is smart enough to be parameterised for all of these
> layouts!", I see why you say it. In fact, the prime motivation for
> putting SLRUs into shared buffers is to get better buffering, because
> (anecdotally) slru.c's mini-buffer scheme performs abysmally without
> the benefit of an OS page cache. If we add optional direct IO support
> (something I really want), we need it to apply to SLRUs, undo and
> relations, ideally without duplicating code, so we'd probably want to
> chop things up differently. At some point I think we'll need to
> separate the questions "how to map blocks to filenames and offsets"
> and "how to actually perform IO". I think the first question would be
> controlled by the SMGR IDs as discussed, but the second question
> probably needs to be controlled by GUCs that control all IO, and/or
> special per relation settings (supposing you can encrypt just one
> table, as a random example I know nothing about); but that seems way
> out of scope for the present projects. IMHO the best path from here
> is to leave md.c totally untouched for now as the SMGR for plain old
> relations, while we work on getting these new kinds of bufmgr data
> into the tree as a first step, and a later hypothetical direct IO or
> whatever project can pay for the refactoring to separate IO from
> layout.

I completely agree with this statement:

> At some point I think we'll need to separate the questions "how to map
> blocks to filenames and offsets" and "how to actually perform IO".

There are two subsystems developed in PgPro which are integrated into Postgres at the file IO level: CFS (compressed file system) and SnapFS (fast database snapshots). The first provides page-level encryption and compression, the second a mechanism for fast restoration of database state. Both are implemented by patching fd.c. My first idea was to implement them as alternative storage devices (alternative to md.c), but that would require duplicating all the segment-mapping logic from md.c plus the file descriptor cache from fd.c. It would be nice if it were possible to redefine the raw file operations (FileWrite, FileRead, ...) without affecting the segment-mapping logic.

One more thing... From my point of view, one of the drawbacks of Postgres is that it requires an underlying file system and is not able to work with raw partitions. It seems to me that bypassing the file system layer could significantly improve performance and give more possibilities for IO performance tuning. Certainly it would require a lot of changes in the Postgres storage layer, so this is not what I suggest to implement or even discuss right now.
But it may be useful to keep it in mind in discussions concerning "generic storage manager". -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
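For illustration only, the kind of interposition described above (a layer like CFS sitting below the segment-mapping logic) could be sketched as a swappable table of page transforms. Every name here is invented, and the XOR "encryption" is a toy standing in for real compression or encryption:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical sketch: route fd.c-level page IO through a replaceable ops
 * table so that a transforming layer (compression, encryption, snapshots)
 * can interpose without duplicating md.c's segment mapping.
 */
typedef struct PageIOOps
{
    void (*transform_out)(uint8_t *page, size_t len);  /* before write */
    void (*transform_in)(uint8_t *page, size_t len);   /* after read */
} PageIOOps;

static void
identity(uint8_t *page, size_t len)
{
    (void) page;
    (void) len;                 /* pass pages through untouched */
}

static void
toy_xor(uint8_t *page, size_t len)
{
    for (size_t i = 0; i < len; i++)
        page[i] ^= 0x5a;        /* XOR is its own inverse */
}

/* default ops and an interposed "encrypting" layer */
static const PageIOOps plain_ops = { identity, identity };
static const PageIOOps toy_crypt_ops = { toy_xor, toy_xor };

/* the active table; a GUC or build-time option could select it */
static const PageIOOps *active_io_ops = &plain_ops;
```

FileRead/FileWrite would call transform_in/transform_out around the actual system calls, so callers higher up (md.c and friends) never notice the layer.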
On Fri, Mar 1, 2019 at 9:11 PM Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:
> One more thing... From my point of view one of the drawbacks of Postgres
> is that it requires an underlying file system and is not able to work
> with raw partitions.
> It seems to me that bypassing the file system layer can significantly
> improve performance and give more possibilities for IO performance tuning.
> Certainly it will require a lot of changes in the Postgres storage layer
> so this is not what I suggest to implement or even discuss right now.
> But it may be useful to keep it in mind in discussions concerning
> "generic storage manager".

Hmm. Speculation-around-the-water-cooler mode: I think the arguments for using raw partitions are approximately the same as the arguments for using a big data file that holds many relations. The three I can think of are (1) the space is entirely preallocated, which *might* have performance and safety advantages, (2) if you encrypt it, no one can see the structure (database OIDs, relation OIDs and sizes), and (3) it might allow pages to be moved from one relation to another without copying, or interleaved in interesting ways (think SQL Server parallel btree creation that "stitches together" btrees produced by parallel workers, or Oracle "clustered" tables where the pages of two tables are physically interleaved in an order that works nicely when you join those two tables, or perhaps schemes for moving data between relations/partitions quickly). On the other hand, to make that work you have to own the problem of space allocation/management that we currently leave to the authors of XFS et al, and those folks have worked on that for *years and years* and their filesystems work really well. If you made all that work for big preallocated data files, then sure, you could also make it work for raw partitions, but I'm not sure how much performance advantage there is in that final step.
I suspect that a major reason for Oracle to support raw block devices many years ago was because back then there was no other way to escape from the page cache. Direct IO hasn't always been available or portable, and hasn't always worked well. That said, it does seem plausible that we could do the separation of (1) block -> pathname/offset mappings and (2) actual IO operations in a way that you could potentially write your own pseudo-filesystem that stores a whole PostgreSQL cluster inside big data files or raw partitions. Luckily we don't need to tackle such mountainous terrain to avoid the page cache, today. -- Thomas Munro https://enterprisedb.com
On Thu, Feb 28, 2019 at 7:02 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> The type smgr has only one value 'magnetic disk'. ~15 years ago it
> also had a value 'main memory', and in Berkeley POSTGRES 4.2 there was
> a third value 'sony jukebox'. Back then, all tables had an associated
> block storage manager, and it was recorded as an attribute relsmgr of
> pg_class (or pg_relation as it was known further back). This was the
> type of that attribute, removed by Bruce in 3fa2bb31 (1997).
>
> Nothing seems to break if you remove it (except for some tests using
> it in an incidental way). See attached.

Pushed. Thanks all for the interesting discussion.

I'm trying out Anton Shyrabokau's suggestion of stealing bits from the fork number. More on that soon.

--
Thomas Munro
https://enterprisedb.com