Thread: OK, OK, Hiroshi's right: use a seperately-generated filename
After further thought I think there's a lot of merit in Hiroshi's opinion that physical file names should not be tied to relation OID. If we use a separately generated value for the file name, we can solve a lot of problems pretty nicely by means of "table versioning". For example: VACUUM can't compact indexes at the moment, and what it does do (scan the index and delete unused entries) is really slow. The right thing to do is for it to generate an all-new index file, but how do we do that without creating a risk of leaving the index corrupted if we crash partway through? The answer is to build the new index in a new physical file. But how do we install the new file as the real index atomically, when it might span multiple segments? If the physical file name is decoupled from the relation's name *and* OID then there is no problem: the atomic event that makes the new file(s) the real table contents is the commit of the new pg_class row with the new value for the physical filename. Aside from possible improvements in VACUUM, this would let us do a robust implementation of CLUSTER, and we could do the "really change the table" variant of ALTER TABLE DROP COLUMN the same way if anyone wants to do it. The only cost is that we need an additional column in pg_class to hold the physical file name. That's not so bad, especially when you remember that we'd surely need to add something to pg_class for tablespace support anyway. If we bite that bullet, then we could also do something to satisfy Bruce about having legible file names ;-). The column in pg_class could perfectly well be a string, not a pure number, and that means that we can throw in the relname (truncated to fit of course). So the thing would act a lot like the original-relname-plus-OID variant that's been discussed so far. (Original relname because ALTER TABLE RENAME would *not* change the physical file name. But we could think about a form of VACUUM that creates a whole new table by versioning, and that would presumably bring the physical name back in sync with the logical relname.) Here is a sketch of a concrete proposal. I see no need to have separate pg_class columns for tablespace and physical relname; instead, I suggest there be a column of type NAME that is the file pathname (relative to the database directory). Further, instead of the existing convention of appending .N to the base file name to make extension segment names, I propose that we always have a segment number in the physical file name, and that the pg_class entry be required to contain a "%d" somewhere that indicates where. The actual filename is manufactured bysprintf(tempbuf, value_from_pg_class_column, segment_number); As an example, the arrangement I was suggesting earlier today about segments in different subdirectories of a tablespace could be implemented by assigning physical filenames like tablespace/%d/12345_relname where the 12345 is a value generated separately from the table's OID. (We would still use the OID counter to produce these numbers, and in fact there's no reason not to use the table's OID as the initial unique ID for the physical filename. The point is just that the physical filename doesn't have to remain forever equal to the relation's OID.) If we use type NAME for this string then the tablespace part of the path would have to be kept to no more than ~ 15 characters, but that seems workable enough. (Anybody who really didn't like that could recompile with larger NAMEDATALEN. Doesn't seem worth inventing a separate type.) As Hiroshi pointed out, one of the best aspects of this approach is that the physical table layout policy doesn't have to be hard-wired into low-level file access routines. The low-level routines don't need to know much of anything about the format of the pathname, they just stuff in the right segment number and use the name. The layout policy need only be known to one single routine that generates the strings that go into pg_class. So it'd be really easy to change. One thing we'd have to work out is that the critical system tables (eg, pg_class itself, as well as its indexes) would have to have predictable physical names. Otherwise there's no way for a new backend to bootstrap itself up ... it can't very well read pg_class to find out where pg_class is. A brute-force solution is to forbid reversioning of the critical tables, but I suspect we can find a less restrictive answer. This seems like it'd satisfy all the concerns that have been raised. Comments? regards, tom lane
Tom Lane wrote: So > the thing would act a lot like the original-relname-plus-OID variant > that's been discussed so far. (Original relname because ALTER TABLE > RENAME would *not* change the physical file name. But we could > think about a form of VACUUM that creates a whole new table by > versioning, and that would presumably bring the physical name back > in sync with the logical relname.) At least on UNIX, couldn't you use a hard-link and change the name in pg_class immediately? Let the brain-dead operating systems use the vacuum method.
Chris Bitmead <chris@bitmead.com> writes: > At least on UNIX, couldn't you use a hard-link and change the name in > pg_class immediately? Let the brain-dead operating systems use the > vacuum method. Hmm ... maybe, but it doesn't seem worth the portability headache to me. We do have an NT port that we don't want to break, and I don't think RENAME TABLE is worth the trouble of testing/supporting two implementations. Even on Unix, aren't there filesystems that don't do hard links? Not that I'd recommend running Postgres on such a volume, but... regards, tom lane
> -----Original Message----- > From: pgsql-hackers-owner@hub.org [mailto:pgsql-hackers-owner@hub.org]On > Behalf Of Tom Lane > > After further thought I think there's a lot of merit in Hiroshi's > opinion that physical file names should not be tied to relation OID. > If we use a separately generated value for the file name, we can > solve a lot of problems pretty nicely by means of "table versioning". > > For example: VACUUM can't compact indexes at the moment, and what it > does do (scan the index and delete unused entries) is really slow. > The right thing to do is for it to generate an all-new index file, > but how do we do that without creating a risk of leaving the index > corrupted if we crash partway through? The answer is to build the > new index in a new physical file. But how do we install the new > file as the real index atomically, when it might span multiple > segments? If the physical file name is decoupled from the relation's > name *and* OID then there is no problem: the atomic event that makes > the new file(s) the real table contents is the commit of the new > pg_class row with the new value for the physical filename. > > Aside from possible improvements in VACUUM, this would let us do a > robust implementation of CLUSTER, and we could do the "really change > the table" variant of ALTER TABLE DROP COLUMN the same way if anyone > wants to do it. > Yes,I've wondered how do we implement column_is_really_dropped ALTER TABLE DROP COLUMN feature without this kind of mechanism. > The only cost is that we need an additional column in pg_class to > hold the physical file name. That's not so bad, especially when > you remember that we'd surely need to add something to pg_class for > tablespace support anyway. > > If we bite that bullet, then we could also do something to satisfy > Bruce about having legible file names ;-). The column in pg_class > could perfectly well be a string, not a pure number, and that means > that we can throw in the relname (truncated to fit of course). So > the thing would act a lot like the original-relname-plus-OID variant > that's been discussed so far. (Original relname because ALTER TABLE > RENAME would *not* change the physical file name. But we could > think about a form of VACUUM that creates a whole new table by > versioning, and that would presumably bring the physical name back > in sync with the logical relname.) > > As Hiroshi pointed out, one of the best aspects of this approach > is that the physical table layout policy doesn't have to be hard-wired > into low-level file access routines. The low-level routines don't > need to know much of anything about the format of the pathname, > they just stuff in the right segment number and use the name. The > layout policy need only be known to one single routine that generates > the strings that go into pg_class. So it'd be really easy to change. > Ross's approach is fundamentally same though he is using relname+OID naming rule. I've said his trial is most practical one. > One thing we'd have to work out is that the critical system tables > (eg, pg_class itself, as well as its indexes) would have to have > predictable physical names. The only limitation of the relation filename is the uniqueness. So it doesn't introduce any inconsistency that system tables have fixed name. As for system relations it wouldn't be so bad because CLUSTER/ ALTER TABLE DROP COLUMN ... would be unnecessary(maybe). But as for system indexes,it is preferable that VACUUM/REINDEX could rebuild them safely. System indexes never shrink currently. Regards. Hiroshi Inoue Inoue@tpf.co.jp
Tom Lane writes: > tablespace/%d/12345_relname Throwing table spaces and relation names into one pot doesn't excite me very much. For example, before long people will want to * Query what tables are in what space (without using string operations) Consider for example creating a new table and choosing where to put it. * Rename table spaces * Assign attributes of some sort to table spaces (permissions, etc.) * Use table space names with more than 15 characters. :) Somehow table spaces need to be catalogued. You could still make the physical file name 'tablespaceoid/rest' without actually having to look up anything, although that depends on your symlink idea which is still under discussion. Then, why are all nth segments of tables in one directory in that proposal? Also, you said before that an old relname (after rename) is worse than none at all. I couldn't agree more. Why not use OID.[SEGMENT.]VERSION for the physical relname (different order possible)? That way you at least have some guaranteed correspondence between files and tables. Version could probably be an INT2, so you save some space. -- Peter Eisentraut Sernanders väg 10:115 peter_e@gmx.net 75262 Uppsala http://yi.org/peter-e/ Sweden
"Hiroshi Inoue" <Inoue@tpf.co.jp> writes: >> One thing we'd have to work out is that the critical system tables >> (eg, pg_class itself, as well as its indexes) would have to have >> predictable physical names. > The only limitation of the relation filename is the uniqueness. > So it doesn't introduce any inconsistency that system tables > have fixed name. > As for system relations it wouldn't be so bad because CLUSTER/ > ALTER TABLE DROP COLUMN ... would be unnecessary(maybe). > But as for system indexes,it is preferable that VACUUM/REINDEX > could rebuild them safely. System indexes never shrink currently. Right, it's the index-shrinking business that has me worried. Most of the other reasons for swapping in a new file don't apply to system tables, but that one does. One possibility is to say that system *tables* can't be reversioned (at least not the critical ones) but system *indexes* can be. Then we'd have to use your ignore-system-indexes stuff during backend startup, until we'd found out where the indexes are. Might be too big a time penalty however... not sure. Shared cache inval of a system index could be a little tricky too; I don't think the catcache routines are prepared to fall back to non-index scan are they? On the whole it might be better to cheat by using a side data structure like the pg_internal.init file, that a backend could consult to find out where the indexes are now. regards, tom lane
Peter Eisentraut <peter_e@gmx.net> writes: > Somehow table spaces need to be catalogued. Sure. Undoubtedly there'll be a pg_tablespace table somewhere. However, I don't think it's a good idea to have to consult pg_tablespace to find out where a table actually is --- I think the pathname (or smgr access token as Ross would call it ;-)) ought to be determinable from just the pg_class entry. It would probably be best to expend an additional 4 bytes per pg_class entry to record the OID of the table's tablespace, just so you could do joins easily without having to do string matching (and assume an uncomfortable amount about the format of the pathname). Having the pathname in the pg_class entry too represents some denormalization, but I think it's the safest way. > For example, before long people will want to > * Query what tables are in what space (without using string operations) > * Rename table spaces > * Assign attributes of some sort to table spaces (permissions, etc.) > * Use table space names with more than 15 characters. :) Tablespaces can have logical names stored in pg_tablespace; they just can't contribute more than a dozen or so characters to file pathnames under the implementation I'm proposing. That doesn't seem too unreasonable; the pathname part can be some sort of abbreviated name. The alternative is to enlarge smgr access tokens to something like 64 bytes. I'd rather keep them as compact as we can, since we're going to need to store them in places like the bufmgr's shared-buffer headers (remember the blind write problem). > Then, why are all nth segments of tables in one directory in that > proposal? It's better than *all* segments of tables in one directory, which is what you get if the segment number is just a component of a flat file name. We have to have a better answer than that for people who need to cope with tables bigger than a disk. Perhaps someone can think of a better answer than subdirectory-per-segment-number, but I think that will work well enough; and it doesn't add any complexity for file access. > Also, you said before that an old relname (after rename) is worse than > none at all. I couldn't agree more. I'm not the one who wants relnames in the physical names ;-). However, this implementation mechanism will support either policy choice --- original relname in the filename, or just a numeric ID for the filename --- and that seems like a good sign to me. > Why not use OID.[SEGMENT.]VERSION for the physical relname (different > order possible)? Doesn't give you a manageable way to split segments across different disks. regards, tom lane
Tom Lane wrote: > > Also, you said before that an old relname (after rename) is worse than > > none at all. I couldn't agree more. > > I'm not the one who wants relnames in the physical names ;-). However, > this implementation mechanism will support either policy choice --- > original relname in the filename, or just a numeric ID for the filename > --- and that seems like a good sign to me. > > > Why not use OID.[SEGMENT.]VERSION for the physical relname (different > > order possible)? Unless VERSION is globally unique like an oid is, having RELNAME.VERSION would be a problem if you created a table with the same name as a recently renamed table.
Tom Lane writes: > I don't think it's a good idea to have to consult pg_tablespace to find > out where a table actually is --- I think the pathname (or smgr access > token as Ross would call it ;-)) ought to be determinable from just the > pg_class entry. That's why I suggested the table space oid. That would be readily available from pg_class. > Tablespaces can have logical names stored in pg_tablespace; they just > can't contribute more than a dozen or so characters to file pathnames > under the implementation I'm proposing. That doesn't seem too > unreasonable; the pathname part can be some sort of abbreviated name. Since the abbreviated name is really only used internally it might as well be the oid. Otherwise you create a weird functional dependency like the pg_shadow.usesysid field that's just an extra layer of maintenance. > this implementation mechanism will support either policy choice --- > original relname in the filename, or just a numeric ID for the > filename But when you look at a file name `12345_accounts_recei' you know neither * whether the table name was really `accounts_recei' or whether the name was truncated * whether the table still has that name, whatever it was * what table this is at all So in the aggregate you really know less than nothing. :-) > > Why not use OID.[SEGMENT.]VERSION for the physical relname (different > > order possible)? > > Doesn't give you a manageable way to split segments across different > disks. Okay, so maybe ${base}/TABLESPACEOID/SEGMENT/RELOID.VERSION. This doesn't need any catalog lookup outside of pg_class, yet it's still easy to resolve to human-readable names by simple admin tools (SELECT * FROM pg_foo WHERE oid = xxx). VERSION would be unique within a conceptual relation, so you could even see how many times the relation was altered in major ways (kind of). -- Peter Eisentraut Sernanders väg 10:115 peter_e@gmx.net 75262 Uppsala http://yi.org/peter-e/ Sweden
> Tom Lane wrote: > So > > the thing would act a lot like the original-relname-plus-OID variant > > that's been discussed so far. (Original relname because ALTER TABLE > > RENAME would *not* change the physical file name. But we could > > think about a form of VACUUM that creates a whole new table by > > versioning, and that would presumably bring the physical name back > > in sync with the logical relname.) > > At least on UNIX, couldn't you use a hard-link and change the name in > pg_class immediately? Let the brain-dead operating systems use the > vacuum method. Yes, we can hard-link, and let vacuum remove the old link. -- Bruce Momjian | http://www.op.net/~candle pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
> -----Original Message----- > From: pgsql-hackers-owner@hub.org [mailto:pgsql-hackers-owner@hub.org]On > Behalf Of Chris Bitmead > > Tom Lane wrote: > > > > Also, you said before that an old relname (after rename) is worse than > > > none at all. I couldn't agree more. > > > > I'm not the one who wants relnames in the physical names ;-). However, > > this implementation mechanism will support either policy choice --- > > original relname in the filename, or just a numeric ID for the filename > > --- and that seems like a good sign to me. > > > > > Why not use OID.[SEGMENT.]VERSION for the physical relname (different > > > order possible)? > > Unless VERSION is globally unique like an oid is, having RELNAME.VERSION > would be a problem if you created a table with the same name as a > recently renamed table. > In my proposal(relname+unique-id),the unique-id is globally unique and relname is only for dba's convenience. I've said many times that we should be free from the rule of file naming as far as possible. I myself don't mind the name of relation files except that they should be globally unique. I had to propose my opinion for file naming because people have been so enthusiastic about globally_not_unique file naming. Regards. Hiroshi Inoue Inoue@tpf.co.jp
Bruce Momjian wrote: > > > Tom Lane wrote: > > So > > > the thing would act a lot like the original-relname-plus-OID variant > > > that's been discussed so far. (Original relname because ALTER TABLE > > > RENAME would *not* change the physical file name. But we could > > > think about a form of VACUUM that creates a whole new table by > > > versioning, and that would presumably bring the physical name back > > > in sync with the logical relname.) > > > > At least on UNIX, couldn't you use a hard-link and change the name in > > pg_class immediately? Let the brain-dead operating systems use the > > vacuum method. > > Yes, we can hard-link, and let vacuum remove the old link. BTW, how does vacuum know which files are obsolete. Does it just delete files it doesn't know about? What a good application for time travel!
> > Yes, we can hard-link, and let vacuum remove the old link. > > BTW, how does vacuum know which files are obsolete. Does it just delete > files it doesn't know about? > > What a good application for time travel! > I assume it removes files with oid's that match pg_class but who's file names do not match. -- Bruce Momjian | http://www.op.net/~candle pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
> -----Original Message----- > From: pgsql-hackers-owner@hub.org [mailto:pgsql-hackers-owner@hub.org]On > Behalf Of Peter Eisentraut > > Tom Lane writes: > > > I don't think it's a good idea to have to consult pg_tablespace to find > > out where a table actually is --- I think the pathname (or smgr access > > token as Ross would call it ;-)) ought to be determinable from just the > > pg_class entry. > > That's why I suggested the table space oid. That would be readily > available from pg_class. > It seems to me that the following 1)2) has always been mixed up. IMHO,they should be distinguished clearly. 1) Where the table is stored Currently PostgreSQL relies on relname -> filename mapping rule to access *existent* relationsand doesn't have this information in its database. Our(Tom,Ross,me) proposal is to keep the information(token)in pg_class and provide a standard transactional control mechanism for the change of table file allocation.By doing it we would be able to be free from table allocation(naming) rule. Isn't it a kind of thing why wehaven't had it from the first ? 2) Where to store the table Yes,TABLE(DATA)SPACE should encapsulate this concept. I want the decision about 1) first. Ross has already tried it without 2). Comments ? As for 2) every one seems to have each opinion and the discussion has always been divergent. Please don't discard 1) together. Regards. Hiroshi Inoue Inoue@tpf.co.jp
On Fri, 16 Jun 2000, Tom Lane wrote: > Chris Bitmead <chris@bitmead.com> writes: > > At least on UNIX, couldn't you use a hard-link and change the name in > > pg_class immediately? Let the brain-dead operating systems use the > > vacuum method. > > Hmm ... maybe, but it doesn't seem worth the portability headache to > me. We do have an NT port that we don't want to break, and I don't > think RENAME TABLE is worth the trouble of testing/supporting two > implementations. > > Even on Unix, aren't there filesystems that don't do hard links? > Not that I'd recommend running Postgres on such a volume, but... tTo the best of my knowledge, its only symlinks that aren't (weren't?) universally supported ... somehow, I believe taht even extends to NT ...