OK, OK, Hiroshi's right: use a seperately-generated filename - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | OK, OK, Hiroshi's right: use a seperately-generated filename |
Date | |
Msg-id | 7942.961177897@sss.pgh.pa.us Whole thread Raw |
Responses |
RE: OK, OK, Hiroshi's right: use a seperately-generated filename
Re: OK, OK, Hiroshi's right: use a seperately-generated filename |
List | pgsql-hackers |
After further thought I think there's a lot of merit in Hiroshi's opinion that physical file names should not be tied to relation OID. If we use a separately generated value for the file name, we can solve a lot of problems pretty nicely by means of "table versioning". For example: VACUUM can't compact indexes at the moment, and what it does do (scan the index and delete unused entries) is really slow. The right thing to do is for it to generate an all-new index file, but how do we do that without creating a risk of leaving the index corrupted if we crash partway through? The answer is to build the new index in a new physical file. But how do we install the new file as the real index atomically, when it might span multiple segments? If the physical file name is decoupled from the relation's name *and* OID then there is no problem: the atomic event that makes the new file(s) the real table contents is the commit of the new pg_class row with the new value for the physical filename. Aside from possible improvements in VACUUM, this would let us do a robust implementation of CLUSTER, and we could do the "really change the table" variant of ALTER TABLE DROP COLUMN the same way if anyone wants to do it. The only cost is that we need an additional column in pg_class to hold the physical file name. That's not so bad, especially when you remember that we'd surely need to add something to pg_class for tablespace support anyway. If we bite that bullet, then we could also do something to satisfy Bruce about having legible file names ;-). The column in pg_class could perfectly well be a string, not a pure number, and that means that we can throw in the relname (truncated to fit of course). So the thing would act a lot like the original-relname-plus-OID variant that's been discussed so far. (Original relname because ALTER TABLE RENAME would *not* change the physical file name. But we could think about a form of VACUUM that creates a whole new table by versioning, and that would presumably bring the physical name back in sync with the logical relname.) Here is a sketch of a concrete proposal. I see no need to have separate pg_class columns for tablespace and physical relname; instead, I suggest there be a column of type NAME that is the file pathname (relative to the database directory). Further, instead of the existing convention of appending .N to the base file name to make extension segment names, I propose that we always have a segment number in the physical file name, and that the pg_class entry be required to contain a "%d" somewhere that indicates where. The actual filename is manufactured bysprintf(tempbuf, value_from_pg_class_column, segment_number); As an example, the arrangement I was suggesting earlier today about segments in different subdirectories of a tablespace could be implemented by assigning physical filenames like tablespace/%d/12345_relname where the 12345 is a value generated separately from the table's OID. (We would still use the OID counter to produce these numbers, and in fact there's no reason not to use the table's OID as the initial unique ID for the physical filename. The point is just that the physical filename doesn't have to remain forever equal to the relation's OID.) If we use type NAME for this string then the tablespace part of the path would have to be kept to no more than ~ 15 characters, but that seems workable enough. (Anybody who really didn't like that could recompile with larger NAMEDATALEN. Doesn't seem worth inventing a separate type.) As Hiroshi pointed out, one of the best aspects of this approach is that the physical table layout policy doesn't have to be hard-wired into low-level file access routines. The low-level routines don't need to know much of anything about the format of the pathname, they just stuff in the right segment number and use the name. The layout policy need only be known to one single routine that generates the strings that go into pg_class. So it'd be really easy to change. One thing we'd have to work out is that the critical system tables (eg, pg_class itself, as well as its indexes) would have to have predictable physical names. Otherwise there's no way for a new backend to bootstrap itself up ... it can't very well read pg_class to find out where pg_class is. A brute-force solution is to forbid reversioning of the critical tables, but I suspect we can find a less restrictive answer. This seems like it'd satisfy all the concerns that have been raised. Comments? regards, tom lane
pgsql-hackers by date: