OK, OK, Hiroshi's right: use a seperately-generated filename - Mailing list pgsql-hackers

From Tom Lane
Subject OK, OK, Hiroshi's right: use a seperately-generated filename
Date
Msg-id 7942.961177897@sss.pgh.pa.us
Whole thread Raw
Responses RE: OK, OK, Hiroshi's right: use a seperately-generated filename
Re: OK, OK, Hiroshi's right: use a seperately-generated filename
List pgsql-hackers
After further thought I think there's a lot of merit in Hiroshi's
opinion that physical file names should not be tied to relation OID.
If we use a separately generated value for the file name, we can
solve a lot of problems pretty nicely by means of "table versioning".

For example: VACUUM can't compact indexes at the moment, and what it
does do (scan the index and delete unused entries) is really slow.
The right thing to do is for it to generate an all-new index file,
but how do we do that without creating a risk of leaving the index
corrupted if we crash partway through?  The answer is to build the
new index in a new physical file.  But how do we install the new
file as the real index atomically, when it might span multiple
segments?  If the physical file name is decoupled from the relation's
name *and* OID then there is no problem: the atomic event that makes
the new file(s) the real table contents is the commit of the new
pg_class row with the new value for the physical filename.

Aside from possible improvements in VACUUM, this would let us do a
robust implementation of CLUSTER, and we could do the "really change
the table" variant of ALTER TABLE DROP COLUMN the same way if anyone
wants to do it.

The only cost is that we need an additional column in pg_class to
hold the physical file name.  That's not so bad, especially when
you remember that we'd surely need to add something to pg_class for
tablespace support anyway.

If we bite that bullet, then we could also do something to satisfy
Bruce about having legible file names ;-).  The column in pg_class
could perfectly well be a string, not a pure number, and that means
that we can throw in the relname (truncated to fit of course).  So
the thing would act a lot like the original-relname-plus-OID variant
that's been discussed so far.  (Original relname because ALTER TABLE
RENAME would *not* change the physical file name.  But we could
think about a form of VACUUM that creates a whole new table by
versioning, and that would presumably bring the physical name back
in sync with the logical relname.)

Here is a sketch of a concrete proposal.  I see no need to have
separate pg_class columns for tablespace and physical relname;
instead, I suggest there be a column of type NAME that is the
file pathname (relative to the database directory).  Further,
instead of the existing convention of appending .N to the base
file name to make extension segment names, I propose that we
always have a segment number in the physical file name, and that
the pg_class entry be required to contain a "%d" somewhere that
indicates where.  The actual filename is manufactured bysprintf(tempbuf, value_from_pg_class_column, segment_number);

As an example, the arrangement I was suggesting earlier today
about segments in different subdirectories of a tablespace
could be implemented by assigning physical filenames like
tablespace/%d/12345_relname

where the 12345 is a value generated separately from the table's OID.
(We would still use the OID counter to produce these numbers, and
in fact there's no reason not to use the table's OID as the initial
unique ID for the physical filename.  The point is just that the
physical filename doesn't have to remain forever equal to the
relation's OID.)

If we use type NAME for this string then the tablespace part of the path
would have to be kept to no more than ~ 15 characters, but that seems
workable enough.  (Anybody who really didn't like that could recompile
with larger NAMEDATALEN.  Doesn't seem worth inventing a separate type.)

As Hiroshi pointed out, one of the best aspects of this approach
is that the physical table layout policy doesn't have to be hard-wired
into low-level file access routines.  The low-level routines don't
need to know much of anything about the format of the pathname,
they just stuff in the right segment number and use the name.  The
layout policy need only be known to one single routine that generates
the strings that go into pg_class.  So it'd be really easy to change.

One thing we'd have to work out is that the critical system tables
(eg, pg_class itself, as well as its indexes) would have to have
predictable physical names.  Otherwise there's no way for a new
backend to bootstrap itself up ... it can't very well read pg_class
to find out where pg_class is.  A brute-force solution is to forbid
reversioning of the critical tables, but I suspect we can find a
less restrictive answer.

This seems like it'd satisfy all the concerns that have been raised.
Comments?
        regards, tom lane


pgsql-hackers by date:

Previous
From: Ed Loehr
Date:
Subject: planner question re index vs seqscan
Next
From: "G. Anthony Reina"
Date:
Subject: Why does cluster need the indexname?