Thread: OK, OK, Hiroshi's right: use a seperately-generated filename

OK, OK, Hiroshi's right: use a seperately-generated filename

From
Tom Lane
Date:
After further thought I think there's a lot of merit in Hiroshi's
opinion that physical file names should not be tied to relation OID.
If we use a separately generated value for the file name, we can
solve a lot of problems pretty nicely by means of "table versioning".

For example: VACUUM can't compact indexes at the moment, and what it
does do (scan the index and delete unused entries) is really slow.
The right thing to do is for it to generate an all-new index file,
but how do we do that without creating a risk of leaving the index
corrupted if we crash partway through?  The answer is to build the
new index in a new physical file.  But how do we install the new
file as the real index atomically, when it might span multiple
segments?  If the physical file name is decoupled from the relation's
name *and* OID then there is no problem: the atomic event that makes
the new file(s) the real table contents is the commit of the new
pg_class row with the new value for the physical filename.

Aside from possible improvements in VACUUM, this would let us do a
robust implementation of CLUSTER, and we could do the "really change
the table" variant of ALTER TABLE DROP COLUMN the same way if anyone
wants to do it.

The only cost is that we need an additional column in pg_class to
hold the physical file name.  That's not so bad, especially when
you remember that we'd surely need to add something to pg_class for
tablespace support anyway.

If we bite that bullet, then we could also do something to satisfy
Bruce about having legible file names ;-).  The column in pg_class
could perfectly well be a string, not a pure number, and that means
that we can throw in the relname (truncated to fit of course).  So
the thing would act a lot like the original-relname-plus-OID variant
that's been discussed so far.  (Original relname because ALTER TABLE
RENAME would *not* change the physical file name.  But we could
think about a form of VACUUM that creates a whole new table by
versioning, and that would presumably bring the physical name back
in sync with the logical relname.)

Here is a sketch of a concrete proposal.  I see no need to have
separate pg_class columns for tablespace and physical relname;
instead, I suggest there be a column of type NAME that is the
file pathname (relative to the database directory).  Further,
instead of the existing convention of appending .N to the base
file name to make extension segment names, I propose that we
always have a segment number in the physical file name, and that
the pg_class entry be required to contain a "%d" somewhere that
indicates where.  The actual filename is manufactured bysprintf(tempbuf, value_from_pg_class_column, segment_number);

As an example, the arrangement I was suggesting earlier today
about segments in different subdirectories of a tablespace
could be implemented by assigning physical filenames like
tablespace/%d/12345_relname

where the 12345 is a value generated separately from the table's OID.
(We would still use the OID counter to produce these numbers, and
in fact there's no reason not to use the table's OID as the initial
unique ID for the physical filename.  The point is just that the
physical filename doesn't have to remain forever equal to the
relation's OID.)

If we use type NAME for this string then the tablespace part of the path
would have to be kept to no more than ~ 15 characters, but that seems
workable enough.  (Anybody who really didn't like that could recompile
with larger NAMEDATALEN.  Doesn't seem worth inventing a separate type.)

As Hiroshi pointed out, one of the best aspects of this approach
is that the physical table layout policy doesn't have to be hard-wired
into low-level file access routines.  The low-level routines don't
need to know much of anything about the format of the pathname,
they just stuff in the right segment number and use the name.  The
layout policy need only be known to one single routine that generates
the strings that go into pg_class.  So it'd be really easy to change.

One thing we'd have to work out is that the critical system tables
(eg, pg_class itself, as well as its indexes) would have to have
predictable physical names.  Otherwise there's no way for a new
backend to bootstrap itself up ... it can't very well read pg_class
to find out where pg_class is.  A brute-force solution is to forbid
reversioning of the critical tables, but I suspect we can find a
less restrictive answer.

This seems like it'd satisfy all the concerns that have been raised.
Comments?
        regards, tom lane


Re: OK, OK, Hiroshi's right: use a seperately-generated filename

From
Chris Bitmead
Date:
Tom Lane wrote: So
> the thing would act a lot like the original-relname-plus-OID variant
> that's been discussed so far.  (Original relname because ALTER TABLE
> RENAME would *not* change the physical file name.  But we could
> think about a form of VACUUM that creates a whole new table by
> versioning, and that would presumably bring the physical name back
> in sync with the logical relname.)

At least on UNIX, couldn't you use a hard-link and change the name in
pg_class immediately? Let the brain-dead operating systems use the
vacuum method.


Re: OK, OK, Hiroshi's right: use a seperately-generated filename

From
Tom Lane
Date:
Chris Bitmead <chris@bitmead.com> writes:
> At least on UNIX, couldn't you use a hard-link and change the name in
> pg_class immediately? Let the brain-dead operating systems use the
> vacuum method.

Hmm ... maybe, but it doesn't seem worth the portability headache to
me.  We do have an NT port that we don't want to break, and I don't
think RENAME TABLE is worth the trouble of testing/supporting two
implementations.

Even on Unix, aren't there filesystems that don't do hard links?
Not that I'd recommend running Postgres on such a volume, but...
        regards, tom lane


RE: OK, OK, Hiroshi's right: use a seperately-generated filename

From
"Hiroshi Inoue"
Date:
> -----Original Message-----
> From: pgsql-hackers-owner@hub.org [mailto:pgsql-hackers-owner@hub.org]On
> Behalf Of Tom Lane
> 
> After further thought I think there's a lot of merit in Hiroshi's
> opinion that physical file names should not be tied to relation OID.
> If we use a separately generated value for the file name, we can
> solve a lot of problems pretty nicely by means of "table versioning".
> 
> For example: VACUUM can't compact indexes at the moment, and what it
> does do (scan the index and delete unused entries) is really slow.
> The right thing to do is for it to generate an all-new index file,
> but how do we do that without creating a risk of leaving the index
> corrupted if we crash partway through?  The answer is to build the
> new index in a new physical file.  But how do we install the new
> file as the real index atomically, when it might span multiple
> segments?  If the physical file name is decoupled from the relation's
> name *and* OID then there is no problem: the atomic event that makes
> the new file(s) the real table contents is the commit of the new
> pg_class row with the new value for the physical filename.
> 
> Aside from possible improvements in VACUUM, this would let us do a
> robust implementation of CLUSTER, and we could do the "really change
> the table" variant of ALTER TABLE DROP COLUMN the same way if anyone
> wants to do it.
>

Yes,I've wondered how do we implement column_is_really_dropped 
ALTER TABLE DROP COLUMN feature without this kind of mechanism.

> The only cost is that we need an additional column in pg_class to
> hold the physical file name.  That's not so bad, especially when
> you remember that we'd surely need to add something to pg_class for
> tablespace support anyway.
> 
> If we bite that bullet, then we could also do something to satisfy
> Bruce about having legible file names ;-).  The column in pg_class
> could perfectly well be a string, not a pure number, and that means
> that we can throw in the relname (truncated to fit of course).  So
> the thing would act a lot like the original-relname-plus-OID variant
> that's been discussed so far.  (Original relname because ALTER TABLE
> RENAME would *not* change the physical file name.  But we could
> think about a form of VACUUM that creates a whole new table by
> versioning, and that would presumably bring the physical name back
> in sync with the logical relname.)
> 
> As Hiroshi pointed out, one of the best aspects of this approach
> is that the physical table layout policy doesn't have to be hard-wired
> into low-level file access routines.  The low-level routines don't
> need to know much of anything about the format of the pathname,
> they just stuff in the right segment number and use the name.  The
> layout policy need only be known to one single routine that generates
> the strings that go into pg_class.  So it'd be really easy to change.
>

Ross's approach is fundamentally same though he is using relname+OID
naming rule.  I've said his trial is most practical one.
> One thing we'd have to work out is that the critical system tables
> (eg, pg_class itself, as well as its indexes) would have to have
> predictable physical names.

The only limitation of the relation filename is the uniqueness.
So it doesn't introduce any inconsistency that system tables
have fixed name.
As for system relations it wouldn't be so bad because CLUSTER/
ALTER TABLE DROP COLUMN ... would be unnecessary(maybe).
But as for system indexes,it is preferable that VACUUM/REINDEX
could rebuild them safely. System indexes never shrink currently.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp


Re: OK, OK, Hiroshi's right: use a seperately-generated filename

From
Peter Eisentraut
Date:
Tom Lane writes:

>     tablespace/%d/12345_relname

Throwing table spaces and relation names into one pot doesn't excite me
very much. For example, before long people will want to

* Query what tables are in what space (without using string operations)
Consider for example creating a new table and choosing where to put it.

* Rename table spaces

* Assign attributes of some sort to table spaces (permissions, etc.)

* Use table space names with more than 15 characters. :)

Somehow table spaces need to be catalogued. You could still make the
physical file name 'tablespaceoid/rest' without actually having to look up
anything, although that depends on your symlink idea which is still under
discussion.

Then, why are all nth segments of tables in one directory in that
proposal?

Also, you said before that an old relname (after rename) is worse than
none at all. I couldn't agree more.

Why not use OID.[SEGMENT.]VERSION for the physical relname (different
order possible)? That way you at least have some guaranteed correspondence
between files and tables. Version could probably be an INT2, so you save
some space.


-- 
Peter Eisentraut                  Sernanders väg 10:115
peter_e@gmx.net                   75262 Uppsala
http://yi.org/peter-e/            Sweden



Re: OK, OK, Hiroshi's right: use a seperately-generated filename

From
Tom Lane
Date:
"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
>> One thing we'd have to work out is that the critical system tables
>> (eg, pg_class itself, as well as its indexes) would have to have
>> predictable physical names.

> The only limitation of the relation filename is the uniqueness.
> So it doesn't introduce any inconsistency that system tables
> have fixed name.
> As for system relations it wouldn't be so bad because CLUSTER/
> ALTER TABLE DROP COLUMN ... would be unnecessary(maybe).
> But as for system indexes,it is preferable that VACUUM/REINDEX
> could rebuild them safely. System indexes never shrink currently.

Right, it's the index-shrinking business that has me worried.
Most of the other reasons for swapping in a new file don't apply
to system tables, but that one does.

One possibility is to say that system *tables* can't be reversioned
(at least not the critical ones) but system *indexes* can be.
Then we'd have to use your ignore-system-indexes stuff during backend
startup, until we'd found out where the indexes are.  Might be too big
a time penalty however... not sure.  Shared cache inval of a system
index could be a little tricky too; I don't think the catcache routines
are prepared to fall back to non-index scan are they?

On the whole it might be better to cheat by using a side data structure
like the pg_internal.init file, that a backend could consult to find out
where the indexes are now.
        regards, tom lane


Re: OK, OK, Hiroshi's right: use a seperately-generated filename

From
Tom Lane
Date:
Peter Eisentraut <peter_e@gmx.net> writes:
> Somehow table spaces need to be catalogued.

Sure.  Undoubtedly there'll be a pg_tablespace table somewhere.  However,
I don't think it's a good idea to have to consult pg_tablespace to find
out where a table actually is --- I think the pathname (or smgr access
token as Ross would call it ;-)) ought to be determinable from just the
pg_class entry.

It would probably be best to expend an additional 4 bytes per pg_class
entry to record the OID of the table's tablespace, just so you could do
joins easily without having to do string matching (and assume an
uncomfortable amount about the format of the pathname).  Having the
pathname in the pg_class entry too represents some denormalization,
but I think it's the safest way.

> For example, before long people will want to
> * Query what tables are in what space (without using string operations)
> * Rename table spaces
> * Assign attributes of some sort to table spaces (permissions, etc.)
> * Use table space names with more than 15 characters. :)

Tablespaces can have logical names stored in pg_tablespace; they just
can't contribute more than a dozen or so characters to file pathnames
under the implementation I'm proposing.  That doesn't seem too
unreasonable; the pathname part can be some sort of abbreviated name.
The alternative is to enlarge smgr access tokens to something like 64
bytes.  I'd rather keep them as compact as we can, since we're going to
need to store them in places like the bufmgr's shared-buffer headers
(remember the blind write problem).

> Then, why are all nth segments of tables in one directory in that
> proposal?

It's better than *all* segments of tables in one directory, which is
what you get if the segment number is just a component of a flat file
name.  We have to have a better answer than that for people who need
to cope with tables bigger than a disk.  Perhaps someone can think of a
better answer than subdirectory-per-segment-number, but I think that
will work well enough; and it doesn't add any complexity for file
access.

> Also, you said before that an old relname (after rename) is worse than
> none at all. I couldn't agree more.

I'm not the one who wants relnames in the physical names ;-).  However,
this implementation mechanism will support either policy choice ---
original relname in the filename, or just a numeric ID for the filename
--- and that seems like a good sign to me.

> Why not use OID.[SEGMENT.]VERSION for the physical relname (different
> order possible)?

Doesn't give you a manageable way to split segments across different
disks.
        regards, tom lane


Re: OK, OK, Hiroshi's right: use a seperately-generated filename

From
Chris Bitmead
Date:
Tom Lane wrote:

> > Also, you said before that an old relname (after rename) is worse than
> > none at all. I couldn't agree more.
> 
> I'm not the one who wants relnames in the physical names ;-).  However,
> this implementation mechanism will support either policy choice ---
> original relname in the filename, or just a numeric ID for the filename
> --- and that seems like a good sign to me.
> 
> > Why not use OID.[SEGMENT.]VERSION for the physical relname (different
> > order possible)?

Unless VERSION is globally unique like an oid is, having RELNAME.VERSION
would be a problem if you created a table with the same name as a 
recently renamed table.


Re: OK, OK, Hiroshi's right: use a seperately-generated filename

From
Peter Eisentraut
Date:
Tom Lane writes:

> I don't think it's a good idea to have to consult pg_tablespace to find
> out where a table actually is --- I think the pathname (or smgr access
> token as Ross would call it ;-)) ought to be determinable from just the
> pg_class entry.

That's why I suggested the table space oid. That would be readily
available from pg_class.


> Tablespaces can have logical names stored in pg_tablespace; they just
> can't contribute more than a dozen or so characters to file pathnames
> under the implementation I'm proposing.  That doesn't seem too
> unreasonable; the pathname part can be some sort of abbreviated name.

Since the abbreviated name is really only used internally it might as well
be the oid. Otherwise you create a weird functional dependency like the
pg_shadow.usesysid field that's just an extra layer of maintenance.


> this implementation mechanism will support either policy choice ---
> original relname in the filename, or just a numeric ID for the
> filename

But when you look at a file name `12345_accounts_recei' you know neither

* whether the table name was really `accounts_recei' or whether the name
was truncated

* whether the table still has that name, whatever it was

* what table this is at all

So in the aggregate you really know less than nothing. :-)


> > Why not use OID.[SEGMENT.]VERSION for the physical relname (different
> > order possible)?
> 
> Doesn't give you a manageable way to split segments across different
> disks.

Okay, so maybe ${base}/TABLESPACEOID/SEGMENT/RELOID.VERSION.

This doesn't need any catalog lookup outside of pg_class, yet it's still
easy to resolve to human-readable names by simple admin tools (SELECT *
FROM pg_foo WHERE oid = xxx). VERSION would be unique within a conceptual
relation, so you could even see how many times the relation was altered in
major ways (kind of).


-- 
Peter Eisentraut                  Sernanders väg 10:115
peter_e@gmx.net                   75262 Uppsala
http://yi.org/peter-e/            Sweden



Re: OK, OK, Hiroshi's right: use a seperately-generated filename

From
Bruce Momjian
Date:
> Tom Lane wrote:
>   So
> > the thing would act a lot like the original-relname-plus-OID variant
> > that's been discussed so far.  (Original relname because ALTER TABLE
> > RENAME would *not* change the physical file name.  But we could
> > think about a form of VACUUM that creates a whole new table by
> > versioning, and that would presumably bring the physical name back
> > in sync with the logical relname.)
> 
> At least on UNIX, couldn't you use a hard-link and change the name in
> pg_class immediately? Let the brain-dead operating systems use the
> vacuum method.

Yes, we can hard-link, and let vacuum remove the old link.

--  Bruce Momjian                        |  http://www.op.net/~candle pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


RE: OK, OK, Hiroshi's right: use a seperately-generated filename

From
"Hiroshi Inoue"
Date:
> -----Original Message-----
> From: pgsql-hackers-owner@hub.org [mailto:pgsql-hackers-owner@hub.org]On
> Behalf Of Chris Bitmead
>
> Tom Lane wrote:
>
> > > Also, you said before that an old relname (after rename) is worse than
> > > none at all. I couldn't agree more.
> >
> > I'm not the one who wants relnames in the physical names ;-).  However,
> > this implementation mechanism will support either policy choice ---
> > original relname in the filename, or just a numeric ID for the filename
> > --- and that seems like a good sign to me.
> >
> > > Why not use OID.[SEGMENT.]VERSION for the physical relname (different
> > > order possible)?
>
> Unless VERSION is globally unique like an oid is, having RELNAME.VERSION
> would be a problem if you created a table with the same name as a
> recently renamed table.
>

In my proposal(relname+unique-id),the unique-id is globally unique
and relname is only for dba's convenience. I've said many times that
we should be free from the rule of file naming as far as possible.
I myself don't mind the name of relation files except that they should
be globally unique. I had to propose my opinion for file naming
because people have been so enthusiastic about globally_not_unique
file naming.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp



Re: OK, OK, Hiroshi's right: use a seperately-generated filename

From
Chris Bitmead
Date:
Bruce Momjian wrote:
> 
> > Tom Lane wrote:
> >   So
> > > the thing would act a lot like the original-relname-plus-OID variant
> > > that's been discussed so far.  (Original relname because ALTER TABLE
> > > RENAME would *not* change the physical file name.  But we could
> > > think about a form of VACUUM that creates a whole new table by
> > > versioning, and that would presumably bring the physical name back
> > > in sync with the logical relname.)
> >
> > At least on UNIX, couldn't you use a hard-link and change the name in
> > pg_class immediately? Let the brain-dead operating systems use the
> > vacuum method.
> 
> Yes, we can hard-link, and let vacuum remove the old link.

BTW, how does vacuum know which files are obsolete. Does it just delete
files it doesn't know about?

What a good application for time travel!


Re: OK, OK, Hiroshi's right: use a seperately-generated filename

From
Bruce Momjian
Date:
> > Yes, we can hard-link, and let vacuum remove the old link.
> 
> BTW, how does vacuum know which files are obsolete. Does it just delete
> files it doesn't know about?
> 
> What a good application for time travel!
> 

I assume it removes files with oid's that match pg_class but who's file
names do not match.

--  Bruce Momjian                        |  http://www.op.net/~candle pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


RE: OK, OK, Hiroshi's right: use a seperately-generatedfilename

From
"Hiroshi Inoue"
Date:
> -----Original Message-----
> From: pgsql-hackers-owner@hub.org [mailto:pgsql-hackers-owner@hub.org]On
> Behalf Of Peter Eisentraut
> 
> Tom Lane writes:
> 
> > I don't think it's a good idea to have to consult pg_tablespace to find
> > out where a table actually is --- I think the pathname (or smgr access
> > token as Ross would call it ;-)) ought to be determinable from just the
> > pg_class entry.
> 
> That's why I suggested the table space oid. That would be readily
> available from pg_class.
>

It seems to me that the following 1)2) has always been mixed up.
IMHO,they should be distinguished clearly.

1) Where the table is stored   Currently PostgreSQL relies on relname -> filename mapping   rule to access *existent*
relationsand doesn't have this   information in its database. Our(Tom,Ross,me) proposal is to   keep the
information(token)in pg_class and provide a standard   transactional control mechanism for the change of table file
allocation.By doing it we would be able to be free from table   allocation(naming) rule.   Isn't it a kind of thing why
wehaven't had it from the first ?  
 
2) Where to store the table   Yes,TABLE(DATA)SPACE should encapsulate this concept.
I want the decision about 1) first. Ross has already tried it without
2).

Comments ?

As for 2) every one seems to have each opinion and the discussion
has always been divergent.   Please don't discard 1) together.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp 



Re: OK, OK, Hiroshi's right: use a seperately-generated filename

From
The Hermit Hacker
Date:
On Fri, 16 Jun 2000, Tom Lane wrote:

> Chris Bitmead <chris@bitmead.com> writes:
> > At least on UNIX, couldn't you use a hard-link and change the name in
> > pg_class immediately? Let the brain-dead operating systems use the
> > vacuum method.
> 
> Hmm ... maybe, but it doesn't seem worth the portability headache to
> me.  We do have an NT port that we don't want to break, and I don't
> think RENAME TABLE is worth the trouble of testing/supporting two
> implementations.
> 
> Even on Unix, aren't there filesystems that don't do hard links?
> Not that I'd recommend running Postgres on such a volume, but...

tTo the best of my knowledge, its only symlinks that aren't
(weren't?) universally supported ... somehow, I believe taht even extends
to NT ...