Re: Storage Location / Tablespaces (try 3) - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Storage Location / Tablespaces (try 3)
Date
Msg-id 15707.1015541196@sss.pgh.pa.us
In response to Re: Storage Location / Tablespaces (try 3)  ("Jim Buttafuoco" <jim@buttafuoco.net>)
List pgsql-hackers
"Jim Buttafuoco" <jim@buttafuoco.net> writes:
> My first try passed the tablespace OID around but someone pointed out that the
> WAL code doesn't know what the tablespace OID is or what its location is.

The low-level file access code (including WAL references) names tables
by two OIDs, which currently are database OID and relfilenode (the
latter is NOT to be considered equivalent to table OID, even though it
presently always is equal).

I believe that the correct implementation approach is to revise things
so that the low-level name of a table is tablespace OID + relfilenode;
this physical table name would in concept be completely distinct from
the logical table identification (database OID + table OID).  The file
reference path would become something like
"$PGDATA/base/tablespaceoid/relfilenode", where tablespaceoid might
reference a symlink to a directory instead of a plain directory.
Tablespace management then consists of setting up those symlinks
correctly, and there is essentially zero impact on the low-level access
code.
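
A minimal C sketch of the proposed physical naming, assuming Oid is a 32-bit
unsigned integer. PhysicalTableName and physical_table_path() are hypothetical
names for illustration, not existing backend code; the point is only that the
file path is computed from tablespace OID + relfilenode, with no reference to
the table OID or to any relcache entry.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Stand-in for PostgreSQL's Oid (a 32-bit unsigned integer). */
typedef unsigned int Oid;

/* Hypothetical physical table name: tablespace OID plus relfilenode. */
typedef struct PhysicalTableName
{
    Oid tablespace;   /* $PGDATA/base/<tablespace> may be a symlink */
    Oid relfilenode;  /* file identity; not necessarily equal to the table OID */
} PhysicalTableName;

/*
 * Format "$PGDATA/base/tablespaceoid/relfilenode" into buf.
 * Returns 0 on success, -1 if buf is too small.
 */
static int
physical_table_path(const char *pgdata, PhysicalTableName name,
                    char *buf, size_t buflen)
{
    size_t len;
    int n;

    assert(pgdata != NULL && buf != NULL && buflen > 0);
    len = strlen(pgdata);
    if (len > 0 && pgdata[len - 1] == '/')   /* tolerate a trailing slash */
        len--;
    n = snprintf(buf, buflen, "%.*s/base/%u/%u",
                 (int) len, pgdata, name.tablespace, name.relfilenode);
    return (n < 0 || (size_t) n >= buflen) ? -1 : 0;
}
```

Relocating a tablespace would then mean re-pointing the
$PGDATA/base/tablespaceoid symlink; callers of a function like this would not
change at all.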

The hard part of this is that we are probably being sloppy in some
places about the difference between physical and logical table
identifications.  Those places will need to be found and fixed.
This needs to happen anyway, of course, since the point of introducing
relfilenode was to allow table versioning, which we still want.

Vadim suggested long ago that bufmgr, smgr, and below should have
nothing to do with referencing files by relcache entries; they should
only deal in physical file identifiers.  That requires some tedious but
(in principle) straightforward API changes.
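
To make the physical/logical split concrete, here is a hedged sketch.
LogicalTableId, CatalogEntry, lookup_physical() and assign_new_relfilenode()
are all hypothetical. It shows that a table rewrite changes only the
relfilenode while the table OID stays fixed, and that code below the catalog
lookup (bufmgr, smgr, WAL) would only ever be handed the physical name.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int Oid;   /* stand-in for PostgreSQL's 32-bit Oid */

/* Logical identity: what the catalogs and upper layers talk about. */
typedef struct LogicalTableId
{
    Oid database;
    Oid table;
} LogicalTableId;

/* Physical identity: all that the low-level file access code would see. */
typedef struct PhysicalTableName
{
    Oid tablespace;
    Oid relfilenode;
} PhysicalTableName;

/* Hypothetical catalog row tying the two identities together. */
typedef struct CatalogEntry
{
    LogicalTableId    logical;
    PhysicalTableName physical;
} CatalogEntry;

/* The one place where a logical identity is translated to a physical one. */
static const PhysicalTableName *
lookup_physical(const CatalogEntry *cat, size_t n, LogicalTableId id)
{
    for (size_t i = 0; i < n; i++)
        if (cat[i].logical.database == id.database &&
            cat[i].logical.table == id.table)
            return &cat[i].physical;
    return NULL;
}

/*
 * Table versioning: a rewrite gives the table a new relfilenode while its
 * table OID, and every catalog reference to it, stays unchanged.  Returns
 * the old physical name so the caller can remove that file later.
 */
static PhysicalTableName
assign_new_relfilenode(CatalogEntry *entry, Oid new_relfilenode)
{
    PhysicalTableName old = entry->physical;

    assert(new_relfilenode != old.relfilenode);
    entry->physical.relfilenode = new_relfilenode;
    return old;
}
```

In this sketch relfilenode starts out equal to the table OID, as it does
today, and diverges after the first rewrite; that divergence is exactly where
code that is sloppy about the two identifications would break.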

BTW, if tablespaces can be shared by databases then DROP DATABASE
becomes rather tricky: how do you zap the correct files out of a shared
tablespace, keeping in mind that you are not logged into the doomed
database and can't look at its catalogs?  The best idea I've seen for
this so far is:

1. Access path for tables is really $PGDATA/base/databaseoid/tablespaceoid/relfilenode.
(BTW, we could save some work if we chdir'd into
$PGDATA/base/databaseoid at backend start and then used only relative
tablespaceoid/relfilenode paths.  Right now we tend to use absolute
paths because the bootstrap code doesn't do that chdir, which seems
like a stupid solution...)

2. A shared tablespace directory contains a subdirectory for each database
that has files in the tablespace.  Thus, the actual filesystem location
of a table is something like <tablespace>/databaseoid/relfilenode.
The symlink from a database's $PGDATA/base/databaseoid/ directory to
the tablespace points at <tablespace>/databaseoid.  The first attempt to
create a table in a tablespace from a particular database will create
the hard subdirectory and set up the symlink; or perhaps that should be
done by an explicit tablespace management operation to "connect" the
database to the tablespace.

3. To drop a database, we examine the symlinks in its
$PGDATA/base/databaseoid/ and rm -rf each referenced tablespace
subdirectory before rm -rf'ing $PGDATA/base/databaseoid.
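
Steps 2 and 3 above could look roughly like the following POSIX C sketch. It
assumes absolute symlink targets, fixed-size path buffers and no concurrent
access; all function names are hypothetical and nothing here is taken from the
backend. connect_database_to_tablespace() is the explicit "connect" operation
from step 2, and drop_database_files() is step 3.

```c
#define _POSIX_C_SOURCE 200809L   /* lstat, readlink, symlink */
#include <assert.h>
#include <dirent.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define PATHBUF 1024        /* hypothetical fixed path-buffer size */
typedef unsigned int Oid;   /* stand-in for PostgreSQL's 32-bit Oid */

/* Join "a/b" into buf; returns 1 on success, 0 if it would not fit. */
static int
path_join(char *buf, size_t len, const char *a, const char *b)
{
    int n = snprintf(buf, len, "%s/%s", a, b);

    return n >= 0 && (size_t) n < len;
}

/* rm -rf that never follows symlinks: a link is unlinked, not traversed. */
static int
remove_tree(const char *path)
{
    struct stat st;
    DIR *dir;
    struct dirent *de;
    int rc = 0;

    if (lstat(path, &st) != 0)
        return errno == ENOENT ? 0 : -1;
    if (!S_ISDIR(st.st_mode))
        return unlink(path);
    if ((dir = opendir(path)) == NULL)
        return -1;
    while (rc == 0 && (de = readdir(dir)) != NULL)
    {
        char child[PATHBUF];

        if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
            continue;
        rc = path_join(child, sizeof child, path, de->d_name) ? remove_tree(child) : -1;
    }
    closedir(dir);
    return rc == 0 ? rmdir(path) : -1;
}

/* Step 2: create <tablespace>/dboid and link $PGDATA/base/dboid/spcoid to it. */
static int
connect_database_to_tablespace(const char *pgdata, const char *spcdir,
                               Oid dboid, Oid spcoid)
{
    char target[PATHBUF], linkpath[PATHBUF];

    assert(spcdir[0] == '/');   /* absolute target, so readlink() output is usable as-is */
    if (snprintf(target, sizeof target, "%s/%u", spcdir, dboid) >= (int) sizeof target ||
        snprintf(linkpath, sizeof linkpath, "%s/base/%u/%u", pgdata, dboid, spcoid) >= (int) sizeof linkpath)
        return -1;
    if (mkdir(target, 0700) != 0 && errno != EEXIST)
        return -1;
    return symlink(target, linkpath);
}

/* Step 3: rm -rf each symlinked tablespace subdirectory, then the database directory. */
static int
drop_database_files(const char *pgdata, Oid dboid)
{
    char dbdir[PATHBUF];
    DIR *dir;
    struct dirent *de;
    int rc = 0;

    if (snprintf(dbdir, sizeof dbdir, "%s/base/%u", pgdata, dboid) >= (int) sizeof dbdir ||
        (dir = opendir(dbdir)) == NULL)
        return -1;
    while (rc == 0 && (de = readdir(dir)) != NULL)
    {
        char entry[PATHBUF], target[PATHBUF];
        struct stat st;
        ssize_t n;

        if (!path_join(entry, sizeof entry, dbdir, de->d_name) || lstat(entry, &st) != 0)
            rc = -1;
        else if (S_ISLNK(st.st_mode))
        {
            n = readlink(entry, target, sizeof target);
            if (n < 0 || (size_t) n >= sizeof target)
                rc = -1;
            else
            {
                target[n] = '\0';   /* readlink() does not NUL-terminate */
                rc = remove_tree(target);
            }
        }
        /* plain files and directories go with the final remove_tree() */
    }
    closedir(dir);
    return rc == 0 ? remove_tree(dbdir) : -1;
}
```

Note that drop_database_files() needs nothing but $PGDATA and the database
OID: it never consults the doomed database's catalogs, and other databases'
subdirectories in the same shared tablespace are left alone.
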
        regards, tom lane

