Thread: Storage Location / Tablespaces (try 3)

Storage Location / Tablespaces (try 3)

From
"Jim Buttafuoco"
Date:
Me again, I have some more details on my storage location patch 



This patch would allow the system admin (DBA) to specify the location of
databases, tables/indexes and temporary objects (temp tables and temp sort
space) independent of the database/system default location.  This patch would
replace the current "LOCATION" code.

Please let me know if you have any questions/comments.  I would like to see
this feature make 7.3.  I believe it will take about 1 month of coding and
testing after I get started.

Thanks
Jim

==============================================================================
Storage Location Patch (Try 3)


(If people like TABLESPACE instead of LOCATION then s/LOCATION/TABLESPACE/g
below)


This patch would add the following NEW commands
----------------------------------------------------   CREATE LOCATION name PATH 'dbpath';   DROP   LOCATION name;

where dbpath is any directory that the postgresql backend can write to.
(I know this is how Oracle works, don't know about the other major db systems)

The following NEW GLOBAL system table would be added.  
-----------------------------------------------------
PG_LOCATION
(      LOC_NAME  name,     LOC_PATH  text    -- This should be able to take any path name.
);
(initdb would add (PGDATA,'/usr/local/pgsql/data')

The following system tables would need to be modified
-----------------------------------------------------
PG_DATABASE drop datpath            add  DATA_LOC_NAME  name or DATA_LOC_OID OID              add  INDEX_LOC_NAME name
orINDEX_LOC_OID OID           add  TEMP_LOC_NAME  name or TEMP_LOC_OID OID
 
PG_CLASS to add LOC_NAME name or LOC_OID OID 

DATA_LOC_* and INDEX_LOC_* would default to PGDATA if not specified.

(I like *LOC_NAME better but I believe the rest of the systems tables use OID)


The following command syntax would be modified
------------------------------------------------------
CREATE DATABASE WITH DATA_LOCATION=XXX INDEX_LOCATION=YYY TEMP_LOCATION=ZZZ
CREATE TABLE aaa (...) WITH LOCATION=XXX;
CREATE TABLE bbb (c1 text primary key location CCC) WITH LOCATION=XXX;
CREATE TABLE ccc (c2 text unique location CCC) WITH LOCATION=XXX;
CREATE INDEX XXX on SAMPLE (C2) WITH LOCATION BBB;



Now for an example
------------------------------------------------------
First:   postgresql is installed at /usr/local/pgsql   userid     postgres   the postgres user also is the owner of
/pg01/pg02 /pg03
 

the dba executes the following script
CREATE LOCATION pg01 PATH '/pg01';
CREATE LOCATION pg02 PATH '/pg02';
CREATE LOCATION pg03 PATH '/pg03';
CREATE LOCATION bigdata  PATH '/bigdata';
CREATE LOCATION bigidx  PATH '/bigidx';
\q

PG_LOCATION now has
pg01    | /pg01
pg02    | /pg02
pg03    | /pg03
bigdata | /bigdata
bigidx  | /bigidx

Now the following command is run
CREATE DATABASE jim1 WITH DATA_LOCATION='pg01' INDEX_LOCATION='pg02'
TEMP_LOCATION='pg03'
-- OID of 'jim1' tuple is 1786146

on disk the directories look like this 
/pg01/1786146       <<-- Default DATA Location 
/pg02/1786146       <<-- Default INDEX Location
/pg03/1786146       <<-- Default Temp Location 

All files from the above directories will have symbolic links to
/usr/local/pgsql/data/base/1786146/ 



Now the system will have 1 BIG table that will get its own disk for data and
its own disk for index
create table big (a text,b text ..., primary key (a,b) location 'bigidx');

oid of big table is 1786150
oid of big table primary key index is 1786151

on disk directories look like this
/bigdata/1786146/1786150
/bigidx/1786146/1786151
/usr/local/pgsql/data/base/1786146/1786150 symbolic link to
/bigdata/1786146/1786150
/usr/local/pgsql/data/base/1786146/1786151 symbolic link to
/bigdata/1786146/1786151



The symbolic links will enable the rest of the software to be location
independent.




Re: Storage Location / Tablespaces (try 3)

From
"Zeugswetter Andreas SB SD"
Date:
> (If people like TABLESPACE instead of LOCATION then
> s/LOCATION/TABLESPACE/g
> below)

I like "tablespace" :-)

> This patch would add the following NEW commands
> ----------------------------------------------------
>     CREATE LOCATION name PATH 'dbpath';
>     DROP   LOCATION name;

> The following command syntax would be modified
> ------------------------------------------------------
> CREATE DATABASE WITH DATA_LOCATION=XXX INDEX_LOCATION=YYY
> TEMP_LOCATION=ZZZ
> CREATE TABLE aaa (...) WITH LOCATION=XXX;
> CREATE TABLE bbb (c1 text primary key location CCC) WITH LOCATION=XXX;
> CREATE TABLE ccc (c2 text unique location CCC) WITH LOCATION=XXX;
> CREATE INDEX XXX on SAMPLE (C2) WITH LOCATION BBB;

Sounds great, but shouldn't we use syntax that is already around,
like Oracle's or DB2's or ...

> The symbolic links will enable the rest of the software to be location
> independent.

I see, that this is the least intrusive way, but I am not sure this is the
best way to do it. It would probably be better to pass the Tablespace oid
around (or look it up).

That would also leave the door open for other "Tablespace types" (currently
"Filesystem directory" an OS managed tablespace :-).

Andreas


Re: Storage Location / Tablespaces (try 3)

From
"Jim Buttafuoco"
Date:
Andreas,

My first try passed the tablespace OID arround but someone pointed out the the
WAL code doesn't know what the tablespace OID is or what it's location is. 
This is why I would like to use the symbolic links.  

Tom do you have any ideas on this?

Jim

> > (If people like TABLESPACE instead of LOCATION then 
> > s/LOCATION/TABLESPACE/g
> > below)
> 
> I like "tablespace" :-)
> 
> > This patch would add the following NEW commands
> > ----------------------------------------------------
> >     CREATE LOCATION name PATH 'dbpath';
> >     DROP   LOCATION name;
> 
> > The following command syntax would be modified
> > ------------------------------------------------------
> > CREATE DATABASE WITH DATA_LOCATION=XXX INDEX_LOCATION=YYY 
> > TEMP_LOCATION=ZZZ
> > CREATE TABLE aaa (...) WITH LOCATION=XXX;
> > CREATE TABLE bbb (c1 text primary key location CCC) WITH LOCATION=XXX;
> > CREATE TABLE ccc (c2 text unique location CCC) WITH LOCATION=XXX;
> > CREATE INDEX XXX on SAMPLE (C2) WITH LOCATION BBB;
> 
> Sounds great, but shouldn't we use syntax that is already around,
> like Oracle's or DB2's or ...
> 
> > The symbolic links will enable the rest of the software to be location
> > independent.
> 
> I see, that this is the least intrusive way, but I am not sure this 
> is the best way to do it. It would probably be better to pass the 
> Tablespace oid around (or look it up).
> 
> That would also leave the door open for other "Tablespace types" (currently
> "Filesystem directory" an OS managed tablespace :-).
> 
> Andreas





Re: Storage Location / Tablespaces (try 3)

From
Tom Lane
Date:
"Jim Buttafuoco" <jim@buttafuoco.net> writes:
> My first try passed the tablespace OID arround but someone pointed out the the
> WAL code doesn't know what the tablespace OID is or what it's location is. 

The low-level file access code (including WAL references) names tables
by two OIDs, which currently are database OID and relfilenode (the
latter is NOT to be considered equivalent to table OID, even though it
presently always is equal).

I believe that the correct implementation approach is to revise things
so that the low-level name of a table is tablespace OID + relfilenode;
this physical table name would in concept be completely distinct from
the logical table identification (database OID + table OID).  The file
reference path would become something like
"$PGDATA/base/tablespaceoid/relfilenode", where tablespaceoid might
reference a symlink to a directory instead of a plain directory.
Tablespace management then consists of setting up those symlinks
correctly, and there is essentially zero impact on the low-level access
code.

The hard part of this is that we are probably being sloppy in some
places about the difference between physical and logical table
identifications.  Those places will need to be found and fixed.
This needs to happen anyway, of course, since the point of introducing
relfilenode was to allow table versioning, which we still want.

Vadim suggested long ago that bufmgr, smgr, and below should have
nothing to do with referencing files by relcache entries; they should
only deal in physical file identifiers.  That requires some tedious but
(in principle) straightforward API changes.

BTW, if tablespaces can be shared by databases then DROP DATABASE
becomes rather tricky: how do you zap the correct files out of a shared
tablespace, keeping in mind that you are not logged into the doomed
database and can't look at its catalogs?  The best idea I've seen for
this so far is:

1. Access path for tables is really$PGDATA/base/databaseoid/tablespaceoid/relfilenode.
(BTW, we could save some work if we chdir'd into
$PGDATA/base/databaseoid at backend start and then used only relative
tablespaceoid/relfilenode paths.  Right now we tend to use absolute
paths because the bootstrap code doesn't do that chdir; which seems
like a stupid solution...)

2. A shared tablespace directory contains a subdirectory for each database
that has files in the tablespace.  Thus, the actual filesystem location
of a table is something like<tablespace>/databaseoid/relfilenode
The symlink from a database's $PGDATA/base/databaseoid/ directory to
the tablespace points at <tablespace>/databaseoid.  The first attempt to
create a table in a tablespace from a particular database will create
the hard subdirectory and set up the symlink; or perhaps that should be
done by an explicit tablespace management operation to "connect" the
database to the tablespace.

3. To drop a database, we examine the symlinks in its
$PGDATA/base/databaseoid/ and rm -rf each referenced tablespace
subdirectory before rm -rf'ing $PGDATA/base/databaseoid.
        regards, tom lane