Thread: Storage Location / Tablespaces (try 3)
Me again, I have some more details on my storage location patch This patch would allow the system admin (DBA) to specify the location of databases, tables/indexes and temporary objects (temp tables and temp sort space) independent of the database/system default location. This patch would replace the current "LOCATION" code. Please let me know if you have any questions/comments. I would like to see this feature make 7.3. I believe it will take about 1 month of coding and testing after I get started. Thanks Jim ============================================================================== Storage Location Patch (Try 3) (If people like TABLESPACE instead of LOCATION then s/LOCATION/TABLESPACE/g below) This patch would add the following NEW commands ---------------------------------------------------- CREATE LOCATION name PATH 'dbpath'; DROP LOCATION name; where dbpath is any directory that the postgresql backend can write to. (I know this is how Oracle works, don't know about the other major db systems) The following NEW GLOBAL system table would be added. ----------------------------------------------------- PG_LOCATION ( LOC_NAME name, LOC_PATH text -- This should be able to take any path name. ); (initdb would add (PGDATA,'/usr/local/pgsql/data') The following system tables would need to be modified ----------------------------------------------------- PG_DATABASE drop datpath add DATA_LOC_NAME name or DATA_LOC_OID OID add INDEX_LOC_NAME name orINDEX_LOC_OID OID add TEMP_LOC_NAME name or TEMP_LOC_OID OID PG_CLASS to add LOC_NAME name or LOC_OID OID DATA_LOC_* and INDEX_LOC_* would default to PGDATA if not specified. (I like *LOC_NAME better but I believe the rest of the systems tables use OID) The following command syntax would be modified ------------------------------------------------------ CREATE DATABASE WITH DATA_LOCATION=XXX INDEX_LOCATION=YYY TEMP_LOCATION=ZZZ CREATE TABLE aaa (...) WITH LOCATION=XXX; CREATE TABLE bbb (c1 text primary key location CCC) WITH LOCATION=XXX; CREATE TABLE ccc (c2 text unique location CCC) WITH LOCATION=XXX; CREATE INDEX XXX on SAMPLE (C2) WITH LOCATION BBB; Now for an example ------------------------------------------------------ First: postgresql is installed at /usr/local/pgsql userid postgres the postgres user also is the owner of /pg01/pg02 /pg03 the dba executes the following script CREATE LOCATION pg01 PATH '/pg01'; CREATE LOCATION pg02 PATH '/pg02'; CREATE LOCATION pg03 PATH '/pg03'; CREATE LOCATION bigdata PATH '/bigdata'; CREATE LOCATION bigidx PATH '/bigidx'; \q PG_LOCATION now has pg01 | /pg01 pg02 | /pg02 pg03 | /pg03 bigdata | /bigdata bigidx | /bigidx Now the following command is run CREATE DATABASE jim1 WITH DATA_LOCATION='pg01' INDEX_LOCATION='pg02' TEMP_LOCATION='pg03' -- OID of 'jim1' tuple is 1786146 on disk the directories look like this /pg01/1786146 <<-- Default DATA Location /pg02/1786146 <<-- Default INDEX Location /pg03/1786146 <<-- Default Temp Location All files from the above directories will have symbolic links to /usr/local/pgsql/data/base/1786146/ Now the system will have 1 BIG table that will get its own disk for data and its own disk for index create table big (a text,b text ..., primary key (a,b) location 'bigidx'); oid of big table is 1786150 oid of big table primary key index is 1786151 on disk directories look like this /bigdata/1786146/1786150 /bigidx/1786146/1786151 /usr/local/pgsql/data/base/1786146/1786150 symbolic link to /bigdata/1786146/1786150 /usr/local/pgsql/data/base/1786146/1786151 symbolic link to /bigdata/1786146/1786151 The symbolic links will enable the rest of the software to be location independent.
> (If people like TABLESPACE instead of LOCATION then > s/LOCATION/TABLESPACE/g > below) I like "tablespace" :-) > This patch would add the following NEW commands > ---------------------------------------------------- > CREATE LOCATION name PATH 'dbpath'; > DROP LOCATION name; > The following command syntax would be modified > ------------------------------------------------------ > CREATE DATABASE WITH DATA_LOCATION=XXX INDEX_LOCATION=YYY > TEMP_LOCATION=ZZZ > CREATE TABLE aaa (...) WITH LOCATION=XXX; > CREATE TABLE bbb (c1 text primary key location CCC) WITH LOCATION=XXX; > CREATE TABLE ccc (c2 text unique location CCC) WITH LOCATION=XXX; > CREATE INDEX XXX on SAMPLE (C2) WITH LOCATION BBB; Sounds great, but shouldn't we use syntax that is already around, like Oracle's or DB2's or ... > The symbolic links will enable the rest of the software to be location > independent. I see, that this is the least intrusive way, but I am not sure this is the best way to do it. It would probably be better to pass the Tablespace oid around (or look it up). That would also leave the door open for other "Tablespace types" (currently "Filesystem directory" an OS managed tablespace :-). Andreas
Andreas, My first try passed the tablespace OID arround but someone pointed out the the WAL code doesn't know what the tablespace OID is or what it's location is. This is why I would like to use the symbolic links. Tom do you have any ideas on this? Jim > > (If people like TABLESPACE instead of LOCATION then > > s/LOCATION/TABLESPACE/g > > below) > > I like "tablespace" :-) > > > This patch would add the following NEW commands > > ---------------------------------------------------- > > CREATE LOCATION name PATH 'dbpath'; > > DROP LOCATION name; > > > The following command syntax would be modified > > ------------------------------------------------------ > > CREATE DATABASE WITH DATA_LOCATION=XXX INDEX_LOCATION=YYY > > TEMP_LOCATION=ZZZ > > CREATE TABLE aaa (...) WITH LOCATION=XXX; > > CREATE TABLE bbb (c1 text primary key location CCC) WITH LOCATION=XXX; > > CREATE TABLE ccc (c2 text unique location CCC) WITH LOCATION=XXX; > > CREATE INDEX XXX on SAMPLE (C2) WITH LOCATION BBB; > > Sounds great, but shouldn't we use syntax that is already around, > like Oracle's or DB2's or ... > > > The symbolic links will enable the rest of the software to be location > > independent. > > I see, that this is the least intrusive way, but I am not sure this > is the best way to do it. It would probably be better to pass the > Tablespace oid around (or look it up). > > That would also leave the door open for other "Tablespace types" (currently > "Filesystem directory" an OS managed tablespace :-). > > Andreas
"Jim Buttafuoco" <jim@buttafuoco.net> writes: > My first try passed the tablespace OID arround but someone pointed out the the > WAL code doesn't know what the tablespace OID is or what it's location is. The low-level file access code (including WAL references) names tables by two OIDs, which currently are database OID and relfilenode (the latter is NOT to be considered equivalent to table OID, even though it presently always is equal). I believe that the correct implementation approach is to revise things so that the low-level name of a table is tablespace OID + relfilenode; this physical table name would in concept be completely distinct from the logical table identification (database OID + table OID). The file reference path would become something like "$PGDATA/base/tablespaceoid/relfilenode", where tablespaceoid might reference a symlink to a directory instead of a plain directory. Tablespace management then consists of setting up those symlinks correctly, and there is essentially zero impact on the low-level access code. The hard part of this is that we are probably being sloppy in some places about the difference between physical and logical table identifications. Those places will need to be found and fixed. This needs to happen anyway, of course, since the point of introducing relfilenode was to allow table versioning, which we still want. Vadim suggested long ago that bufmgr, smgr, and below should have nothing to do with referencing files by relcache entries; they should only deal in physical file identifiers. That requires some tedious but (in principle) straightforward API changes. BTW, if tablespaces can be shared by databases then DROP DATABASE becomes rather tricky: how do you zap the correct files out of a shared tablespace, keeping in mind that you are not logged into the doomed database and can't look at its catalogs? The best idea I've seen for this so far is: 1. Access path for tables is really$PGDATA/base/databaseoid/tablespaceoid/relfilenode. (BTW, we could save some work if we chdir'd into $PGDATA/base/databaseoid at backend start and then used only relative tablespaceoid/relfilenode paths. Right now we tend to use absolute paths because the bootstrap code doesn't do that chdir; which seems like a stupid solution...) 2. A shared tablespace directory contains a subdirectory for each database that has files in the tablespace. Thus, the actual filesystem location of a table is something like<tablespace>/databaseoid/relfilenode The symlink from a database's $PGDATA/base/databaseoid/ directory to the tablespace points at <tablespace>/databaseoid. The first attempt to create a table in a tablespace from a particular database will create the hard subdirectory and set up the symlink; or perhaps that should be done by an explicit tablespace management operation to "connect" the database to the tablespace. 3. To drop a database, we examine the symlinks in its $PGDATA/base/databaseoid/ and rm -rf each referenced tablespace subdirectory before rm -rf'ing $PGDATA/base/databaseoid. regards, tom lane