Re: Storage Location / Tablespaces (try 3) - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | Re: Storage Location / Tablespaces (try 3) |
Date | |
Msg-id | 15707.1015541196@sss.pgh.pa.us |
In response to | Re: Storage Location / Tablespaces (try 3) ("Jim Buttafuoco" <jim@buttafuoco.net>) |
List | pgsql-hackers |
"Jim Buttafuoco" <jim@buttafuoco.net> writes:
> My first try passed the tablespace OID around but someone pointed out that the
> WAL code doesn't know what the tablespace OID is or what its location is.

The low-level file access code (including WAL references) names tables by two OIDs, which currently are database OID and relfilenode (the latter is NOT to be considered equivalent to table OID, even though it presently always is equal).

I believe that the correct implementation approach is to revise things so that the low-level name of a table is tablespace OID + relfilenode; this physical table name would in concept be completely distinct from the logical table identification (database OID + table OID). The file reference path would become something like "$PGDATA/base/tablespaceoid/relfilenode", where tablespaceoid might reference a symlink to a directory instead of a plain directory. Tablespace management then consists of setting up those symlinks correctly, and there is essentially zero impact on the low-level access code.

The hard part of this is that we are probably being sloppy in some places about the difference between physical and logical table identifications. Those places will need to be found and fixed. This needs to happen anyway, of course, since the point of introducing relfilenode was to allow table versioning, which we still want.

Vadim suggested long ago that bufmgr, smgr, and below should have nothing to do with referencing files by relcache entries; they should only deal in physical file identifiers. That requires some tedious but (in principle) straightforward API changes.

BTW, if tablespaces can be shared by databases then DROP DATABASE becomes rather tricky: how do you zap the correct files out of a shared tablespace, keeping in mind that you are not logged into the doomed database and can't look at its catalogs? The best idea I've seen for this so far is:

1. Access path for tables is really
	$PGDATA/base/databaseoid/tablespaceoid/relfilenode
(BTW, we could save some work if we chdir'd into $PGDATA/base/databaseoid at backend start and then used only relative tablespaceoid/relfilenode paths. Right now we tend to use absolute paths because the bootstrap code doesn't do that chdir; which seems like a stupid solution...)

2. A shared tablespace directory contains a subdirectory for each database that has files in the tablespace. Thus, the actual filesystem location of a table is something like
	<tablespace>/databaseoid/relfilenode
The symlink from a database's $PGDATA/base/databaseoid/ directory to the tablespace points at <tablespace>/databaseoid. The first attempt to create a table in a tablespace from a particular database will create the hard subdirectory and set up the symlink; or perhaps that should be done by an explicit tablespace management operation to "connect" the database to the tablespace.

3. To drop a database, we examine the symlinks in its $PGDATA/base/databaseoid/ and rm -rf each referenced tablespace subdirectory before rm -rf'ing $PGDATA/base/databaseoid.

			regards, tom lane
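For illustration only, the three-step layout above can be mocked up with plain directories and symlinks. This is a rough sketch in Python, not backend code: the function names (connect_tablespace, relation_path, drop_database) are hypothetical, the OIDs are passed as plain integers, and a POSIX filesystem with symlink support is assumed. It shows only that the symlinks alone are enough to find a database's files in a shared tablespace without consulting its catalogs.

```python
import os
import shutil


def connect_tablespace(pgdata, tablespace_dir, db_oid, ts_oid):
    """Step 2 (hypothetical helper): create <tablespace>/<db_oid> and
    point $PGDATA/base/<db_oid>/<ts_oid> at it with a symlink."""
    db_dir = os.path.join(pgdata, "base", str(db_oid))
    os.makedirs(db_dir, exist_ok=True)
    target = os.path.join(tablespace_dir, str(db_oid))
    os.makedirs(target, exist_ok=True)
    link = os.path.join(db_dir, str(ts_oid))
    if not os.path.islink(link):
        os.symlink(target, link)
    return link


def relation_path(pgdata, db_oid, ts_oid, relfilenode):
    """Step 1: the access path is
    $PGDATA/base/databaseoid/tablespaceoid/relfilenode."""
    return os.path.join(pgdata, "base", str(db_oid), str(ts_oid),
                        str(relfilenode))


def drop_database(pgdata, db_oid):
    """Step 3: remove every tablespace subdirectory referenced by a
    symlink, then remove the database directory itself. No catalog
    access is needed; the symlinks carry all the information."""
    db_dir = os.path.join(pgdata, "base", str(db_oid))
    for entry in os.listdir(db_dir):
        path = os.path.join(db_dir, entry)
        if os.path.islink(path):
            # Resolve the link and delete <tablespace>/<db_oid> only;
            # other databases' subdirectories are left untouched.
            shutil.rmtree(os.path.realpath(path), ignore_errors=True)
    # rmtree unlinks the (now dangling) symlinks rather than following them.
    shutil.rmtree(db_dir)
```

Because each database gets its own subdirectory inside the shared tablespace, dropping one database in this mock-up cannot touch files belonging to another database that uses the same tablespace.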