Thread: Tablespaces
Hi all, I've been looking at implementing table spaces for 7.5. Some notes and implementation details follow. ------ Type of table space: There are many different table space implementations in relational database management systems. In my implementation, a table space in PostgreSQL will be the location of a directory on the file system in which files backing database objects can be stored. Global tables and non 'data' objects (WAL, CLOG, config files) will all remain in $PGDATA. $PGDATA/base will be the default table space. A given table space will be identified by a unique table space name. I haven't decided if 'unique' should mean database-wide unique or cross-database unique. It seems to me that we might run into problems with CREATE DATABASE ... TEMPLATE = <database with table spaces> if the uniqueness of table spaces is limited to the database level. A table space parameter will be added to DDL commands which create physical database objects (CREATE DATABASE/INDEX/TABLE/SEQUENCE) and to CREATE SCHEMA. The associated routines, as well as the corresponding DROP commands will need to be updated. Adding the ability to ALTER <object> TABLESPACE <name> seems a little painful. Would people use it? Comments? When an object is created the system will resolve the table space the object is stored in as follows: if the table space paramater is passed to the DDL command, then the object is stored in that table space (given validation of the table space, etc). If it is not passed, the object inherits its "parent's" table space where the parent/child hierarchy is as follows: database > schema > table > [index|sequence]. So, if you issued: create table foo.bar (...); We would first not that there is no TABLESPACE <name>, then cascade to the table space for the schema 'foo' (and possibly cascade to the table space for the database). A database which wasn't created with an explicit table space will be created under the default table space. This ensures backward compatibility. Creating a table space: A table space is a directory structure. The directory structure is as follows: [swm@dev /path/to/tblspc]$ ls OID1/ OID2/ OID1 and OID2 are the OIDs of databases which have created a table space against this file system location. In this respect, a table space resembles $PGDATA/base. I thought it useful to keep this kind of namespace mechanism in place so that administrators do not need to create hierarchies of names on different partitions if they want multiple databases to use the same partition. The actual creation of the table space will be done with: CREATE TABLE SPACE <name> LOCATION </path/to/tblspc>; Before creating the table space we must: 1) Check if the directory exists. If it does, create a sub directory as the OID of the current database. 2) Alternatively, if the directory doesn't exist, attempt to create it, then the sub directory. I wonder if a file, such as PG_TBLSPC, should be added to the table space directory so that, in the case of an existing non-empty directory, we can attempt to test if the directory is being used for something else and error out. Seems like: CREATE TABLESPACE tbl1 LOCATION '/var/' which will result in something like '/var/123443' is a bad idea. Then again, the user should know better. Comments? If everything goes well, we add an entry to pg_tablespace with the table space location and name (and and OID). Tying it all together: The catalogs pg_database, pg_namespace, and pg_class will have a 'tblspc' field. This will be the OID of the table space the object resides in, or 0 (default table space). Since we can then resolve relid/relname, schema and database to a tablespace, there aren't too many cases when extra logic needs to be added to the IO framework. In fact, most of it is taken care of because of the abstraction of relpath(). The creation of table spaces will need to be recorded in xlog in the same way that files are in heap_create() with the corresponding delete logic incase of ABORT. Postmaster startup: Ideally, the postmaster at startup should go into each tblspc/databaseoid directory and check for a postmaster.pid file to see if some other instance is touching the files we're interested in. This will require a control file listing tblspc/databaseoid paths and it will need to plug into WAL in case we die during CREATE TABLESPACE. Comments? Creating a database I think that createdb() is going to have to be reworked if pg_tablespace isn't shared (ie, tablespaces are only database unique). The reason being that if we create a database which has a table space, pg_tablespace in the new database will have to be updated and that cannot be done atomically with the `cp` based mechanism we currently use. I think I'm going to have to get my hands dirty before I can tell the extent to which createdb() will need reworking. pg_dump Obviously pg_dump will need to be able to dump table spaces. pg_dump running against <7.5 will DDL commands without a table space parameter and as such the database's physical layout, when loaded into 7.5, will be the same as for <7.5. --- Comments? Questions? Suggestions? Thanks, Gavin
On Thursday 26 February 2004 15:37, Gavin Sherry wrote: > Tying it all together: > > The catalogs pg_database, pg_namespace, and pg_class will have a 'tblspc' > field. This will be the OID of the table space the object resides in, or 0 > (default table space). Since we can then resolve relid/relname, schema and > database to a tablespace, there aren't too many cases when extra logic > needs to be added to the IO framework. In fact, most of it is taken care > of because of the abstraction of relpath(). > > The creation of table spaces will need to be recorded in xlog in the same > way that files are in heap_create() with the corresponding delete logic > incase of ABORT. Is tablespace some sort of copyrighted? Last I remembered, the discussion was about location/storage Just a thought.. Shridhar
Gavin Sherry wrote: > Hi all, > > I've been looking at implementing table spaces for 7.5. Some notes and > implementation details follow. > > ------ > > Type of table space: > > There are many different table space implementations in relational > database management systems. In my implementation, a table space in > PostgreSQL will be the location of a directory on the file system in > which files backing database objects can be stored. Global tables and > non 'data' objects (WAL, CLOG, config files) will all remain in $PGDATA. > $PGDATA/base will be the default table space. Is it possible to put WALs and CLOGs into different tablespaces? (maybe different RAID systems). Some companies want that ... > A given table space will be identified by a unique table space name. I > haven't decided if 'unique' should mean database-wide unique or > cross-database unique. It seems to me that we might run into problems > with CREATE DATABASE ... TEMPLATE = <database with table spaces> if the > uniqueness of table spaces is limited to the database level. I strongly vote for database cluster wide unique names because somebody could have a tablespace "webusers" or something like that. To me this makes far more sense. > A table space parameter will be added to DDL commands which create > physical database objects (CREATE DATABASE/INDEX/TABLE/SEQUENCE) and to > CREATE SCHEMA. The associated routines, as well as the corresponding DROP > commands will need to be updated. Adding the ability to ALTER <object> > TABLESPACE <name> seems a little painful. Would people use it? Comments? I think people won't need it in first place because this seems to be really painful. What really matters is that the number of tablespaces and file / tablespace is unlimited. SAP DB has limited the number of devspaces to 32 (I think). This is real bull.... because if your database grows unexpectedly you are in deep trouble (expert database design by SAP, MySQL and 100000....0000 others). > When an object is created the system will resolve the table space the > object is stored in as follows: if the table space paramater is passed to > the DDL command, then the object is stored in that table space (given > validation of the table space, etc). If it is not passed, the object > inherits its "parent's" table space where the parent/child hierarchy is as > follows: database > schema > table > [index|sequence]. So, if you issued: > > create table foo.bar (...); > > We would first not that there is no TABLESPACE <name>, then cascade to > the table space for the schema 'foo' (and possibly cascade to the table > space for the database). A database which wasn't created with an explicit > table space will be created under the default table space. This ensures > backward compatibility. Will users automatically be assigned to a certain table space? How is this going to work? > Creating a table space: > > A table space is a directory structure. The directory structure is as > follows: > > [swm@dev /path/to/tblspc]$ ls > OID1/ OID2/ > > OID1 and OID2 are the OIDs of databases which have created a table space > against this file system location. In this respect, a table space > resembles $PGDATA/base. I thought it useful to keep this kind of > namespace mechanism in place so that administrators do not need to create > hierarchies of names on different partitions if they want multiple > databases to use the same partition. > > The actual creation of the table space will be done with: > > CREATE TABLE SPACE <name> LOCATION </path/to/tblspc>; > > Before creating the table space we must: > > 1) Check if the directory exists. If it does, create a sub directory as > the OID of the current database. > > 2) Alternatively, if the directory doesn't exist, attempt to create it, > then the sub directory. > > I wonder if a file, such as PG_TBLSPC, should be added to the table space > directory so that, in the case of an existing non-empty directory, we can > attempt to test if the directory is being used for something else and > error out. Seems like: > > CREATE TABLESPACE tbl1 LOCATION '/var/' > > which will result in something like '/var/123443' is a bad idea. Then > again, the user should know better. Comments? > > If everything goes well, we add an entry to pg_tablespace with the table > space location and name (and and OID). > > > Tying it all together: > > The catalogs pg_database, pg_namespace, and pg_class will have a 'tblspc' > field. This will be the OID of the table space the object resides in, or 0 > (default table space). Since we can then resolve relid/relname, schema and > database to a tablespace, there aren't too many cases when extra logic > needs to be added to the IO framework. In fact, most of it is taken care > of because of the abstraction of relpath(). > > The creation of table spaces will need to be recorded in xlog in the same > way that files are in heap_create() with the corresponding delete logic > incase of ABORT. > > > Postmaster startup: > > Ideally, the postmaster at startup should go into each tblspc/databaseoid > directory and check for a postmaster.pid file to see if some other > instance is touching the files we're interested in. This will require a > control file listing tblspc/databaseoid paths and it will need to plug > into WAL in case we die during CREATE TABLESPACE. Comments? > > > Creating a database > > I think that createdb() is going to have to be reworked if pg_tablespace > isn't shared (ie, tablespaces are only database unique). The reason being > that if we create a database which has a table space, pg_tablespace in the > new database will have to be updated and that cannot be done atomically > with the `cp` based mechanism we currently use. > > I think I'm going to have to get my hands dirty before I can tell the > extent to which createdb() will need reworking. > > > pg_dump > > Obviously pg_dump will need to be able to dump table spaces. pg_dump > running against <7.5 will DDL commands without a table space parameter and > as such the database's physical layout, when loaded into 7.5, will be the > same as for <7.5. > > --- > > Comments? Questions? Suggestions? > > Thanks, > > Gavin Do you plan support for limiting the size of a tablespace? ISPs will vote for that because they can limit the size of a database on the database level rather than on the operating system level. Of course this can and (should???) be done on the operation system level but people will definitely ask for that. If sizing is not supported we should definitely provide minor documentation which tells people how to do that on the operating system level (at least poting to some useful information). Best regards, Hans -- Cybertec Geschwinde u Schoenig Schoengrabern 134, A-2020 Hollabrunn, Austria Tel: +43/2952/30706 or +43/664/233 90 75 www.cybertec.at, www.postgresql.at, kernel.cybertec.at
On Thursday 26 February 2004 10:07, Gavin Sherry wrote: > > CREATE TABLESPACE tbl1 LOCATION '/var/' > > which will result in something like '/var/123443' is a bad idea. Then > again, the user should know better. Comments? The LOCATION should have the same owner and permissions as $PGDATA - that should catch mistyping. Unless you're running as root, of course. In which case you clearly know better than everyone else, so off you go! > Comments? Questions? Suggestions? Presumably I'm using this to deal with performance/space issues, so there clearly needs to be something in the pg_stat_xxx system to show figures based on tablespace - not sure what you'd measure though - disk I/O, number of nodes? Perhaps something in contrib/ too - tablespace_disk_usage.pl or some such. -- Richard Huxton Archonet Ltd
> Is it possible to put WALs and CLOGs into different tablespaces? (maybe > different RAID systems). Some companies want that ... I wasn't going to look at that just yet. There is of course the temporary hack of symlinking WAL else where. I'd be interested to see the performance difference between WAL and data on the same RAID/controller and WAL and data on different RAID/controller with Jan's improvements to the buffer management. Gavin
On Thu, 26 Feb 2004, Gavin Sherry wrote: > Comments? Questions? Suggestions? Is that plan that in the future one can split a single table into different table spaces? Like storing all rows with year < 1999 in one tablespace and the rest in another? With the rule system and two underlying tables one could make it work by hand I think. I've never used tablespaces in oracle so I don't know what it can offer. I though it could do things like the above. True? What is the syntax and for example, how does it effect indexes (not at all maybe). If you don't want to discuss this now, I understand. It's not part of the design as it is now. I'm just curious at what direction we are moving and what is possible to do. -- /Dennis Björklund
On Thu, 26 Feb 2004, Dennis Bjorklund wrote: > On Thu, 26 Feb 2004, Gavin Sherry wrote: > > > Comments? Questions? Suggestions? > > Is that plan that in the future one can split a single table into > different table spaces? Like storing all rows with year < 1999 in one > tablespace and the rest in another? These are called partitions in oracle. You can approximate this with table spaces by using a partial index and putting it in a different table space. The problem, of course, is seq scans. > > With the rule system and two underlying tables one could make it work by > hand I think. > > I've never used tablespaces in oracle so I don't know what it can offer. I Certainly, table spaces are used in many ways in oracle, db2, etc. You can mirror data across them, have different buffer sizes for example. In some implementations, they can be raw disk partitions (no file system). I don't intend going this far, however. > If you don't want to discuss this now, I understand. It's not part of the > design as it is now. I'm just curious at what direction we are moving and > what is possible to do. Well, partitions are something else entirely. Mirroring would be interesting, but RAID designers are better at parallelisation of IO than (some) database developers. Might be better to keep the problem seperate. Gavin
> Is it possible to put WALs and CLOGs into different tablespaces? (maybe > different RAID systems). Some companies want that ... You can do this now, but it would be nice to be able to have it actually configurable versus the hacked linked method. J > -- Co-Founder Command Prompt, Inc. The wheel's spinning but the hamster's dead
Gavin Sherry wrote: >>Is it possible to put WALs and CLOGs into different tablespaces? (maybe >>different RAID systems). Some companies want that ... > > > I wasn't going to look at that just yet. > > There is of course the temporary hack of symlinking WAL else where. that's what we do now. we symlink databases and wals ... > I'd be interested to see the performance difference between WAL and data > on the same RAID/controller and WAL and data on different RAID/controller > with Jan's improvements to the buffer management. > > Gavin yes, that's what i am looking for. i should do some testing. in case of enough i/o power additional cpus scale almost linearily (depending on the application of course; i have done some testing on a customer's aix box ...). it would be interesting to see what jan's buffer strategy does (and bg writer) ... -- Cybertec Geschwinde u Schoenig Schoengrabern 134, A-2020 Hollabrunn, Austria Tel: +43/2952/30706 or +43/664/233 90 75 www.cybertec.at, www.postgresql.at, kernel.cybertec.at
On Thu, Feb 26, 2004 at 11:22:28PM +1100, Gavin Sherry wrote: > Certainly, table spaces are used in many ways in oracle, db2, etc. You can > mirror data across them, have different buffer sizes for example. > In some implementations, they can be raw disk partitions (no file system). > I don't intend going this far, however. Perhaps now would be a good time to bring up my directio on Solaris question from a year or so back? Is there any interest in the ability to use raw disk? Alex (who is overjoyed to hear discussion of tablespaces again) -- alex@posixnap.net Alex J. Avriette, Unix Systems Gladiator Sep 25 12:52:39 buggle /bsd: wsdisplay0 at vga1: removing /dev/radio/*
Gavin Sherry <swm@linuxworld.com.au> writes: > A table space is a directory structure. The directory structure is as > follows: > [swm@dev /path/to/tblspc]$ ls > OID1/ OID2/ > OID1 and OID2 are the OIDs of databases which have created a table space > against this file system location. In this respect, a table space > resembles $PGDATA/base. I thought it useful to keep this kind of > namespace mechanism in place ... Actually, this is *necessary* AFAICT. The case that forces it is DROP DATABASE. Since you have to execute that from another database, there's no reasonable way to look into the target database's catalogs. That means that the OID of the database has to be sufficient information to get rid of all its files. You can do this fairly easily if in each tablespace (whose locations you know from the shared pg_tablespace table) you can look for a subdirectory matching the target database's OID. If we tried to put the database's files just "loose" in each tablespace directory then we'd be in trouble. I think this is also an implementation reason for favoring cluster-wide tablespaces over database-local ones. I'm not sure how you drop a database from outside if you can't see where its tablespaces are. I believe that it will be necessary to expand RelFileNode to three OIDs (tablespace, database, relation). I had once hoped that it could be kept at two (tablespace, relation) but with a physical layout like this you more or less have to have three. One issue that needs to be agreed on early is how the low-level file access code finds a tablespace. What I would personally like is for $PGDATA to contain symlinks to the tablespace top directories. The actual access path for any relation could then be built trivially from its RelFileNode:$PGDATA/tablespaces/TBOID/DBOID/RELFILENODE ------------------------- The underlined part references a symlink that leads to the directory containing the per-database subdirectories. I am expecting to hear some bleating about this from people whose preferred platforms don't support symlinks ;-). However, if we don't rely on symlinks for this then the tablespace-OID-to-physical-path mapping has to be explicitly known at very low levels of the system (md.c in particular). We can't expect md.c to get that information by reading pg_tablespace. It would have to rely on some backdoor path, such as a flat text file it could read at backend startup. I think this approach will leave us fighting a lot of problems with locking and out-of-date information. Speaking of locking, can we do anything to prevent people from shooting themselves in the foot by changing active tablespaces? Are we even going to have a DROP TABLESPACE command, and if so what would it do? regards, tom lane
Gavin, After creating a tablespace what (if any) changes can be done to it. Can you DROP a tablespace, or once created will it always exist? Can you RENAME a tablespace? Can you change the location of a tablespace (i.e you did a disk reorg and move the contents to a different location and now want to point to the new location)? What are the permissions necessary to create a tablespace (can any use connected to the database create a tablespace, or only superuser, or ...)? Overall this will be a great addition to postgres. I am looking forward to this feature. thanks, --Barry Gavin Sherry wrote: > Hi all, > > I've been looking at implementing table spaces for 7.5. Some notes and > implementation details follow. > > ------ > > Type of table space: > > There are many different table space implementations in relational > database management systems. In my implementation, a table space in > PostgreSQL will be the location of a directory on the file system in > which files backing database objects can be stored. Global tables and > non 'data' objects (WAL, CLOG, config files) will all remain in $PGDATA. > $PGDATA/base will be the default table space. > > A given table space will be identified by a unique table space name. I > haven't decided if 'unique' should mean database-wide unique or > cross-database unique. It seems to me that we might run into problems > with CREATE DATABASE ... TEMPLATE = <database with table spaces> if the > uniqueness of table spaces is limited to the database level. > > A table space parameter will be added to DDL commands which create > physical database objects (CREATE DATABASE/INDEX/TABLE/SEQUENCE) and to > CREATE SCHEMA. The associated routines, as well as the corresponding DROP > commands will need to be updated. Adding the ability to ALTER <object> > TABLESPACE <name> seems a little painful. Would people use it? Comments? > > When an object is created the system will resolve the table space the > object is stored in as follows: if the table space paramater is passed to > the DDL command, then the object is stored in that table space (given > validation of the table space, etc). If it is not passed, the object > inherits its "parent's" table space where the parent/child hierarchy is as > follows: database > schema > table > [index|sequence]. So, if you issued: > > create table foo.bar (...); > > We would first not that there is no TABLESPACE <name>, then cascade to > the table space for the schema 'foo' (and possibly cascade to the table > space for the database). A database which wasn't created with an explicit > table space will be created under the default table space. This ensures > backward compatibility. > > > Creating a table space: > > A table space is a directory structure. The directory structure is as > follows: > > [swm@dev /path/to/tblspc]$ ls > OID1/ OID2/ > > OID1 and OID2 are the OIDs of databases which have created a table space > against this file system location. In this respect, a table space > resembles $PGDATA/base. I thought it useful to keep this kind of > namespace mechanism in place so that administrators do not need to create > hierarchies of names on different partitions if they want multiple > databases to use the same partition. > > The actual creation of the table space will be done with: > > CREATE TABLE SPACE <name> LOCATION </path/to/tblspc>; > > Before creating the table space we must: > > 1) Check if the directory exists. If it does, create a sub directory as > the OID of the current database. > > 2) Alternatively, if the directory doesn't exist, attempt to create it, > then the sub directory. > > I wonder if a file, such as PG_TBLSPC, should be added to the table space > directory so that, in the case of an existing non-empty directory, we can > attempt to test if the directory is being used for something else and > error out. Seems like: > > CREATE TABLESPACE tbl1 LOCATION '/var/' > > which will result in something like '/var/123443' is a bad idea. Then > again, the user should know better. Comments? > > If everything goes well, we add an entry to pg_tablespace with the table > space location and name (and and OID). > > > Tying it all together: > > The catalogs pg_database, pg_namespace, and pg_class will have a 'tblspc' > field. This will be the OID of the table space the object resides in, or 0 > (default table space). Since we can then resolve relid/relname, schema and > database to a tablespace, there aren't too many cases when extra logic > needs to be added to the IO framework. In fact, most of it is taken care > of because of the abstraction of relpath(). > > The creation of table spaces will need to be recorded in xlog in the same > way that files are in heap_create() with the corresponding delete logic > incase of ABORT. > > > Postmaster startup: > > Ideally, the postmaster at startup should go into each tblspc/databaseoid > directory and check for a postmaster.pid file to see if some other > instance is touching the files we're interested in. This will require a > control file listing tblspc/databaseoid paths and it will need to plug > into WAL in case we die during CREATE TABLESPACE. Comments? > > > Creating a database > > I think that createdb() is going to have to be reworked if pg_tablespace > isn't shared (ie, tablespaces are only database unique). The reason being > that if we create a database which has a table space, pg_tablespace in the > new database will have to be updated and that cannot be done atomically > with the `cp` based mechanism we currently use. > > I think I'm going to have to get my hands dirty before I can tell the > extent to which createdb() will need reworking. > > > pg_dump > > Obviously pg_dump will need to be able to dump table spaces. pg_dump > running against <7.5 will DDL commands without a table space parameter and > as such the database's physical layout, when loaded into 7.5, will be the > same as for <7.5. > > --- > > Comments? Questions? Suggestions? > > Thanks, > > Gavin > > ---------------------------(end of broadcast)--------------------------- > TIP 7: don't forget to increase your free space map settings
On Thu, 26 Feb 2004, Alex J. Avriette wrote: > On Thu, Feb 26, 2004 at 11:22:28PM +1100, Gavin Sherry wrote: > > > Certainly, table spaces are used in many ways in oracle, db2, etc. You can > > mirror data across them, have different buffer sizes for example. > > In some implementations, they can be raw disk partitions (no file system). > > I don't intend going this far, however. > > Perhaps now would be a good time to bring up my directio on Solaris question > from a year or so back? Is there any interest in the ability to use raw > disk? I do not intend to undertake raw disk tablespaces for 7.5. I'd be interested if anyone could provide some real world benchmarking of file system vs. raw disk. Postgres benefits a lot from kernel file system cache at the moment. Also, I believe that database designers have traditionally made bad file system designers. Raw database partitions often lack the tools essential to a scalable environment. For example, the ability to resize partitions. Gavin
On Thu, 26 Feb 2004, Tom Lane wrote: > Gavin Sherry <swm@linuxworld.com.au> writes: > > A table space is a directory structure. The directory structure is as > > follows: > > [swm@dev /path/to/tblspc]$ ls > > OID1/ OID2/ > > OID1 and OID2 are the OIDs of databases which have created a table space > > against this file system location. In this respect, a table space > > resembles $PGDATA/base. I thought it useful to keep this kind of > > namespace mechanism in place ... > > Actually, this is *necessary* AFAICT. The case that forces it is DROP > DATABASE. Since you have to execute that from another database, there's > no reasonable way to look into the target database's catalogs. That > means that the OID of the database has to be sufficient information to > get rid of all its files. You can do this fairly easily if in each > tablespace (whose locations you know from the shared pg_tablespace > table) you can look for a subdirectory matching the target database's > OID. If we tried to put the database's files just "loose" in each > tablespace directory then we'd be in trouble. Ahhh. Yes. > > I think this is also an implementation reason for favoring cluster-wide > tablespaces over database-local ones. I'm not sure how you drop a > database from outside if you can't see where its tablespaces are. Naturally. > > I believe that it will be necessary to expand RelFileNode to three OIDs > (tablespace, database, relation). I had once hoped that it could be > kept at two (tablespace, relation) but with a physical layout like this > you more or less have to have three. Yes. I agree. > > One issue that needs to be agreed on early is how the low-level file > access code finds a tablespace. What I would personally like is for > $PGDATA to contain symlinks to the tablespace top directories. The > actual access path for any relation could then be built trivially from > its RelFileNode: > $PGDATA/tablespaces/TBOID/DBOID/RELFILENODE > ------------------------- > The underlined part references a symlink that leads to the directory > containing the per-database subdirectories. > > I am expecting to hear some bleating about this from people whose > preferred platforms don't support symlinks ;-). However, if we don't Actually, I think that's a pretty good idea :-). I'd solves a bunch of issues in the backend (postmaster start up can recurse through $PGDATA/tablespaces looking for postmaster.pid files) and will also assist admins with complex configurations (perhaps). > Speaking of locking, can we do anything to prevent people from shooting > themselves in the foot by changing active tablespaces? Are we even > going to have a DROP TABLESPACE command, and if so what would it do? Ahh. I forgot to detail my ideas on this. It seems to me that we cannot drop a table space until the directory is empty. We will need a shared invalidation message so that backends do not attempt to create an object just after we drop the table space. Thanks, Gavin
On Thu, 2004-02-26 at 13:22, Gavin Sherry wrote: > Postgres benefits a lot from kernel file system cache > at the moment. With the implementation of much smarter and more adaptive cache replacement algorithm i.e. ARC, I would expect the benefit of using the kernel file system cache to diminish significantly. It appears to me, and I could be wrong, that the reason Postgres has depended on the kernel file system cache isn't that this is obviously better in some absolute sense (though it might be depending on the deployment scenario), but that the original cache replacement algorithm in Postgres was sufficiently poor that the better cache replacement algorithms in the kernel cache more than offset any sub-optimality that might result from doing so. I would expect that with ARC and the redesign of some of the buffer management bits for more scalability, you might very well get better performance by allocating most of the memory to the buffer cache rather than leaving it to the kernel file cache. I'm actually fairly curious to see what the new buffer management scheme will mean in terms of real world performance and parameter tuning. -James Rogersjrogers@neopolitan.com
>Gavin Sherry > The creation of table spaces will need to be recorded in xlog in the same > way that files are in heap_create() with the corresponding delete logic > incase of ABORT. Overall, sounds very cool. Please could we record the OID of the tablespace in the WAL logs, not the path to the tablespace? That way, we run no risks of having the WAL logs not work correctly should things change slightly... .. need to record drop tablespaces in the WAL logs also. I'm sure you meant both of those, just checking. Can drop tablespace require a specific privelege too? It's too easy to drop parts of a database without thinking... Best Regards, Simon Riggs
Gavin Sherry <swm@linuxworld.com.au> writes: >> Speaking of locking, can we do anything to prevent people from shooting >> themselves in the foot by changing active tablespaces? Are we even >> going to have a DROP TABLESPACE command, and if so what would it do? > Ahh. I forgot to detail my ideas on this. It seems to me that we cannot > drop a table space until the directory is empty. How would it get to be empty? Are you thinking of some sort of "connect database to tablespace" and "disconnect database from tablespace" commands that would respectively create and delete the per-database subdirectory? That seems moderately reasonable to me. We could then invent a locking protocol that requires backends to lock a tablespace before they can execute either of these operations (or delete the tablespace of course). regards, tom lane
On Fri, Feb 27, 2004 at 08:22:25AM +1100, Gavin Sherry wrote: > interested if anyone could provide some real world benchmarking of file > system vs. raw disk. Postgres benefits a lot from kernel file system cache > at the moment. Also, I believe that database designers have traditionally > made bad file system designers. Raw database partitions often lack the > tools essential to a scalable environment. For example, the ability to > resize partitions. The only reason I mentioned it to begin with was the recommendation of directio for databases in the Sun Blueprint, _Tuning Databases on the Solaris Platform_ (and being a Solaris geek, I asked, but apparently nobody else is worried enough about performance or not using Solaris enough to care). It's not critical, of course. I think, however, that many of us would like to see some of the features of Oracle and DB2 available to users of postgres. Some of these features are raw disk, tablespaces, and replication. We're getting there, and making really terrific progress (I fully expect replication to be ready for primetime in the 8-12 mos timeframe), but we're not quite there yet. As I said, I'm very glad to hear tablespaces mentioned again and see what looks like work being done on it. Thanks! Alex -- alex@posixnap.net Alex J. Avriette, Solaris Artillery Officer "Among the many misdeeds of the British rule in India, history will look upon the act of depriving a whole nation of arms,as the blackest." - Mahatma Gandhi
James Rogers <jrogers@neopolitan.com> writes: > With the implementation of much smarter and more adaptive cache > replacement algorithm i.e. ARC, I would expect the benefit of using the > kernel file system cache to diminish significantly. It appears to me, > and I could be wrong, that the reason Postgres has depended on the > kernel file system cache isn't that this is obviously better in some > absolute sense (though it might be depending on the deployment > scenario), but that the original cache replacement algorithm in Postgres > was sufficiently poor that the better cache replacement algorithms in > the kernel cache more than offset any sub-optimality that might result > from doing so. The question of optimality of replacement algorithm is only one of the arguments for using a small buffer cache. IMHO a considerably stronger argument is that the kernel's memory management is more flexible: it can use that memory for either disk cache or program workspace, and it can change the allocation on-the-fly as load demands. If you dedicate most of RAM to Postgres buffers then you are likely to be wasting RAM or swapping heavily. Possibly both :-( Another gotcha is that unless the OS allows you to lock shared memory into RAM, the shared buffers themselves could get swapped out, which is a no-win scenario by any measure. Keeping the shared buffer arena small helps prevent that by ensuring all the buffers are "hot". Of course, this is all speculation until we get some real-world experience with ARC. But I don't expect it to be a magic bullet. regards, tom lane
> > I am expecting to hear some bleating about this from people whose > > preferred platforms don't support symlinks ;-). However, if we don't Well, one option would be to have the low level filesystem storage (md.c?) routines implement a kind of symlink themselves. Just a file with a special magic number followed by a path. I'm normally against reimplementing OS services but symlinks are really a very simple concept and simple to implement. Especially if you can make a few simplifying assumptions: they only ever need to appear as the final path element not as parent directories and tablespaces don't need symlinks pointing to symlinks. Ideally postgres also doesn't need to implement relative links either. -- greg
> > Ahh. I forgot to detail my ideas on this. It seems to me that we cannot > > drop a table space until the directory is empty. Agreed. > > How would it get to be empty? Are you thinking of some sort of "connect > database to tablespace" and "disconnect database from tablespace" > commands that would respectively create and delete the per-database > subdirectory? That seems moderately reasonable to me. We could then I would only allow the drop if the directory only contains empty db oid directories. Andreas
> I do not intend to undertake raw disk tablespaces for 7.5. I'd be > interested if anyone could provide some real world benchmarking of file > system vs. raw disk. Postgres benefits a lot from kernel file system cache > at the moment. Yes, and don't forget that pg also relys on the OS for grouping and sorting the physical writes and doing readahead where appropriate. The use of raw disks is usually paired with the use of kernel aio. The difference is said to be up to 30% on Solaris. I can assert, that it made the difference between a bogged down system and a much better behaved DB on Sun here. My experience with kaio on AIX Informix is, that kaio is faster as long as IO is not the bottleneck (disk 100% busy is the metric to watch, not Mb/s), while for an IO bound system the Informix builtin IO threads that can be used instead win. (Since they obviously do better at grouping, sorting and readahead than the AIX kernel does for kaio) Overall I think the price and komplexity is too high, especially since there are enough platforms where the kernel does a pretty good job at grouping, sorting and readahead. Additionally the kernel takes non PostgreSQL IO into account. Andreas
On Fri, 27 Feb 2004, Zeugswetter Andreas SB SD wrote: > > > > Ahh. I forgot to detail my ideas on this. It seems to me that we cannot > > > drop a table space until the directory is empty. > > Agreed. > > > > > How would it get to be empty? Are you thinking of some sort of "connect > > database to tablespace" and "disconnect database from tablespace" > > commands that would respectively create and delete the per-database > > subdirectory? That seems moderately reasonable to me. We could then > > I would only allow the drop if the directory only contains empty db oid > directories. Wouldn't this be better tracked in the dependency tracking that's already built into postgresql? Checking to see if the directory is empty is open to race conditions, but locking the dependency tracking while dropping a tablespace isn't.
On Fri, 27 Feb 2004, Gavin Sherry wrote: > On Thu, 26 Feb 2004, Alex J. Avriette wrote: > > > On Thu, Feb 26, 2004 at 11:22:28PM +1100, Gavin Sherry wrote: > > > > > Certainly, table spaces are used in many ways in oracle, db2, etc. You can > > > mirror data across them, have different buffer sizes for example. > > > In some implementations, they can be raw disk partitions (no file system). > > > I don't intend going this far, however. > > > > Perhaps now would be a good time to bring up my directio on Solaris question > > from a year or so back? Is there any interest in the ability to use raw > > disk? > > I do not intend to undertake raw disk tablespaces for 7.5. I'd be > interested if anyone could provide some real world benchmarking of file > system vs. raw disk. Postgres benefits a lot from kernel file system cache > at the moment. Also, I believe that database designers have traditionally > made bad file system designers. Raw database partitions often lack the > tools essential to a scalable environment. For example, the ability to > resize partitions. Is possible / reasonable / smart and or dumb to look at implementing the tablespaces as riding atop the initlocation handled stuff. I.e. postgresql can only create tablespaces in areas that are created by initlocation, thus keeping it in its box, so to speak?
"scott.marlowe" <scott.marlowe@ihs.com> writes: > On Fri, 27 Feb 2004, Zeugswetter Andreas SB SD wrote: >>> How would it get to be empty? Are you thinking of some sort of "connect >>> database to tablespace" and "disconnect database from tablespace" >>> commands that would respectively create and delete the per-database >>> subdirectory? That seems moderately reasonable to me. We could then >> >> I would only allow the drop if the directory only contains empty db oid >> directories. That's subject to race conditions (ie, someone creating a table about the same time you are deciding it's okay to drop the tablespace). There needs to be some interlock, and I think that associating that lock with infrequently executed connect/disconnect operations would be good from a performance standpoint. > Wouldn't this be better tracked in the dependency tracking that's already > built into postgresql? No, because dependencies are local to individual databases. regards, tom lane
"scott.marlowe" <scott.marlowe@ihs.com> writes: > Is possible / reasonable / smart and or dumb to look at implementing the > tablespaces as riding atop the initlocation handled stuff. In my mind, one of the main benefits of this work will be that we'll be able to get *rid* of the initlocation stuff. It's a crock. regards, tom lane
On Fri, 27 Feb 2004, Tom Lane wrote: > "scott.marlowe" <scott.marlowe@ihs.com> writes: > > Is possible / reasonable / smart and or dumb to look at implementing the > > tablespaces as riding atop the initlocation handled stuff. > > In my mind, one of the main benefits of this work will be that we'll be > able to get *rid* of the initlocation stuff. It's a crock. OK, that's fine, but I keep thinking that a superuser should have to create the tablespace itself, and then tables can be assigned by users based on the rights assigned by the dba / superuser. Is that how we're looking at doing it, or will any user be able to create a tablespace anywhere postgresql has write permission, or will only dbas be able to create AND use table spaces. I'm just not sure how that's gonna be handled, and haven't seen it addressed.
"scott.marlowe" <scott.marlowe@ihs.com> writes: > On Fri, 27 Feb 2004, Tom Lane wrote: >> In my mind, one of the main benefits of this work will be that we'll be >> able to get *rid* of the initlocation stuff. It's a crock. > OK, that's fine, but I keep thinking that a superuser should have to > create the tablespace itself, and then tables can be assigned by users > based on the rights assigned by the dba / superuser. Yeah, we haven't yet gotten to the issue of permissions, but certainly creating or deleting a tablespace has to be a superuser-only operation, if only because you probably have also got some manual filesystem work to do to set up the associated directory; and that has to be done as root or postgres. It might be a good idea to restrict connect/disconnect (if we use those operations) to superusers as well. regards, tom lane
> >> I do not intend to undertake raw disk tablespaces for 7.5. I'd be >> interested if anyone could provide some real world benchmarking of file >> system vs. raw disk. Postgres benefits a lot from kernel file system >> cache >> at the moment. > > Yes, and don't forget that pg also relys on the OS for grouping and > sorting the physical writes and doing readahead where appropriate. > > Most people I know want tablespaces in order to limit or preallocate the disk space used by a table or database in addition to controlling the physical location of a table or database. I know on linux, there is the option of creating an empty file or a specific size using dd, mounting it through loopback, formatting it, symlinking the appropriate OID/TID (or mounting the lpb device in the appropriate directory) and then you control how much space that directory/mount point can contain. Of course, with MVCC you would have to vacuum frequently, as you could miss some updates if there weren't enough tuples marked as free. If there were "in-place" updates, the preallocation and limitation much easier, but that's not how PG works. If the tablespace disk space allocation is exceeded there would need to be some graceful reporting condition back to the client. "UPDATE/INSERT failed (tablespace size exceeded)", "(tablespace full)", "(disk full)" or some other error may need to be handled/reported.
On Fri, 27 Feb 2004 tswan@idigx.com wrote: > > > >> I do not intend to undertake raw disk tablespaces for 7.5. I'd be > >> interested if anyone could provide some real world benchmarking of file > >> system vs. raw disk. Postgres benefits a lot from kernel file system > >> cache > >> at the moment. > > > > Yes, and don't forget that pg also relys on the OS for grouping and > > sorting the physical writes and doing readahead where appropriate. > > > > > > Most people I know want tablespaces in order to limit or preallocate the > disk space used by a table or database in addition to controlling the > physical location of a table or database. > > I know on linux, there is the option of creating an empty file or a > specific size using dd, mounting it through loopback, formatting it, > symlinking the appropriate OID/TID (or mounting the lpb device in the > appropriate directory) and then you control how much space that > directory/mount point can contain. > > Of course, with MVCC you would have to vacuum frequently, as you could > miss some updates if there weren't enough tuples marked as free. If there > were "in-place" updates, the preallocation and limitation much easier, but > that's not how PG works. I do not intend to work on such a system for the initial introduction of table spaces. The problem is, of course, knowing when you're actually out of space in a table space in any given transaction. Given that WAL is on a different partition (at least for the moment) the table space will not have transaction X's data written to it until after transaction X is finished. And we cannot error out a transaction which is already commited. The solution is to keep track of free space and error out at some percentage of free space remaining. But I don't want to complicate tablespaces too much in 7.5. Thanks, Gavin
Gavin Sherry <swm@linuxworld.com.au> writes: > I do not intend to work on such a system for the initial introduction of > table spaces. The problem is, of course, knowing when you're actually out > of space in a table space in any given transaction. It should not be that hard, at least not on local filesystems. When PG realizes that a new page must be added to a table, it does a write() to append a page of zeroes to the physical table. This happens immediately. It's true that actual data may not be written into that section of the file till long after commit, but the kernel should do space allocation checking upon the first write. I have heard tell that this may not happen when you are dealing with NFS (yet another reason not to run databases across NFS) but on all local filesystems I know of, out-of-space should result in a failure before transaction commit. I say "should" because I suspect this isn't a very heavily tested code path in Postgres. But in theory it should work. Feel free to submit bug reports if you find it doesn't. regards, tom lane
Gavin, #1: I really think that we should have a way to set a "default tablespace" for any database in a cluster. This property would be vitally important for anyone wishing to use tablespaces to impose quotas. First, the superuser would: ALTER DATABASE db1 ALTER DEFAULT_TABLESPACE partition2; then any regular users creating tables in that database would, by default, have TABLESPACE partition2 automatically appended to them by the parser unless overridden in the creation statement by specifying another, specific, tablespace. Alternately, the default tablespace could be set through a GUC. In my mind, this would be inferior on 2 counts: 1) It would require adding Yet Another Miscellaneos GUC Variable. 2) It would preclude large, multisuer installations from seamlessly using tablespaces for quotas, becuase there would be no way to transparently set the GUC differently for each user or database. #2: Permissions: I see the permissions issue as quite transparent. First, I agree that only the superuser should have the right to create, alter, or drop tablespaces. 'nuff said. Second, as far as I can see, there is only one relevant permission for regular users: USE. Either the user is permitted to create objects in that tablespace, or he/she is not. Other permissions, such as read access, should NOT be set by tablespace, as such permissions are already governed by database, table, and schema; to add a SELECT restriction to tablespaces would frequently result in paralytic snarls of conflicting permissions on complex installations. Thus, by my proposal, the only GRANT for tablespaces (executed by a superuser) would be: GRANT USE ON tablespace1 TO user; This permission would ONLY be accessed for CREATE/ALTER TABLE, and CREATE INDEX statements. Easy, neh? #3: ALTER TABLE .... CHANGE TABLESPACE: This is strictly in the class of "would be a very nice & useful feature if it's not too difficult". Given how painful it is to drop & replace a table with multiple dependencies (on some databases, only possible by droping & re-loading the entire database) it would be nice to have an ALTER TABLE command that moved the table to another tablespace. It doesn't *seem* to me that this would be a very challenging bit of programming, as the operation would be very similar to REINDEX in the manipulation of files. (But what I know, really?) Once tablespaces are a feature and some users start using them for quota management, there will quickly develop situations where the original tablespace for a db runs out of room and can't be resized. Being able to move the table "in situ" then becomes vital, especially on very large databases ... and when someday combined with partitioned tables, will become essential. Further, we will get an *immediate* flurry of requests from users who just upgraded to 7.5 and want to make use of the tablespaces feature on an existing production database. ALTER INDEX ... CHANGE TABLESPACE is *not* needed, though, as there are no issues other than time which I know of with dropping & re-creating an index. If ALTER TABLE CHANGE TABLESPACE has some major technical hurdles, then I think it's one of those things that could be put off until the next version of tablespaces, or even held until Partition Tables is developed for a combined solution. But it would be nice to have. -- Josh Berkus Aglio Database Solutions San Francisco
On Thu, Feb 26, 2004 at 05:28:41PM -0500, Alex J. Avriette wrote: > The only reason I mentioned it to begin with was the recommendation of > directio for databases in the Sun Blueprint, _Tuning Databases on the > Solaris Platform_ (and being a Solaris geek, I asked, but apparently > nobody else is worried enough about performance or not using Solaris > enough to care). That recommendation itself is a few years old. While it may still be true that directio is still fastest for Oracle on Solaris, I'd sure like to see some recent evidence. I've a funny feeling that this is an old rule of thumb which is now true in the sense that everyone believes it, but maybe not in the sense that a test would reveal it to be a sensible rule. > like to see some of the features of Oracle and DB2 available to users > of postgres. Some of these features are raw disk, tablespaces, and > replication. We're getting there, and making really terrific progress I don't think we want features for their own sake, though, and I'm not convinced that raw filesystems are actually useful. Course, it's not my itch, and PostgreSQL _is_ free software. A -- Andrew Sullivan
>Gavin Sherry > On Fri, 27 Feb 2004 tswan@idigx.com wrote: > > Most people I know want tablespaces in order to limit or preallocate the > > disk space used by a table or database in addition to controlling the > > physical location of a table or database. > I do not intend to work on such a system for the initial introduction of > table spaces. The problem is, of course, knowing when you're actually out > of space in a table space in any given transaction. Given that WAL is on a > different partition (at least for the moment) the table space will not > have transaction X's data written to it until after transaction X is > finished. And we cannot error out a transaction which is already commited. > > The solution is to keep track of free space and error out at some > percentage of free space remaining. But I don't want to complicate > tablespaces too much in 7.5. You're absolutely right about the not-knowing when you're out of space issue. However, if the xlog has been written then it is not desirable, but at least acceptable that the checkpoint/bgwriter cannot complete on an already committed txn. It's not the txn which is getting the error, that's all. Hmmm...I'm not sure that we'll be able or should avoid the out of space situation completely. The question is...what will we do when we hit it? It doesn't matter whether you stop at 100% or 90% or whatever, you still have to stop and then what? Stay up as long as possible hopefully: If there wasn't enough space to write to the tablespace, going into recovery won't help the situation either; youre still out of space until you fix that. We now have the option not to crash, since it might be perfectly viable to keep on chugging away on one Tablespace even though all txn work on the out-of-space tablespace is frozen/barred etc. Sounds like a refinement, but something to keep in mind at the design stage if we can. The problem is that tablespaces do complicate space management (that's what people want though, so that's OK). That complicates admin and so pg will hit many more out of space errors than we've seen previously. Trying to work out how to spot these ahead of time, accept user defined limits on each tablespace etc sounds like extra complexity for the initial drop. I guess my own suggested approach is to start by handling the error cases, then go back and try to avoid some of them. All of this exposes for me the complication that doing PITR and tablespaces at the same time is likely to be more complex for us both than either had envisaged. The reduced complexity for PITR was what I was shooting for, also! I'm happy to work together on any issues that arise. For PITR, I think we would need: - a very accessible list of tablespace locations, so taking a full physical database backup can be easily accomplished using OS utilities. Hopefully a list maintained external to the database? We have the equivalent now with env variables. - decisions about what occurs when for-whatever-reason one or more tablespaces are not recoverable from backup? - it might be desirable to allow recovery with less than all of the original tablespces - it might also be desirable to allow recovery when the tablespaces txn Ids don't match (though that is forbidden on many other dbms) Best Regards, Simon Riggs
"Simon Riggs" <simon@2ndquadrant.com> writes: > Gavin Sherry wrote: >> I do not intend to work on such a system for the initial introduction of >> table spaces. The problem is, of course, knowing when you're actually out >> of space in a table space in any given transaction. Given that WAL is on a >> different partition (at least for the moment) the table space will not >> have transaction X's data written to it until after transaction X is >> finished. And we cannot error out a transaction which is already >> commited. As long as the kernel doesn't lie about file extension, we will not commit any transaction that requires a disallowed increase in the allocated size of data files, because allocation of another table page is checked with the kernel during the transaction. So on most filesystems (maybe not NFS) the problem Gavin is worried about doesn't exist. > You're absolutely right about the not-knowing when you're out of space > issue. However, if the xlog has been written then it is not desirable, > but at least acceptable that the checkpoint/bgwriter cannot complete on > an already committed txn. It's not the txn which is getting the error, > that's all. Right. This is in fact not a fatal situation, as long as you don't run out of preallocated WAL space. For a recent practical example of our behavior under zero-free-space conditions, see this thread: http://archives.postgresql.org/pgsql-hackers/2004-01/msg00530.php particularly the post-mortem here: http://archives.postgresql.org/pgsql-hackers/2004-01/msg00606.php Barring one small bug, the database would likely have stayed up, and continued to service at least the read-only transactions, until Chris got around to freeing some disk space. I think it is sufficient (at least in the near term) to expect people to use partition size limits if they want to control database size --- that is, make a partition of the desired size and put the database directory in there. Tablespaces as per the design we are discussing would make it easier to apply such a policy to a sub-area of a database cluster than it is today, but they needn't in themselves implement the restriction. regards, tom lane
Tom Lane [mailto:tgl@sss.pgh.pa.us] > "Simon Riggs" <simon@2ndquadrant.com> writes: > > You're absolutely right about the not-knowing when you're out of space > > issue. However, if the xlog has been written then it is not desirable, > > but at least acceptable that the checkpoint/bgwriter cannot complete on > > an already committed txn. It's not the txn which is getting the error, > > that's all. > > Right. This is in fact not a fatal situation, as long as you don't run > out of preallocated WAL space. ...following on also from thoughts on [PERFORM] list... Clearly running out of pre-allocated WAL space is likely to be the next issue. Running out of space in the first place is likely to be because of an intense workload, which is exactly the thing which also makes you run out of pre-allocated WAL space. Does that make sense? Best regards, Simon Riggs
> A table space parameter will be added to DDL commands which create > physical database objects (CREATE DATABASE/INDEX/TABLE/SEQUENCE) and to > CREATE SCHEMA. The associated routines, as well as the corresponding DROP > commands will need to be updated. Adding the ability to ALTER <object> > TABLESPACE <name> seems a little painful. Would people use it? Comments? How about allowing the specification on schemas and databases of different default tablespaces for TEMP, TABLE and INDEX?? Is there any point to that? Chris
> I've been looking at implementing table spaces for 7.5. Some notes and > implementation details follow. Ah sorry, other things you might need to consider: Privileges on tablespaces: GRANT USAGE ON TABLESPACE tbsp TO ...; Different disk settings for different tablespaces (since they will likely be on different disks): ALTER TABLESPACE tbsp SET random_page_cost TO 2.5; Chris
Christopher Kings-Lynne <chriskl@familyhealth.com.au> writes: > How about allowing the specification on schemas and databases of > different default tablespaces for TEMP, TABLE and INDEX?? Is there any > point to that? TEMP tables are not local to any particular schema, so it wouldn't make sense to have a schema-level default for their placement. The other five combinations are at least theoretically sensible, but do we need 'em all? It seems to me that a reasonable compromise is to offer database-level default tablespaces for TEMP, TABLE, and INDEX, ignoring the schema level. This is simple and understandable, and if you don't like it, you're probably the kind of guy who will want to override it per-table anyway ... BTW, another dimension to think about is where TOAST tables and their indexes will get placed. regards, tom lane
"Simon Riggs" <simon@2ndquadrant.com> writes: > Tom Lane wrote: >> Right. This is in fact not a fatal situation, as long as you don't >> run out of preallocated WAL space. > Clearly running out of pre-allocated WAL space is likely to be the next > issue. Running out of space in the first place is likely to be because > of an intense workload, which is exactly the thing which also makes you > run out of pre-allocated WAL space. Does that make sense? I think one of the first things people would do with tablespaces is stick the data files onto a separate partition from the WAL and clog files. (Actually you can do this today with a simple symlink hack, but tablespaces will make it easier and clearer.) The space usage for WAL is really pretty predictable, because of the checkpoint-at-least- every-N-segments setting. clog is not exactly a space hog either. Once you have that separation established, out-of-disk-space can kill individual transactions but never the database as a whole. One of the things that bothers me about the present PITR design is that it presumes that individual WAL log segments can be kept until the external archiver process feels like writing them somewhere. If there's no guarantee that that happens within X amount of time, then you can't bound the amount of space needed on the WAL drive, and so you are back facing the possibility of an out-of-WAL-space panic. I suspect that we cannot really do anything about that, but it's annoying. Any bright ideas out there? regards, tom lane
Gavin Sherry wrote: > The actual creation of the table space will be done with: > > CREATE TABLE SPACE <name> LOCATION </path/to/tblspc>; Seems you should use CREATE TABLESPACE (no space) so it is more distinct from CREATE TABLE. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Richard Huxton wrote: > On Thursday 26 February 2004 10:07, Gavin Sherry wrote: > > > > CREATE TABLESPACE tbl1 LOCATION '/var/' > > > > which will result in something like '/var/123443' is a bad idea. Then > > again, the user should know better. Comments? > > The LOCATION should have the same owner and permissions as $PGDATA - that > should catch mistyping. > > Unless you're running as root, of course. In which case you clearly know > better than everyone else, so off you go! FYI, you can't run the postmaster as root. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Joshua D. Drake wrote: > > Is it possible to put WALs and CLOGs into different tablespaces? (maybe > > different RAID systems). Some companies want that ... > > You can do this now, but it would be nice to be able to have it actually > configurable versus the hacked linked method. Agreed, but because the system has to be down to move pg_xlog, I think we should write a command-line utility to assist with this, perhaps. It could check permissions and stuff. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Tom Lane wrote: > Gavin Sherry <swm@linuxworld.com.au> writes: > > A table space is a directory structure. The directory structure is as > > follows: > > [swm@dev /path/to/tblspc]$ ls > > OID1/ OID2/ > > OID1 and OID2 are the OIDs of databases which have created a table space > > against this file system location. In this respect, a table space > > resembles $PGDATA/base. I thought it useful to keep this kind of > > namespace mechanism in place ... > > Actually, this is *necessary* AFAICT. The case that forces it is DROP > DATABASE. Since you have to execute that from another database, there's > no reasonable way to look into the target database's catalogs. That > means that the OID of the database has to be sufficient information to > get rid of all its files. You can do this fairly easily if in each > tablespace (whose locations you know from the shared pg_tablespace > table) you can look for a subdirectory matching the target database's > OID. If we tried to put the database's files just "loose" in each > tablespace directory then we'd be in trouble. > Gavin, let us know if you want us to create the global pg_tablespace for you. Some of us have done a lot of system catalog work. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
On Tue, 2 Mar 2004, Bruce Momjian wrote: > Gavin Sherry wrote: > > The actual creation of the table space will be done with: > > > > CREATE TABLE SPACE <name> LOCATION </path/to/tblspc>; > > Seems you should use CREATE TABLESPACE (no space) so it is more distinct > from CREATE TABLE. Oops. Typo. > > -- > Bruce Momjian | http://candle.pha.pa.us > pgman@candle.pha.pa.us | (610) 359-1001 > + If your life is a hard drive, | 13 Roberts Road > + Christ can be your backup. | Newtown Square, Pennsylvania 19073 > > > !DSPAM:40455de0297537578347468! > >
Tom Lane wrote: > Gavin Sherry <swm@linuxworld.com.au> writes: > >> Speaking of locking, can we do anything to prevent people from shooting > >> themselves in the foot by changing active tablespaces? Are we even > >> going to have a DROP TABLESPACE command, and if so what would it do? > > > Ahh. I forgot to detail my ideas on this. It seems to me that we cannot > > drop a table space until the directory is empty. > > How would it get to be empty? Are you thinking of some sort of "connect > database to tablespace" and "disconnect database from tablespace" > commands that would respectively create and delete the per-database > subdirectory? That seems moderately reasonable to me. We could then > invent a locking protocol that requires backends to lock a tablespace > before they can execute either of these operations (or delete the > tablespace of course). One crude solution would be to remove the tablespace oid directory only when the database is dropped, and require an empty tablespace directory to drop the tablespace. This allows a lock only on tablespace creation, and not a lock on object creation in each tablespace. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Tom Lane wrote: > One of the things that bothers me about the present PITR design is that > it presumes that individual WAL log segments can be kept until the > external archiver process feels like writing them somewhere. If there's > no guarantee that that happens within X amount of time, then you can't > bound the amount of space needed on the WAL drive, and so you are back > facing the possibility of an out-of-WAL-space panic. I suspect that we > cannot really do anything about that, but it's annoying. Any bright > ideas out there? Maybe specify an archive location (that of course could be on a separate partition) that the external archiver should check in addition to the normal WAL location. At some predetermined interval, push WAL log segments no longer needed to the archive location. Joe
Greg Stark wrote: > > > > I am expecting to hear some bleating about this from people whose > > > preferred platforms don't support symlinks ;-). However, if we don't > > Well, one option would be to have the low level filesystem storage (md.c?) > routines implement a kind of symlink themselves. Just a file with a special > magic number followed by a path. > > I'm normally against reimplementing OS services but symlinks are really a very > simple concept and simple to implement. Especially if you can make a few > simplifying assumptions: they only ever need to appear as the final path > element not as parent directories and tablespaces don't need symlinks pointing > to symlinks. Ideally postgres also doesn't need to implement relative links > either. I just checked from the MinGW console and I see: # touch a # ln -s a b # echo test >a # cat b # l ? -rw-r--r-- 1 Bruce Mo Administ 5 Mar 2 23:30 a -rw-r--r-- 1 Bruce Mo Administ 0 Mar 2 23:30 b # cat a test # cat b # It accepts ln -s, but does nothing with it. For tablespaces on OS's that don't support it, I think we will have to store the path name in the file and read it via the backend. Somehow we should cache those lookups. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Tom Lane wrote: > "scott.marlowe" <scott.marlowe@ihs.com> writes: > > Is possible / reasonable / smart and or dumb to look at implementing the > > tablespaces as riding atop the initlocation handled stuff. > > In my mind, one of the main benefits of this work will be that we'll be > able to get *rid* of the initlocation stuff. It's a crock. Agreed. It should be ripped out once we have tablespaces, and if we keep it for one extra release, there will be confusion over which to use. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Josh Berkus wrote: > #3: ALTER TABLE .... CHANGE TABLESPACE: > This is strictly in the class of "would be a very nice & useful feature if > it's not too difficult". ? > > Given how painful it is to drop & replace a table with multiple dependencies > (on some databases, only possible by droping & re-loading the entire > database) it would be nice to have an ALTER TABLE command that moved the > table to another tablespace. ? ?It doesn't *seem* to me that this would be a > very challenging bit of programming, as the operation would be very similar > to REINDEX in the manipulation of files. ? (But what I know, really?) > > Once tablespaces are a feature and some users start using them for quota > management, there will quickly develop situations where the original > tablespace for a db runs out of room and can't be resized. ? Being able to > move the table "in situ" then becomes vital, especially on very large > databases ... and when someday combined with partitioned tables, will become > essential. > > Further, we will get an *immediate* flurry of requests from users who just > upgraded to 7.5 and want to make use of the tablespaces feature on an > existing production database. If we don't implement moving tables between tablespaces, we should add a stub for it in the grammer and mention it is not implemented yet, because if we don't, we will get tons of questions. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Gavin Sherry wrote: > Actually, I think that's a pretty good idea :-). I'd solves a bunch of > issues in the backend (postmaster start up can recurse through > $PGDATA/tablespaces looking for postmaster.pid files) and will also assist > admins with complex configurations (perhaps). Why are you asking about postmaster.pid files. That file goes in the top level /data directory, no? -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
On Tue, 2 Mar 2004, Bruce Momjian wrote: > Gavin Sherry wrote: > > Actually, I think that's a pretty good idea :-). I'd solves a bunch of > > issues in the backend (postmaster start up can recurse through > > $PGDATA/tablespaces looking for postmaster.pid files) and will also assist > > admins with complex configurations (perhaps). > > Why are you asking about postmaster.pid files. That file goes in the > top level /data directory, no? I was trying to be paranoid about users who have multiple postmasters on the same machine and want to share a table space while both systems are live. There'd be no mechanism to test for that situation if we didn't have something like a postmaster.pid file. Is this being a little too paranoid? Gavin
Joe Conway <mail@joeconway.com> writes: > Tom Lane wrote: >> facing the possibility of an out-of-WAL-space panic. I suspect that we >> cannot really do anything about that, but it's annoying. Any bright >> ideas out there? > Maybe specify an archive location (that of course could be on a separate > partition) that the external archiver should check in addition to the > normal WAL location. At some predetermined interval, push WAL log > segments no longer needed to the archive location. Does that really help? The panic happens when you fill the "normal" and "archive" partitions, how's that different from one partition? regards, tom lane
Gavin Sherry <swm@linuxworld.com.au> writes: > I was trying to be paranoid about users who have multiple postmasters on > the same machine and want to share a table space while both systems are > live. There'd be no mechanism to test for that situation if we didn't have > something like a postmaster.pid file. Is this being a little too paranoid? Hm. AFAICS there is no safe situation in which a tablespace directory could be shared by two different installations (== toplevel $PGDATA directories). I don't think we need a dynamic postmaster.pid-type lock to protect them. What might make sense is some sort of marker file in a tablespace directory that links back to the owning $PGDATA directory. CREATE TABLESPACE should create this, or reject if it already exists. regards, tom lane
Gavin Sherry wrote: > On Tue, 2 Mar 2004, Bruce Momjian wrote: > > > Gavin Sherry wrote: > > > Actually, I think that's a pretty good idea :-). I'd solves a bunch of > > > issues in the backend (postmaster start up can recurse through > > > $PGDATA/tablespaces looking for postmaster.pid files) and will also assist > > > admins with complex configurations (perhaps). > > > > Why are you asking about postmaster.pid files. That file goes in the > > top level /data directory, no? > > I was trying to be paranoid about users who have multiple postmasters on > the same machine and want to share a table space while both systems are > live. There'd be no mechanism to test for that situation if we didn't have > something like a postmaster.pid file. Is this being a little too paranoid? Oh, yikes, I see. Right now we have the interlock on the /data directory, but once you start moving stuff out from under /data using tablespaces, we do perhaps loose the interlock. However, I assume the CREATE TABLESPACE is going to create the tablespace directory, so I don't see how two postmasters could both create the directory. For example, if you say CREATE TABLESPACE tb IN '/var/tb1' I assume you have to create: /var/tb1/pgsql_tablespace and then /var/tb1/pgsql_tablespace/oid1/var/tb1/pgsql_tablespace/oid2 or something like that, and set the proper permissions on pgsql_tablespace. We will have write permission on the directory they pass to us, but we might not have permissions to change the mode of the directory they pass, so we have to create a subdirectory anyway, and that is our interlock. For example:# run as root$ chmod a+w .$ ls -ld .drwxrwxrwx 2 root wheel 512 Mar 2 23:51 .# run as the postmaster$ mkdirnew$ ls -ld newdrwxr-xr-x 2 postgres wheel 512 Mar 2 23:52 new$ chmod 700 new$ chmod 700 .chmod: .: Operation notpermittedchmod: .: Operation not permitted As you can see, I have permission to create the /new directory, but no ability to set its mode, so we have to create a directory that matches the permissions of /data: drwx------ 6 postgres postgres 512 Mar 2 12:48 /u/pg/data/ We could require the admin to create a directory that we own instead of just one that we have write permission in, but why bother when we can use the new directory as an interlock from multiple postmasters anyway. Right now we do require the directory used as /data be one where we can create a /data subdirectory, so this seems similar. We don't put the data directly in the passed directory, but in /data under that. In fact, we could just call it /var/tb1/data instead of /var/tb1/pgsql_tablespace. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Tom Lane wrote: > Joe Conway <mail@joeconway.com> writes: >>Maybe specify an archive location (that of course could be on a separate >>partition) that the external archiver should check in addition to the >>normal WAL location. At some predetermined interval, push WAL log >>segments no longer needed to the archive location. > > Does that really help? The panic happens when you fill the "normal" and > "archive" partitions, how's that different from one partition? I see your point. But it would allow you to use a relatively modest local partition for WAL segments, while you might be using a 1TB netapp tray over NFS for the archive segments. I guess if the archive partition fills up, I would err on the side of dropping archive segments on the floor. That would mean a new full backup would be needed, but at least it wouldn't result in a corrupt, or shut down, database. Joe
Bruce Momjian <pgman@candle.pha.pa.us> writes: > For tablespaces on OS's that don't support it, I think we will have to > store the path name in the file and read it via the backend. Somehow we > should cache those lookups. My feeling is that we need not support tablespaces on OS's without symlinks. regards, tom lane
Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > For tablespaces on OS's that don't support it, I think we will have to > > store the path name in the file and read it via the backend. Somehow we > > should cache those lookups. > > My feeling is that we need not support tablespaces on OS's without > symlinks. Agreed, but are we going to support non-tablespace installs? I wasn't sure that was an option. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
On Wed, 3 Mar 2004, Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > For tablespaces on OS's that don't support it, I think we will have to > > store the path name in the file and read it via the backend. Somehow we > > should cache those lookups. > > My feeling is that we need not support tablespaces on OS's without > symlinks. I'm going to focus on implementing this on the system(s) I'm used to developing on (ie, those which support symlinks). Once that is done, I'll talk with the Win32 guys about what, if anything, we can do about getting this to work on Win32 (and possibly other non-symlink supporting OSs). Thanks, Gavin
Joe Conway <mail@joeconway.com> writes: > Tom Lane wrote: >> Joe Conway <mail@joeconway.com> writes: >>> Maybe specify an archive location (that of course could be on a separate >>> partition) that the external archiver should check in addition to the >>> normal WAL location. At some predetermined interval, push WAL log >>> segments no longer needed to the archive location. >> >> Does that really help? The panic happens when you fill the "normal" and >> "archive" partitions, how's that different from one partition? > I see your point. But it would allow you to use a relatively modest > local partition for WAL segments, while you might be using a 1TB netapp > tray over NFS for the archive segments. Fair enough, but it seems to me that that sort of setup really falls in the category of a user-defined archiving process --- that is, the hook that Postgres calls will push WAL segments from the local partition to the NFS server, and then pushing them off NFS to tape is the responsibility of some other user-defined subprocess. Database panic happens if and only if the local partition overflows. I don't see that making Postgres explicitly aware of the secondary NFS arrangement will buy anything. > I guess if the archive partition fills up, I would err on the side of > dropping archive segments on the floor. That should be user-scriptable policy, in my worldview. We haven't yet talked much about what the WAL-segment-archiving API should look like, but if it cannot support implementing the above kind of arrangement outside the database, then we've dropped the ball. IMHO anyway. regards, tom lane
Bruce Momjian <pgman@candle.pha.pa.us> writes: > Tom Lane wrote: >> My feeling is that we need not support tablespaces on OS's without >> symlinks. > Agreed, but are we going to support non-tablespace installs? I wasn't > sure that was an option. A setup containing only the default tablespace cannot use any symlinks. That doesn't seem hard though. regards, tom lane
Bruce Momjian writes: > I just checked from the MinGW console and I see: > [snip] > It accepts ln -s, but does nothing with it. And even if it had worked, it wouldn't really matter, since we don't actually want to *run* the system under MinGW/msys, just build it. I think the idea of implementing in symlinks for non-compliant platforms in md.c has some merit. FWIW, looks like that is how cygwin implements symlinks... Cheers, Claudio --- Certain disclaimers and policies apply to all email sent from Memetrics. For the full text of these disclaimers and policies see <a href="http://www.memetrics.com/emailpolicy.html">http://www.memetrics.com/em ailpolicy.html</a>
On Wed, 2004-03-03 at 04:59, Tom Lane wrote: > What might make sense is some sort of marker file in a > tablespace directory that links back to the owning $PGDATA directory. > CREATE TABLESPACE should create this, or reject if it already exists. It will not be enough for the marker to list the path of the parent $PGDATA, since that path might get changed by system administration action. The marker should contain some sort of unique string which would match the same string somewhere in $PGDATA. Then, if either tablespace or $PGDATA were moved, it would be possible to tie the two back together. It wouldn't be an issue on most normal systems, but might be of crucial importance for an ISP running numerous separate clusters. -- Oliver Elphick <olly@lfix.co.uk> LFIX Ltd
Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > Tom Lane wrote: > >> My feeling is that we need not support tablespaces on OS's without > >> symlinks. > > > Agreed, but are we going to support non-tablespace installs? I wasn't > > sure that was an option. > > A setup containing only the default tablespace cannot use any symlinks. > That doesn't seem hard though. Yea, I think you are right. We just disable CREATE TABLESPACE and the rest should just work. Basically, pg_tablespace will only have one entry on those platforms. The initdb directory structure will have a single tablespace, but that doesn't use symlinks. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Claudio Natoli wrote: > > Bruce Momjian writes: > > I just checked from the MinGW console and I see: > > [snip] > > It accepts ln -s, but does nothing with it. > > And even if it had worked, it wouldn't really matter, since we don't > actually want to *run* the system under MinGW/msys, just build it. > > I think the idea of implementing in symlinks for non-compliant platforms in > md.c has some merit. FWIW, looks like that is how cygwin implements > symlinks... Why can't we use MS Win32 shortcut files to simulate symlinks? MinGW doesn't do it, so I suppose it isn't possible. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Oliver Elphick <olly@lfix.co.uk> writes: > It will not be enough for the marker to list the path of the parent > $PGDATA, since that path might get changed by system administration > action. The marker should contain some sort of unique string which > would match the same string somewhere in $PGDATA. We have already added a notion of a "unique installation identifier" for PITR purposes (look in pg_control). So we could use that for this purpose if we wanted to. But I'm not sure how important it really is. AFAICS the behavior of CREATE TABLESPACE will be "create marker file, if it already exists then abort". It has no need to actually look in the file and so there's no need for the contents to be unique. regards, tom lane
Bruce Momjian <pgman@candle.pha.pa.us> writes: > Greg Stark wrote: > > > > > > I am expecting to hear some bleating about this from people whose > > > > preferred platforms don't support symlinks ;-). However, if we don't > > > > Well, one option would be to have the low level filesystem storage (md.c?) > > routines implement a kind of symlink themselves. Just a file with a special > > magic number followed by a path. On further contemplation it doesn't seem like using symlinks really ought to be necessary. It should be possible to drive everything off the catalog tables while avoidin having the low level filesystem code know anything about them. Instead of having the low level code fetch the pg_* records themselves, some piece of higher level code would do the query and call down to storage layer to inform it of the locations for everything. It would have to do this on database initialization and on any subsequent object creation. Basically maintain an in-memory hash table of oid -> path, and call down to the low level code whenever that hash changes. (Or more likely oid->ts_id and a separate list of ts_id -> path.) -- greg
Oliver Elphick wrote: > On Wed, 2004-03-03 at 04:59, Tom Lane wrote: > >> What might make sense is some sort of marker file in a >>tablespace directory that links back to the owning $PGDATA directory. >>CREATE TABLESPACE should create this, or reject if it already exists. > > > It will not be enough for the marker to list the path of the parent > $PGDATA, since that path might get changed by system administration > action. The marker should contain some sort of unique string which > would match the same string somewhere in $PGDATA. Then, if either > tablespace or $PGDATA were moved, it would be possible to tie the two > back together. It wouldn't be an issue on most normal systems, but > might be of crucial importance for an ISP running numerous separate > clusters. Taking this one step further would be to do something like Oracle does. Every datafile in Oracle (because the Oracle storagemanager stores multiple objects inside datafiles, one could say there is some similarity between Oracle datafiles and the proposed pg tablespaces), has meta info that tells it which database instance it belongs to and the last checkpoint that occured (It might actually be more granular than checkpoint, such that on a clean shutdown you can tell that all datafiles are consistent with each other and form a consistent database instance). So Oracle on every checkpoint updates all datafiles with an identifier. Now you might ask why is this useful. Well in normal day to day operation it isn't, but it can be usefull in disaster recovery. If you loose a disk and need to restore the entire database from backups it can be difficult to make sure you have done it all correctly (do I have all the necessary files/directories? did I get the right ones from the right tapes?) Especially if you have directories spread across various different disks that might be backed up to different tapes. So by having additional information stored in each datafile Oracle can provide additional checks that the set of files that are being used when the database starts up are consistent and all belong together. Oracle also ensures that all the datafiles that are suposed to exist actually do as well. So what might this mean for postgres and tablespaces? It could mean that on startup the database checks to verify that all the tablespaces that are registered actually exist. And that the data in each tablespace is consistent with the current WAL status. (i.e. someone didn't restore a tablespace from backup while the database was down that is old and needs recovery. A lot of what I am talking about here become PITR issues. But since PITR and tablespaces are both potential features for 7.5, how they interact probably should be thought about in the designs for each. thanks, --Barry
Greg Stark wrote: > > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > > Greg Stark wrote: > > > > > > > > I am expecting to hear some bleating about this from people whose > > > > > preferred platforms don't support symlinks ;-). However, if we don't > > > > > > Well, one option would be to have the low level filesystem storage (md.c?) > > > routines implement a kind of symlink themselves. Just a file with a special > > > magic number followed by a path. > > On further contemplation it doesn't seem like using symlinks really ought to > be necessary. It should be possible to drive everything off the catalog tables > while avoidin having the low level filesystem code know anything about them. > > Instead of having the low level code fetch the pg_* records themselves, some > piece of higher level code would do the query and call down to storage layer > to inform it of the locations for everything. It would have to do this on > database initialization and on any subsequent object creation. > > Basically maintain an in-memory hash table of oid -> path, and call down to > the low level code whenever that hash changes. (Or more likely oid->ts_id and > a separate list of ts_id -> path.) The advantage of symlinks is that an administrator could see how things are laid out from the command line. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
>Tom Lane [mailto:tgl@sss.pgh.pa.us] > Joe Conway <mail@joeconway.com> writes: > > Tom Lane wrote: > >> Joe Conway <mail@joeconway.com> writes: > >>> Maybe specify an archive location (that of course could be on a > separate > >>> partition) that the external archiver should check in addition to the > >>> normal WAL location. At some predetermined interval, push WAL log > >>> segments no longer needed to the archive location. > >> > >> Does that really help? The panic happens when you fill the "normal" > and > >> "archive" partitions, how's that different from one partition? > > > I see your point. But it would allow you to use a relatively modest > > local partition for WAL segments, while you might be using a 1TB netapp > > tray over NFS for the archive segments. > > Fair enough, but it seems to me that that sort of setup really falls in > the category of a user-defined archiving process --- that is, the hook > that Postgres calls will push WAL segments from the local partition to > the NFS server, and then pushing them off NFS to tape is the > responsibility of some other user-defined subprocess. Database panic > happens if and only if the local partition overflows. I don't see that > making Postgres explicitly aware of the secondary NFS arrangement will > buy anything. Tom's last sentence there summarises the design I was working with. I had considered Joe's suggested approach (which was Oracle's also). However, the PITR design will come with a usable low-function program which can easily copy logs from pg_xlog to another archive directory. That's needed as a test harness anyway, so it may as well be part of the package. You'd be able to use that in production to copy xlogs to another larger directory as a staging area to tape/failover on another system: effectively Joe's idea is catered for in the basic package. Anyway I'm answering questions before publishing the design as stands...though people do keep spurring me to refine it as I'm writing it down! That's why its good to document it I guess. > > I guess if the archive partition fills up, I would err on the side of > > dropping archive segments on the floor. > > That should be user-scriptable policy, in my worldview. Hmmm. Very difficult that one. My experience is in commercial systems. Dropping archive segments on the floor is just absolutely NOT GOOD, if that is the only behaviour. The whole purpose of having a dbms is so that you can protect your business data, while using it. Such behaviour would most likely be a barrier to wider commercial adoption. [Oracle and other dbms will freeze when this situation is hit, rather than continue and drop archive logs.] User-selectable behaviour? OK. That's how we deal with fsync; I can relate to that. That hadn't been part of my thinking because of the importance I'd attached to the log files themselves, but I can go with that, if that's what was meant. So, if we had a parameter called Wal_archive_policy that has 3 settings: None = no archiving Optimistic = archive, but if for some reason log space runs out then make space by dropping the oldest archive logs Strict = if log space runs out, stop further write transactions from committing, by whatever means, even if this takes down dbms. That way, we've got something akin to transaction isolation level with various levels of protection. Best Regards, Simon Riggs
Joe Conway <mail@joeconway.com> writes: > Simon Riggs wrote: >> O... and other dbms will freeze when this situation is hit, rather >> than continue and drop archive logs.] > Been there, done that, don't see how it's any better. I hesitate to be > real specific here, but let's just say the end result was restore from > backup :-( It's hard for me to imagine a situation in which killing the database would be considered a more attractive option than dropping old log data. You may or may not ever need the old log data, but you darn well do need a functioning database. (If you don't, you wouldn't be going to all this work.) I think also that Simon completely misunderstood my intent in saying that this could be "user-scriptable policy". By that I meant that the *user* could write the code to behave whichever way he liked. Not that we were going to go into a mad rush of feature invention and try to support every combination we could think of. I repeat: code that pushes logs into a secondary area is not ours to write. We should concentrate on providing an API that lets users write it. We have only limited manpower for this project and we need to spend it on getting the core functionality done right, not on inventing frammishes. regards, tom lane
Simon Riggs wrote: > User-selectable behaviour? OK. That's how we deal with fsync; I can > relate to that. That hadn't been part of my thinking because of the > importance I'd attached to the log files themselves, but I can go with > that, if that's what was meant. > > So, if we had a parameter called Wal_archive_policy that has 3 settings: > None = no archiving > Optimistic = archive, but if for some reason log space runs out then > make space by dropping the oldest archive logs > Strict = if log space runs out, stop further write transactions from > committing, by whatever means, even if this takes down dbms. > > That way, we've got something akin to transaction isolation level with > various levels of protection. Yep, we will definately need something like that. Basically whenever the logs are being archived, you have to stop the database if you can't archive, no? -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Simon Riggs wrote: >> Tom Lane [mailto:tgl@sss.pgh.pa.us] That should be user-scriptable >> policy, in my worldview. > O... and other dbms will freeze when this situation is hit, rather > than continue and drop archive logs.] Been there, done that, don't see how it's any better. I hesitate to be real specific here, but let's just say the end result was restore from backup :-( > So, if we had a parameter called Wal_archive_policy that has 3 > settings: None = no archiving Optimistic = archive, but if for some > reason log space runs out then make space by dropping the oldest > archive logs Strict = if log space runs out, stop further write > transactions from committing, by whatever means, even if this takes > down dbms. That sounds good to me. For the "Optimistic" case, we need to yell loudly if we do find ourselves needing to drop segments. For the "Strict" case, we just need to be sure it works correctly ;-) Joe
Bruce Momjian wrote: >Greg Stark wrote: > > >>Bruce Momjian <pgman@candle.pha.pa.us> writes: >> >> >> >>>Greg Stark wrote: >>> >>> >>>>>>I am expecting to hear some bleating about this from people whose >>>>>>preferred platforms don't support symlinks ;-). However, if we don't >>>>>> >>>>>> >>>>Well, one option would be to have the low level filesystem storage (md.c?) >>>>routines implement a kind of symlink themselves. Just a file with a special >>>>magic number followed by a path. >>>> >>>> >>On further contemplation it doesn't seem like using symlinks really ought to >>be necessary. It should be possible to drive everything off the catalog tables >>while avoidin having the low level filesystem code know anything about them. >> >>Instead of having the low level code fetch the pg_* records themselves, some >>piece of higher level code would do the query and call down to storage layer >>to inform it of the locations for everything. It would have to do this on >>database initialization and on any subsequent object creation. >> >>Basically maintain an in-memory hash table of oid -> path, and call down to >>the low level code whenever that hash changes. (Or more likely oid->ts_id and >>a separate list of ts_id -> path.) >> >> > >The advantage of symlinks is that an administrator could see how things >are laid out from the command line. > > > That's a poor reason to require symlinks. The administrator can just as easily open up psql and query pg_tablespace to see that same information. Besides, the postgres doesn't know where to look on the filesystem for the /path/to/oid without a system catalog lookup. There doesn't seem to be any sensible reason to force a filesystem requirement when the core operations are diffferent to begin with. If a more global view of all databases is necessary, perhaps there ought to be a system wide view which could display all of that information at once: dbname, relation name, and physical location.
Thomas Swan wrote: > >The advantage of symlinks is that an administrator could see how things > >are laid out from the command line. > > > > > > > That's a poor reason to require symlinks. The administrator can just as > easily open up psql and query pg_tablespace to see that same > information. Besides, the postgres doesn't know where to look on the > filesystem for the /path/to/oid without a system catalog lookup. There > doesn't seem to be any sensible reason to force a filesystem requirement > when the core operations are diffferent to begin with. > > If a more global view of all databases is necessary, perhaps there ought > to be a system wide view which could display all of that information at > once: dbname, relation name, and physical location. Who doesn't have symlinks these days, and is going to be using tablespaces? Even Win32 has them. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Thomas Swan <tswan@idigx.com> writes: > Bruce Momjian wrote: >> The advantage of symlinks is that an administrator could see how things >> are laid out from the command line. >> > That's a poor reason to require symlinks. The administrator can just as > easily open up psql and query pg_tablespace to see that same > information. Something to keep in mind here is that one of the times you would most likely need that information is when the database is broken and you *can't* simply "open up psql" and inspect system catalogs. I like the fact that a symlink implementation can be inspected without depending on a working database. If we were going to build a non-symlink implementation, I'd want the highlevel-to-lowlevel data transfer to take the form of a flat ASCII file that could be inspected by hand, rather than some hidden in-memory datastructure. But given the previous discussion in this thread, I cannot see any strong reason not to rely on symlinks for the purpose. We are not in the business of building replacements for OS features. regards, tom lane
Tom Lane wrote: >Thomas Swan <tswan@idigx.com> writes: > > >>Bruce Momjian wrote: >> >> >>>The advantage of symlinks is that an administrator could see how things >>>are laid out from the command line. >>> >>> >>> >>That's a poor reason to require symlinks. The administrator can just as >>easily open up psql and query pg_tablespace to see that same >>information. >> >> > >Something to keep in mind here is that one of the times you would most >likely need that information is when the database is broken and you >*can't* simply "open up psql" and inspect system catalogs. I like the >fact that a symlink implementation can be inspected without depending on >a working database. > > > That's a sufficient argument, to allow for it. Recoverability would be one reason. >If we were going to build a non-symlink implementation, I'd want the >highlevel-to-lowlevel data transfer to take the form of a flat ASCII >file that could be inspected by hand, rather than some hidden in-memory >datastructure. But given the previous discussion in this thread, >I cannot see any strong reason not to rely on symlinks for the purpose. >We are not in the business of building replacements for OS features. > > > I do like the flat file output at least for a record of what went where. Regardless of whether or not symlinks are used, the admin would need to know what directories/files/filesystems are to be backed up. I am concerned as to what extent different filesystems do when you back the directories up. Would NTFS containing symlinks be able to be backed up with a tar/zip command, or is something more elaborate needed? In the past, before upgrading, I have had to tarthe pgdata directory with the postmaster shutdown to insure a quick restoration of the database in case an upgrade didn't proceed uneventfully. Also, in the event of a major version upgrade the restored information may or may not proceed uneventfully. I just wanted to point out something I thought might be an issue further down the road. Perhaps the system catalog / flat file approach would be a more solid approach, both of which would not involve replacing or duplicating OS features.
Tom Lane wrote: > I think also that Simon completely misunderstood my intent in saying > that this could be "user-scriptable policy". By that I meant that the > *user* could write the code to behave whichever way he liked. Not that > we were going to go into a mad rush of feature invention and try to > support every combination we could think of. I repeat: code that pushes > logs into a secondary area is not ours to write. We should concentrate > on providing an API that lets users write it. We have only limited > manpower for this project and we need to spend it on getting the core > functionality done right, not on inventing frammishes. Hmm... I totally agree. I think the backend could just offer a shared memory segment and a marker message to another process to allow copy from it. then it is the applications business to do the things. Of course there has to be a two way agreement about it but an API is a real nice thing rather than an application. Shridhar
On 3 March 2004, at 19:52, Bruce Momjian wrote: > The advantage of symlinks is that an administrator could see how things > are laid out from the command line. One thing to keep in mind is that system administrators don't see symlinks as being informational -- they see them as the actual UI for the redirection in question. So their expectation is that they'll be able to move the actual directory around at will (as long as they update the symlink to match). If symlinks are used, the rule of least surprise would mean that no information whatsoever about the physical location of a tablespace should be stored in the system catalogs. Otherwise their relationship with the information stored in the symlink is ambiguous. mk
On Mon, Mar 08, 2004 at 02:07:35AM +0200, Marko Karppinen wrote: > One thing to keep in mind is that system administrators don't see > symlinks as being informational -- they see them as the actual UI > for the redirection in question. So their expectation is that they'll > be able to move the actual directory around at will (as long as they > update the symlink to match). This is a good point. It's worth keeping in mind, too, that in large shops, the DBAs and the sysadmins often are in separate departments with separate management, precisely because the database system has traditionally been somewhat divorced from the OS (as an aside, I suspect that this sort of separation is part of the reason for the popularity of raw filesystems among DBAs. Even if they didn't provide better speed, it's just preferable not to have to involve another department). System administrators in such places have been known to decide to "reorganise the disks", assuming that the database just has its own home. For such a sysadmin, a pile of symlinks would be fair game for reorganisation. A -- Andrew Sullivan | ajs@crankycanuck.ca
Andrew Sullivan wrote: >eorganise the disks", assuming that the database >just has its own home. For such a sysadmin, a pile of symlinks would >be fair game for reorganisation. > > Please take into consideration that symlinks might be every day work for *nix admins, but for win admins it's very uncommon. Additionally, win admins are accustomed to gui tools, and many of them will stumble if forced to use a command line. For worse, junctions are not distinguishable in explorer; only the shell's dir command knows about junctions. This obfuscation makes junctions quite a bad choice for admin purposes. IMHO there are only two viable options: - no tablespaces for win32, i.e. recommend *ix for bigger installations - a different tablespace storage approach., e.g. simple desktop links (or alike) redirecting to a directory. Regards, Andreas
Andrew Sullivan wrote: > On Mon, Mar 08, 2004 at 02:07:35AM +0200, Marko Karppinen wrote: > > One thing to keep in mind is that system administrators don't see > > symlinks as being informational -- they see them as the actual UI > > for the redirection in question. So their expectation is that they'll > > be able to move the actual directory around at will (as long as they > > update the symlink to match). > > This is a good point. It's worth keeping in mind, too, that in large > shops, the DBAs and the sysadmins often are in separate departments > with separate management, precisely because the database system has > traditionally been somewhat divorced from the OS (as an aside, I > suspect that this sort of separation is part of the reason for the > popularity of raw filesystems among DBAs. Even if they didn't > provide better speed, it's just preferable not to have to involve > another department). System administrators in such places have been > known to decide to "reorganise the disks", assuming that the database > just has its own home. For such a sysadmin, a pile of symlinks would > be fair game for reorganisation. Agreed. I think the idea is to use lstat to query the symlink, rather than storing that information in the database. My idea was to create an lstat server-side function that could be used by pg_dump and friends. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
>Joe Conway [mailto:mail@joeconway.com] > Simon Riggs wrote: > >> Tom Lane [mailto:tgl@sss.pgh.pa.us] That should be user-scriptable > >> policy, in my worldview. > > > O... and other dbms will freeze when this situation is hit, rather > > than continue and drop archive logs.] > > Been there, done that, don't see how it's any better. I hesitate to be > real specific here, but let's just say the end result was restore from > backup :-( > > > So, if we had a parameter called Wal_archive_policy that has 3 > > settings: None = no archiving Optimistic = archive, but if for some > > reason log space runs out then make space by dropping the oldest > > archive logs Strict = if log space runs out, stop further write > > transactions from committing, by whatever means, even if this takes > > down dbms. > > That sounds good to me. For the "Optimistic" case, we need to yell > loudly if we do find ourselves needing to drop segments. For the > "Strict" case, we just need to be sure it works correctly ;-) Good. Yell loudly really needs to happen sometime earlier, which is as Gavin originally thought something to do with tablespaces. Strict behaviour is fairly straightforward, you just PANIC! I'd think we could rename these to Fail Operational rather than Optimistic Fail Safe rather than Strict ...the other names were a bit like "I'm right" and "but I'll do yours too" ;} Best Regards, Simon Riggs
>Bruce Momjian > Simon Riggs wrote: > > User-selectable behaviour? OK. That's how we deal with fsync; I can > > relate to that. That hadn't been part of my thinking because of the > > importance I'd attached to the log files themselves, but I can go with > > that, if that's what was meant. > > > > So, if we had a parameter called Wal_archive_policy that has 3 settings: > > None = no archiving > > Optimistic = archive, but if for some reason log space runs out then > > make space by dropping the oldest archive logs > > Strict = if log space runs out, stop further write transactions from > > committing, by whatever means, even if this takes down dbms. > > > > That way, we've got something akin to transaction isolation level with > > various levels of protection. > > Yep, we will definately need something like that. Basically whenever > the logs are being archived, you have to stop the database if you can't > archive, no? That certainly was my initial feeling, though I believe it is possible to accommodate both viewpoints. I would not want to have only the alternative viewpoint, I must confess. Best Regards, Simon Riggs
Please excuse the delay in replying.. >Tom Lane > Joe Conway <mail@joeconway.com> writes: > > Simon Riggs wrote: > >> O... and other dbms will freeze when this situation is hit, rather > >> than continue and drop archive logs.] > > > Been there, done that, don't see how it's any better. I hesitate to be > > real specific here, but let's just say the end result was restore from > > backup :-( Myself also. I accept your experience and insight, I apologise if my own seemed overblown. My take on that is that if you're in a situation that has a high probability of going bad, the last thing you would want is to drop xlogs. Same technical experience, different viewpoint on what to learn from it. > It's hard for me to imagine a situation in which killing the database > would be considered a more attractive option than dropping old log > data. You may or may not ever need the old log data, but you darn well > do need a functioning database. (If you don't, you wouldn't be going to > all this work.) The main point here for me is that the choice of keeping archived (not old) log files against keeping the database up isn't actually mine to make; that choice belongs to the owner of the database, not me as developer or administrator, consultant or whatever. Although I admit I did not at first comprehend that such a view was possible, I did flex to allow yours and Joe's perspective when that was voiced. The point is one of risk: does the owner wish to risk the possibility that a transaction may be lost in order to keep the database up? The possibility of lost rows must be balanced against the probably higher possibility of being unable to write new data. But which is worse? Who can say? In some environments where I have worked, (again forgive any seeming personal arrogance or posturing), such as banks or finance generally, it has been desirable to stop the system rather than risk losing even a single row. In other situations, lost rows must be balanced against the money lost through downtime. Guess it depends whether you've got a contract for uptime or for data integrity?? ;) > I repeat: code that pushes > logs into a secondary area is not ours to write. We should concentrate > on providing an API that lets users write it. Agreed. > We have only limited > manpower for this project and we need to spend it on getting the core > functionality done right, not on inventing frammishes. Love that word "frammish"...seriously, I understand and agree. My understanding is that existing logic will cause a PANIC if the xlog directory cannot be written to. Helping the database stay up by dropping logs would require extra code... This was an edge case anyhow... Best Regards, Simon Riggs
Simon Riggs wrote: > >Bruce Momjian > > Simon Riggs wrote: > > > User-selectable behaviour? OK. That's how we deal with fsync; I can > > > relate to that. That hadn't been part of my thinking because of the > > > importance I'd attached to the log files themselves, but I can go > with > > > that, if that's what was meant. > > > > > > So, if we had a parameter called Wal_archive_policy that has 3 > settings: > > > None = no archiving > > > Optimistic = archive, but if for some reason log space runs out then > > > make space by dropping the oldest archive logs > > > Strict = if log space runs out, stop further write transactions from > > > committing, by whatever means, even if this takes down dbms. > > > > > > That way, we've got something akin to transaction isolation level > with > > > various levels of protection. > > > > Yep, we will definately need something like that. Basically whenever > > the logs are being archived, you have to stop the database if you > can't > > archive, no? > > That certainly was my initial feeling, though I believe it is possible > to accommodate both viewpoints. I would not want to have only the > alternative viewpoint, I must confess. > Added to PITR TODO list. Anything else to add: http://momjian.postgresql.org/main/writings/pgsql/project -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
"Simon Riggs" <simon@2ndquadrant.com> writes: > Strict behaviour is fairly straightforward, you just PANIC! There is another mode possible as well. Oracle for example neither panics nor continues, it just freezes. It keeps retrying the transaction until it finds it has space. The sysadmin or dba just has to somehow create additional space by removing old files or however and the database will continue where it left off. That seems a bit nicer than panicing. When I first heard that I was shocked. It means implementing archive logs *created* a new failure mode where there was none before. I thought that was the dumbest idea in the world: who needed a backup process that increased the chances of an outage? Now I can see the logic, but I'm still not sure which mode I would pick if it was up to me. As others have said, I guess it would depend on the situation. -- greg
Dennis Bjorklund <db@zigo.dhs.org> writes: > On Thu, 26 Feb 2004, Gavin Sherry wrote: > > > Comments? Questions? Suggestions? > > Is that plan that in the future one can split a single table into > different table spaces? Like storing all rows with year < 1999 in one > tablespace and the rest in another? That's a separate orthogonal feature called "partitioned tables". There's some amount of resistance to the idea amongst postgres people, and there's nobody who has spoken up interested in implementing it, but there's also lots of interest from users. A good patch would probably go a long way to convincing people :) Table spaces are being able to store different tables in different physical locations on disk. A first version of this has actually been implemented for 7.5 using symlinks. Partitioned tables and tablespaces do indeed have a certain amount of synergy. But even in a single tablespace your example makes sense. > With the rule system and two underlying tables one could make it work by > hand I think. The rule system could be used to do this, but there was some discussion of using inherited tables to handle it. However neither handles the really hard part of detecting queries that use only a part of the table and taking that into account in generating the plan. -- greg
> > With the rule system and two underlying tables one could make it work by > > hand I think. > > The rule system could be used to do this, but there was some discussion of > using inherited tables to handle it. However neither handles the really hard > part of detecting queries that use only a part of the table and taking that > into account in generating the plan. I think the consensus should be to add smarts to the planner to include static constraint information to reduce table access. e.g if you have a constraint "acol integer, check acol < 5" and you have a query with a "where acol = 10" you could reduce that to "where false". This would help in all sorts of situations not only partitioned/inherited tables. I am not sure what the runtime cost of such an inclusion would be, so maybe it needs smarts to only try in certain cases ? Andreas
"Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes: > e.g if you have a constraint "acol integer, check acol < 5" > and you have a query with a "where acol = 10" you could reduce that > to "where false". I think part of the question is how much work do you put into checking this. Checking constant known values like above is probably not too expensive. Checking for ranges like "where acol between 5 and 10" is probably doable. And that might be enough for partitioned tables. I think that's about all Oracle bothers to check, for example. More complex where clauses and check expressions might be hard to prove are true or false. But then the work's still not done, you still have to add an optimization that prunes members of a UNION ALL (or equivalent if it's done using inherited tables or some other infrastructure) if they are known to provably produce zero rows. And then there are more subtle cases. Like if the query is "where acol = ?". Then you know it only has to read one partition, but you don't know which one at compile time. And it's important to handle that case because that might be the only clause. So knowing that you only need one partition might be the difference between a sequential scan of one partition, or an index scan of many thousands of records because they're only a small percentage of the entire table. -- greg
> I don't think we want features for their own sake, though, and I'm > not convinced that raw filesystems are actually useful. Course, it's > not my itch, and PostgreSQL _is_ free software. > I agree that raw file systems are seldom useful with one caveat, more advanced file systems are sometimes detrimental to database access. Conceptually, a file system and a database are redundant, both are doing their best to preserve data integrity. This is especially true with journalling file systems. Not to mention technologies like reiserfs which attempts to do sub-block allocation. What I think would go a long way to improving database performance on non-raw partitions would be a simplified file system -- SFS anyone? The simplified file system would not track access time. It would not overly try to manage disk space. The target applications are going to allocate disk space on a block level, rather than quibble about 4K here or 8K here, have a user defined standard allocation unit of 64K, 128K, or so on. Reduction on allocation overhead also reduces meta-data updating I/O. I can almost imagine 32BIT FAT with large clusers, only with real inodes. The idea would be that a database, like PostgreSQL, would be managing the data not the file system. The file systems job would only to be the most minimalist interface to the OS. The benefts would be awesome, near-raw partition access and standard OS tools for maintainence.