Thread: Tablespaces

Tablespaces

From

Gavin Sherry

Date:

26 February 2004, 06:08:12

Hi all,

I've been looking at implementing table spaces for 7.5. Some notes and
implementation details follow.

------

Type of table space:

There are many different table space implementations in relational
database management systems. In my implementation, a table space in
PostgreSQL will be the location of a directory on the file system in
which files backing database objects can be stored. Global tables and
non 'data' objects (WAL, CLOG, config files) will all remain in $PGDATA.
$PGDATA/base will be the default table space.

A given table space will be identified by a unique table space name. I
haven't decided if 'unique' should mean database-wide unique or
cross-database unique. It seems to me that we might run into problems
with CREATE DATABASE ... TEMPLATE = <database with table spaces> if the
uniqueness of table spaces is limited to the database level.

A table space parameter will be added to DDL commands which create
physical database objects (CREATE DATABASE/INDEX/TABLE/SEQUENCE) and to
CREATE SCHEMA. The associated routines, as well as the corresponding DROP
commands will need to be updated. Adding the ability to ALTER <object>
TABLESPACE <name> seems a little painful. Would people use it? Comments?

When an object is created the system will resolve the table space the
object is stored in as follows: if the table space paramater is passed to
the DDL command, then the object is stored in that table space (given
validation of the table space, etc). If it is not passed, the object
inherits its "parent's" table space where the parent/child hierarchy is as
follows: database > schema > table > [index|sequence]. So, if you issued:
create table foo.bar (...);

We would first not that there is no TABLESPACE <name>, then cascade to
the table space for the schema 'foo' (and possibly cascade to the table
space for the database). A database which wasn't created with an explicit
table space will be created under the default table space. This ensures
backward compatibility.


Creating a table space:

A table space is a directory structure. The directory structure is as
follows:

[swm@dev /path/to/tblspc]$ ls
OID1/    OID2/

OID1 and OID2 are the OIDs of databases which have created a table space
against this file system location. In this respect, a table space
resembles $PGDATA/base. I thought it useful to keep this kind of
namespace mechanism in place so that administrators do not need to create
hierarchies of names on different partitions if they want multiple
databases to use the same partition.

The actual creation of the table space will be done with:
CREATE TABLE SPACE <name> LOCATION </path/to/tblspc>;

Before creating the table space we must:

1) Check if the directory exists. If it does, create a sub directory as
the OID of the current database.

2) Alternatively, if the directory doesn't exist, attempt to create it,
then the sub directory.

I wonder if a file, such as PG_TBLSPC, should be added to the table space
directory so that, in the case of an existing non-empty directory, we can
attempt to test if the directory is being used for something else and
error out. Seems like:

CREATE TABLESPACE tbl1 LOCATION '/var/'

which will result in something like '/var/123443' is a bad idea. Then
again, the user should know better. Comments?

If everything goes well, we add an entry to pg_tablespace with the table
space location and name (and and OID).


Tying it all together:

The catalogs pg_database, pg_namespace, and pg_class will have a 'tblspc'
field. This will be the OID of the table space the object resides in, or 0
(default table space). Since we can then resolve relid/relname, schema and
database to a tablespace, there aren't too many cases when extra logic
needs to be added to the IO framework. In fact, most of it is taken care
of because of the abstraction of relpath().

The creation of table spaces will need to be recorded in xlog in the same
way that files are in heap_create() with the corresponding delete logic
incase of ABORT.


Postmaster startup:

Ideally, the postmaster at startup should go into each tblspc/databaseoid
directory and check for a postmaster.pid file to see if some other
instance is touching the files we're interested in. This will require a
control file listing tblspc/databaseoid paths and it will need to plug
into WAL in case we die during CREATE TABLESPACE. Comments?


Creating a database

I think that createdb() is going to have to be reworked if pg_tablespace
isn't shared (ie, tablespaces are only database unique). The reason being
that if we create a database which has a table space, pg_tablespace in the
new database will have to be updated and that cannot be done atomically
with the `cp` based mechanism we currently use.

I think I'm going to have to get my hands dirty before I can tell the
extent to which createdb() will need reworking.


pg_dump

Obviously pg_dump will need to be able to dump table spaces. pg_dump
running against <7.5 will DDL commands without a table space parameter and
as such the database's physical layout, when loaded into 7.5, will be the
same as for <7.5.

---

Comments? Questions? Suggestions?

Thanks,

Gavin

Re: Tablespaces

From

Shridhar Daithankar

Date:

26 February 2004, 06:39:35

On Thursday 26 February 2004 15:37, Gavin Sherry wrote:
> Tying it all together:
>
> The catalogs pg_database, pg_namespace, and pg_class will have a 'tblspc'
> field. This will be the OID of the table space the object resides in, or 0
> (default table space). Since we can then resolve relid/relname, schema and
> database to a tablespace, there aren't too many cases when extra logic
> needs to be added to the IO framework. In fact, most of it is taken care
> of because of the abstraction of relpath().
>
> The creation of table spaces will need to be recorded in xlog in the same
> way that files are in heap_create() with the corresponding delete logic
> incase of ABORT.

Is tablespace some sort of copyrighted? Last I remembered, the discussion was 
about location/storage
Just a thought..
Shridhar

Re: Tablespaces

From

Hans-Jürgen Schönig

Date:

26 February 2004, 07:24:07

Gavin Sherry wrote:
> Hi all,
> 
> I've been looking at implementing table spaces for 7.5. Some notes and
> implementation details follow.
> 
> ------
> 
> Type of table space:
> 
> There are many different table space implementations in relational
> database management systems. In my implementation, a table space in
> PostgreSQL will be the location of a directory on the file system in
> which files backing database objects can be stored. Global tables and
> non 'data' objects (WAL, CLOG, config files) will all remain in $PGDATA.
> $PGDATA/base will be the default table space.

Is it possible to put WALs and CLOGs into different tablespaces? (maybe 
different RAID systems). Some companies want that ...


> A given table space will be identified by a unique table space name. I
> haven't decided if 'unique' should mean database-wide unique or
> cross-database unique. It seems to me that we might run into problems
> with CREATE DATABASE ... TEMPLATE = <database with table spaces> if the
> uniqueness of table spaces is limited to the database level.


I strongly vote for database cluster wide unique names because somebody 
could have a tablespace "webusers" or something like that. To me this 
makes far more sense.


> A table space parameter will be added to DDL commands which create
> physical database objects (CREATE DATABASE/INDEX/TABLE/SEQUENCE) and to
> CREATE SCHEMA. The associated routines, as well as the corresponding DROP
> commands will need to be updated. Adding the ability to ALTER <object>
> TABLESPACE <name> seems a little painful. Would people use it? Comments?

I think people won't need it in first place because this seems to be 
really painful.
What really matters is that the number of tablespaces and file / 
tablespace is unlimited. SAP DB has limited the number of devspaces to 
32 (I think). This is real bull.... because if your database grows 
unexpectedly you are in deep trouble (expert database design by SAP, 
MySQL and 100000....0000 others).


> When an object is created the system will resolve the table space the
> object is stored in as follows: if the table space paramater is passed to
> the DDL command, then the object is stored in that table space (given
> validation of the table space, etc). If it is not passed, the object
> inherits its "parent's" table space where the parent/child hierarchy is as
> follows: database > schema > table > [index|sequence]. So, if you issued:
> 
>     create table foo.bar (...);
> 
> We would first not that there is no TABLESPACE <name>, then cascade to
> the table space for the schema 'foo' (and possibly cascade to the table
> space for the database). A database which wasn't created with an explicit
> table space will be created under the default table space. This ensures
> backward compatibility.

Will users automatically be assigned to a certain table space? How is 
this going to work?

> Creating a table space:
> 
> A table space is a directory structure. The directory structure is as
> follows:
> 
> [swm@dev /path/to/tblspc]$ ls
> OID1/    OID2/
> 
> OID1 and OID2 are the OIDs of databases which have created a table space
> against this file system location. In this respect, a table space
> resembles $PGDATA/base. I thought it useful to keep this kind of
> namespace mechanism in place so that administrators do not need to create
> hierarchies of names on different partitions if they want multiple
> databases to use the same partition.
> 
> The actual creation of the table space will be done with:
> 
>     CREATE TABLE SPACE <name> LOCATION </path/to/tblspc>;
> 
> Before creating the table space we must:
> 
> 1) Check if the directory exists. If it does, create a sub directory as
> the OID of the current database.
> 
> 2) Alternatively, if the directory doesn't exist, attempt to create it,
> then the sub directory.
> 
> I wonder if a file, such as PG_TBLSPC, should be added to the table space
> directory so that, in the case of an existing non-empty directory, we can
> attempt to test if the directory is being used for something else and
> error out. Seems like:
> 
> CREATE TABLESPACE tbl1 LOCATION '/var/'
> 
> which will result in something like '/var/123443' is a bad idea. Then
> again, the user should know better. Comments?
> 
> If everything goes well, we add an entry to pg_tablespace with the table
> space location and name (and and OID).
> 
> 
> Tying it all together:
> 
> The catalogs pg_database, pg_namespace, and pg_class will have a 'tblspc'
> field. This will be the OID of the table space the object resides in, or 0
> (default table space). Since we can then resolve relid/relname, schema and
> database to a tablespace, there aren't too many cases when extra logic
> needs to be added to the IO framework. In fact, most of it is taken care
> of because of the abstraction of relpath().
> 
> The creation of table spaces will need to be recorded in xlog in the same
> way that files are in heap_create() with the corresponding delete logic
> incase of ABORT.
> 
> 
> Postmaster startup:
> 
> Ideally, the postmaster at startup should go into each tblspc/databaseoid
> directory and check for a postmaster.pid file to see if some other
> instance is touching the files we're interested in. This will require a
> control file listing tblspc/databaseoid paths and it will need to plug
> into WAL in case we die during CREATE TABLESPACE. Comments?
> 
> 
> Creating a database
> 
> I think that createdb() is going to have to be reworked if pg_tablespace
> isn't shared (ie, tablespaces are only database unique). The reason being
> that if we create a database which has a table space, pg_tablespace in the
> new database will have to be updated and that cannot be done atomically
> with the `cp` based mechanism we currently use.
> 
> I think I'm going to have to get my hands dirty before I can tell the
> extent to which createdb() will need reworking.
> 
> 
> pg_dump
> 
> Obviously pg_dump will need to be able to dump table spaces. pg_dump
> running against <7.5 will DDL commands without a table space parameter and
> as such the database's physical layout, when loaded into 7.5, will be the
> same as for <7.5.
> 
> ---
> 
> Comments? Questions? Suggestions?
> 
> Thanks,
> 
> Gavin


Do you plan support for limiting the size of a tablespace? ISPs will 
vote for that because they can limit the size of a database on the 
database level rather than on the operating system level.  Of course 
this can and (should???) be done on the operation system level but 
people will definitely ask for that.
If sizing is not supported we should definitely provide minor 
documentation which tells people how to do that on the operating system 
level (at least poting to some useful information).
Best regards,
    Hans


-- 
Cybertec Geschwinde u Schoenig
Schoengrabern 134, A-2020 Hollabrunn, Austria
Tel: +43/2952/30706 or +43/664/233 90 75
www.cybertec.at, www.postgresql.at, kernel.cybertec.at

Re: Tablespaces

From

Richard Huxton

Date:

26 February 2004, 07:25:35

On Thursday 26 February 2004 10:07, Gavin Sherry wrote:
>
> CREATE TABLESPACE tbl1 LOCATION '/var/'
>
> which will result in something like '/var/123443' is a bad idea. Then
> again, the user should know better. Comments?

The LOCATION should have the same owner and permissions as $PGDATA - that 
should catch mistyping.

Unless you're running as root, of course. In which case you clearly know 
better than everyone else, so off you go!

> Comments? Questions? Suggestions?

Presumably I'm using this to deal with performance/space issues, so there 
clearly needs to be something in the pg_stat_xxx system to show figures based 
on tablespace - not sure what you'd measure though - disk I/O, number of 
nodes?

Perhaps something in contrib/ too - tablespace_disk_usage.pl or some such.
--  Richard Huxton Archonet Ltd

Re: Tablespaces

From

Gavin Sherry

Date:

26 February 2004, 07:58:31

> Is it possible to put WALs and CLOGs into different tablespaces? (maybe
> different RAID systems). Some companies want that ...

I wasn't going to look at that just yet.

There is of course the temporary hack of symlinking WAL else where.

I'd be interested to see the performance difference between WAL and data
on the same RAID/controller and WAL and data on different RAID/controller
with Jan's improvements to the buffer management.

Gavin

Re: Tablespaces

From

Dennis Bjorklund

Date:

26 February 2004, 08:12:44

On Thu, 26 Feb 2004, Gavin Sherry wrote:

> Comments? Questions? Suggestions?

Is that plan that in the future one can split a single table into 
different table spaces? Like storing all rows with year < 1999 in one 
tablespace and the rest in another?

With the rule system and two underlying tables one could make it work by 
hand I think.

I've never used tablespaces in oracle so I don't know what it can offer. I 
though it could do things like the above. True? What is the syntax and 
for example, how does it effect indexes (not at all maybe).

If you don't want to discuss this now, I understand. It's not part of the
design as it is now. I'm just curious at what direction we are moving and
what is possible to do.

-- 
/Dennis Björklund

Re: Tablespaces

From

Gavin Sherry

Date:

26 February 2004, 08:22:33

On Thu, 26 Feb 2004, Dennis Bjorklund wrote:

> On Thu, 26 Feb 2004, Gavin Sherry wrote:
>
> > Comments? Questions? Suggestions?
>
> Is that plan that in the future one can split a single table into
> different table spaces? Like storing all rows with year < 1999 in one
> tablespace and the rest in another?

These are called partitions in oracle. You can approximate this with table
spaces by using a partial index and putting it in a different table space.
The problem, of course, is seq scans.

>
> With the rule system and two underlying tables one could make it work by
> hand I think.
>
> I've never used tablespaces in oracle so I don't know what it can offer. I

Certainly, table spaces are used in many ways in oracle, db2, etc. You can
mirror data across them, have different buffer sizes for example.
In some implementations, they can be raw disk partitions (no file system).
I don't intend going this far, however.

> If you don't want to discuss this now, I understand. It's not part of the
> design as it is now. I'm just curious at what direction we are moving and
> what is possible to do.

Well, partitions are something else entirely. Mirroring would be
interesting, but RAID designers are better at parallelisation of IO than
(some) database developers. Might be better to keep the problem seperate.

Gavin

Re: Tablespaces

From

"Joshua D. Drake"

Date:

26 February 2004, 08:25:34

> Is it possible to put WALs and CLOGs into different tablespaces? (maybe 
> different RAID systems). Some companies want that ...

You can do this now, but it would be nice to be able to have it actually 
configurable versus the hacked linked method.


J



> 

-- 
Co-Founder
Command Prompt, Inc.
The wheel's spinning but the hamster's dead

Re: Tablespaces

From

Hans-Jürgen Schönig

Date:

26 February 2004, 08:32:40

Gavin Sherry wrote:
>>Is it possible to put WALs and CLOGs into different tablespaces? (maybe
>>different RAID systems). Some companies want that ...
> 
> 
> I wasn't going to look at that just yet.
> 
> There is of course the temporary hack of symlinking WAL else where.

that's what we do now.
we symlink databases and wals ...

> I'd be interested to see the performance difference between WAL and data
> on the same RAID/controller and WAL and data on different RAID/controller
> with Jan's improvements to the buffer management.
> 
> Gavin

yes, that's what i am looking for. i should do some testing.

in case of enough i/o power additional cpus scale almost linearily 
(depending on the application of course; i have done some testing on a 
customer's aix box ...).
it would be interesting to see what jan's buffer strategy does (and bg 
writer) ...

-- 
Cybertec Geschwinde u Schoenig
Schoengrabern 134, A-2020 Hollabrunn, Austria
Tel: +43/2952/30706 or +43/664/233 90 75
www.cybertec.at, www.postgresql.at, kernel.cybertec.at

Re: Tablespaces

From

"Alex J. Avriette"

Date:

26 February 2004, 12:10:24

On Thu, Feb 26, 2004 at 11:22:28PM +1100, Gavin Sherry wrote:

> Certainly, table spaces are used in many ways in oracle, db2, etc. You can
> mirror data across them, have different buffer sizes for example.
> In some implementations, they can be raw disk partitions (no file system).
> I don't intend going this far, however.

Perhaps now would be a good time to bring up my directio on Solaris question
from a year or so back? Is there any interest in the ability to use raw
disk?

Alex (who is overjoyed to hear discussion of tablespaces again)

--
alex@posixnap.net
Alex J. Avriette, Unix Systems Gladiator
Sep 25 12:52:39 buggle /bsd: wsdisplay0 at vga1: removing /dev/radio/*

Re: Tablespaces

From

Tom Lane

Date:

26 February 2004, 12:46:25

Gavin Sherry <swm@linuxworld.com.au> writes:
> A table space is a directory structure. The directory structure is as
> follows:
> [swm@dev /path/to/tblspc]$ ls
> OID1/    OID2/
> OID1 and OID2 are the OIDs of databases which have created a table space
> against this file system location. In this respect, a table space
> resembles $PGDATA/base. I thought it useful to keep this kind of
> namespace mechanism in place ...

Actually, this is *necessary* AFAICT.  The case that forces it is DROP
DATABASE.  Since you have to execute that from another database, there's
no reasonable way to look into the target database's catalogs.  That
means that the OID of the database has to be sufficient information to
get rid of all its files.  You can do this fairly easily if in each
tablespace (whose locations you know from the shared pg_tablespace
table) you can look for a subdirectory matching the target database's
OID.  If we tried to put the database's files just "loose" in each
tablespace directory then we'd be in trouble.

I think this is also an implementation reason for favoring cluster-wide
tablespaces over database-local ones.  I'm not sure how you drop a
database from outside if you can't see where its tablespaces are.

I believe that it will be necessary to expand RelFileNode to three OIDs
(tablespace, database, relation).  I had once hoped that it could be
kept at two (tablespace, relation) but with a physical layout like this
you more or less have to have three.

One issue that needs to be agreed on early is how the low-level file
access code finds a tablespace.  What I would personally like is for
$PGDATA to contain symlinks to the tablespace top directories.  The
actual access path for any relation could then be built trivially from
its RelFileNode:$PGDATA/tablespaces/TBOID/DBOID/RELFILENODE       -------------------------
The underlined part references a symlink that leads to the directory
containing the per-database subdirectories.

I am expecting to hear some bleating about this from people whose
preferred platforms don't support symlinks ;-).  However, if we don't
rely on symlinks for this then the tablespace-OID-to-physical-path
mapping has to be explicitly known at very low levels of the system
(md.c in particular).  We can't expect md.c to get that information by
reading pg_tablespace.  It would have to rely on some backdoor path,
such as a flat text file it could read at backend startup.  I think
this approach will leave us fighting a lot of problems with locking
and out-of-date information.

Speaking of locking, can we do anything to prevent people from shooting
themselves in the foot by changing active tablespaces?  Are we even
going to have a DROP TABLESPACE command, and if so what would it do?
        regards, tom lane

Re: Tablespaces

From

Barry Lind

Date:

26 February 2004, 17:09:08

Gavin,

After creating a tablespace what (if any) changes can be done to it. 
Can you DROP a tablespace, or once created will it always exist?  Can 
you RENAME a tablespace?  Can you change the location of a tablespace 
(i.e you did a disk reorg and move the contents to a different location 
and now want to point to the new location)?  What are the permissions 
necessary to create a tablespace (can any use connected to the database 
create a tablespace, or only superuser, or ...)?

Overall this will be a great addition to postgres.  I am looking forward 
to this feature.

thanks,
--Barry


Gavin Sherry wrote:
> Hi all,
> 
> I've been looking at implementing table spaces for 7.5. Some notes and
> implementation details follow.
> 
> ------
> 
> Type of table space:
> 
> There are many different table space implementations in relational
> database management systems. In my implementation, a table space in
> PostgreSQL will be the location of a directory on the file system in
> which files backing database objects can be stored. Global tables and
> non 'data' objects (WAL, CLOG, config files) will all remain in $PGDATA.
> $PGDATA/base will be the default table space.
> 
> A given table space will be identified by a unique table space name. I
> haven't decided if 'unique' should mean database-wide unique or
> cross-database unique. It seems to me that we might run into problems
> with CREATE DATABASE ... TEMPLATE = <database with table spaces> if the
> uniqueness of table spaces is limited to the database level.
> 
> A table space parameter will be added to DDL commands which create
> physical database objects (CREATE DATABASE/INDEX/TABLE/SEQUENCE) and to
> CREATE SCHEMA. The associated routines, as well as the corresponding DROP
> commands will need to be updated. Adding the ability to ALTER <object>
> TABLESPACE <name> seems a little painful. Would people use it? Comments?
> 
> When an object is created the system will resolve the table space the
> object is stored in as follows: if the table space paramater is passed to
> the DDL command, then the object is stored in that table space (given
> validation of the table space, etc). If it is not passed, the object
> inherits its "parent's" table space where the parent/child hierarchy is as
> follows: database > schema > table > [index|sequence]. So, if you issued:
> 
>     create table foo.bar (...);
> 
> We would first not that there is no TABLESPACE <name>, then cascade to
> the table space for the schema 'foo' (and possibly cascade to the table
> space for the database). A database which wasn't created with an explicit
> table space will be created under the default table space. This ensures
> backward compatibility.
> 
> 
> Creating a table space:
> 
> A table space is a directory structure. The directory structure is as
> follows:
> 
> [swm@dev /path/to/tblspc]$ ls
> OID1/    OID2/
> 
> OID1 and OID2 are the OIDs of databases which have created a table space
> against this file system location. In this respect, a table space
> resembles $PGDATA/base. I thought it useful to keep this kind of
> namespace mechanism in place so that administrators do not need to create
> hierarchies of names on different partitions if they want multiple
> databases to use the same partition.
> 
> The actual creation of the table space will be done with:
> 
>     CREATE TABLE SPACE <name> LOCATION </path/to/tblspc>;
> 
> Before creating the table space we must:
> 
> 1) Check if the directory exists. If it does, create a sub directory as
> the OID of the current database.
> 
> 2) Alternatively, if the directory doesn't exist, attempt to create it,
> then the sub directory.
> 
> I wonder if a file, such as PG_TBLSPC, should be added to the table space
> directory so that, in the case of an existing non-empty directory, we can
> attempt to test if the directory is being used for something else and
> error out. Seems like:
> 
> CREATE TABLESPACE tbl1 LOCATION '/var/'
> 
> which will result in something like '/var/123443' is a bad idea. Then
> again, the user should know better. Comments?
> 
> If everything goes well, we add an entry to pg_tablespace with the table
> space location and name (and and OID).
> 
> 
> Tying it all together:
> 
> The catalogs pg_database, pg_namespace, and pg_class will have a 'tblspc'
> field. This will be the OID of the table space the object resides in, or 0
> (default table space). Since we can then resolve relid/relname, schema and
> database to a tablespace, there aren't too many cases when extra logic
> needs to be added to the IO framework. In fact, most of it is taken care
> of because of the abstraction of relpath().
> 
> The creation of table spaces will need to be recorded in xlog in the same
> way that files are in heap_create() with the corresponding delete logic
> incase of ABORT.
> 
> 
> Postmaster startup:
> 
> Ideally, the postmaster at startup should go into each tblspc/databaseoid
> directory and check for a postmaster.pid file to see if some other
> instance is touching the files we're interested in. This will require a
> control file listing tblspc/databaseoid paths and it will need to plug
> into WAL in case we die during CREATE TABLESPACE. Comments?
> 
> 
> Creating a database
> 
> I think that createdb() is going to have to be reworked if pg_tablespace
> isn't shared (ie, tablespaces are only database unique). The reason being
> that if we create a database which has a table space, pg_tablespace in the
> new database will have to be updated and that cannot be done atomically
> with the `cp` based mechanism we currently use.
> 
> I think I'm going to have to get my hands dirty before I can tell the
> extent to which createdb() will need reworking.
> 
> 
> pg_dump
> 
> Obviously pg_dump will need to be able to dump table spaces. pg_dump
> running against <7.5 will DDL commands without a table space parameter and
> as such the database's physical layout, when loaded into 7.5, will be the
> same as for <7.5.
> 
> ---
> 
> Comments? Questions? Suggestions?
> 
> Thanks,
> 
> Gavin
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 7: don't forget to increase your free space map settings

Re: Tablespaces

From

Gavin Sherry

Date:

26 February 2004, 17:22:53

On Thu, 26 Feb 2004, Alex J. Avriette wrote:

> On Thu, Feb 26, 2004 at 11:22:28PM +1100, Gavin Sherry wrote:
>
> > Certainly, table spaces are used in many ways in oracle, db2, etc. You can
> > mirror data across them, have different buffer sizes for example.
> > In some implementations, they can be raw disk partitions (no file system).
> > I don't intend going this far, however.
>
> Perhaps now would be a good time to bring up my directio on Solaris question
> from a year or so back? Is there any interest in the ability to use raw
> disk?

I do not intend to undertake raw disk tablespaces for 7.5. I'd be
interested if anyone could provide some real world benchmarking of file
system vs. raw disk. Postgres benefits a lot from kernel file system cache
at the moment. Also, I believe that database designers have traditionally
made bad file system designers. Raw database partitions often lack the
tools essential to a scalable environment. For example, the ability to
resize partitions.

Gavin

Re: Tablespaces

From

Gavin Sherry

Date:

26 February 2004, 17:30:50

On Thu, 26 Feb 2004, Tom Lane wrote:

> Gavin Sherry <swm@linuxworld.com.au> writes:
> > A table space is a directory structure. The directory structure is as
> > follows:
> > [swm@dev /path/to/tblspc]$ ls
> > OID1/    OID2/
> > OID1 and OID2 are the OIDs of databases which have created a table space
> > against this file system location. In this respect, a table space
> > resembles $PGDATA/base. I thought it useful to keep this kind of
> > namespace mechanism in place ...
>
> Actually, this is *necessary* AFAICT.  The case that forces it is DROP
> DATABASE.  Since you have to execute that from another database, there's
> no reasonable way to look into the target database's catalogs.  That
> means that the OID of the database has to be sufficient information to
> get rid of all its files.  You can do this fairly easily if in each
> tablespace (whose locations you know from the shared pg_tablespace
> table) you can look for a subdirectory matching the target database's
> OID.  If we tried to put the database's files just "loose" in each
> tablespace directory then we'd be in trouble.

Ahhh. Yes.

>
> I think this is also an implementation reason for favoring cluster-wide
> tablespaces over database-local ones.  I'm not sure how you drop a
> database from outside if you can't see where its tablespaces are.

Naturally.

>
> I believe that it will be necessary to expand RelFileNode to three OIDs
> (tablespace, database, relation).  I had once hoped that it could be
> kept at two (tablespace, relation) but with a physical layout like this
> you more or less have to have three.

Yes. I agree.

>
> One issue that needs to be agreed on early is how the low-level file
> access code finds a tablespace.  What I would personally like is for
> $PGDATA to contain symlinks to the tablespace top directories.  The
> actual access path for any relation could then be built trivially from
> its RelFileNode:
>     $PGDATA/tablespaces/TBOID/DBOID/RELFILENODE
>         -------------------------
> The underlined part references a symlink that leads to the directory
> containing the per-database subdirectories.
>
> I am expecting to hear some bleating about this from people whose
> preferred platforms don't support symlinks ;-).  However, if we don't

Actually, I think that's a pretty good idea :-). I'd solves a bunch of
issues in the backend (postmaster start up can recurse through
$PGDATA/tablespaces looking for postmaster.pid files) and will also assist
admins with complex configurations (perhaps).

> Speaking of locking, can we do anything to prevent people from shooting
> themselves in the foot by changing active tablespaces?  Are we even
> going to have a DROP TABLESPACE command, and if so what would it do?

Ahh. I forgot to detail my ideas on this. It seems to me that we cannot
drop a table space until the directory is empty. We will need a shared
invalidation message so that backends do not attempt to create an object
just after we drop the table space.

Thanks,

Gavin

Re: Tablespaces

From

James Rogers

Date:

26 February 2004, 17:41:56

On Thu, 2004-02-26 at 13:22, Gavin Sherry wrote:
> Postgres benefits a lot from kernel file system cache
> at the moment.

With the implementation of much smarter and more adaptive cache
replacement algorithm i.e. ARC, I would expect the benefit of using the
kernel file system cache to diminish significantly.  It appears to me,
and I could be wrong, that the reason Postgres has depended on the
kernel file system cache isn't that this is obviously better in some
absolute sense (though it might be depending on the deployment
scenario), but that the original cache replacement algorithm in Postgres
was sufficiently poor that the better cache replacement algorithms in
the kernel cache more than offset any sub-optimality that might result
from doing so.

I would expect that with ARC and the redesign of some of the buffer
management bits for more scalability, you might very well get better
performance by allocating most of the memory to the buffer cache rather
than leaving it to the kernel file cache.

I'm actually fairly curious to see what the new buffer management scheme
will mean in terms of real world performance and parameter tuning.

-James Rogersjrogers@neopolitan.com

Re: Tablespaces

From

"Simon Riggs"

Date:

26 February 2004, 17:57:04

>Gavin Sherry
> The creation of table spaces will need to be recorded in xlog in the
same
> way that files are in heap_create() with the corresponding delete
logic
> incase of ABORT.

Overall, sounds very cool.

Please could we record the OID of the tablespace in the WAL logs, not
the path to the tablespace? That way, we run no risks of having the WAL
logs not work correctly should things change slightly...

.. need to record drop tablespaces in the WAL logs also. 

I'm sure you meant both of those, just checking.

Can drop tablespace require a specific privelege too? It's too easy to
drop parts of a database without thinking...
Best Regards, Simon Riggs

Re: Tablespaces

From

Tom Lane

Date:

26 February 2004, 18:05:34

Gavin Sherry <swm@linuxworld.com.au> writes:
>> Speaking of locking, can we do anything to prevent people from shooting
>> themselves in the foot by changing active tablespaces?  Are we even
>> going to have a DROP TABLESPACE command, and if so what would it do?

> Ahh. I forgot to detail my ideas on this. It seems to me that we cannot
> drop a table space until the directory is empty.

How would it get to be empty?  Are you thinking of some sort of "connect
database to tablespace" and "disconnect database from tablespace"
commands that would respectively create and delete the per-database
subdirectory?  That seems moderately reasonable to me.  We could then
invent a locking protocol that requires backends to lock a tablespace
before they can execute either of these operations (or delete the
tablespace of course).
        regards, tom lane

Re: Tablespaces

From

"Alex J. Avriette"

Date:

26 February 2004, 18:24:08

On Fri, Feb 27, 2004 at 08:22:25AM +1100, Gavin Sherry wrote:

> interested if anyone could provide some real world benchmarking of file
> system vs. raw disk. Postgres benefits a lot from kernel file system cache
> at the moment. Also, I believe that database designers have traditionally
> made bad file system designers. Raw database partitions often lack the
> tools essential to a scalable environment. For example, the ability to
> resize partitions.

The only reason I mentioned it to begin with was the recommendation of
directio for databases in the Sun Blueprint, _Tuning Databases on the
Solaris Platform_ (and being a Solaris geek, I asked, but apparently
nobody else is worried enough about performance or not using Solaris
enough to care).

It's not critical, of course. I think, however, that many of us would
like to see some of the features of Oracle and DB2 available to users
of postgres.  Some of these features are raw disk, tablespaces, and
replication. We're getting there, and making really terrific progress
(I fully expect replication to be ready for primetime in the 8-12 mos
timeframe), but we're not quite there yet.

As I said, I'm very glad to hear tablespaces mentioned again and see
what looks like work being done on it.

Thanks!
Alex

--
alex@posixnap.net
Alex J. Avriette, Solaris Artillery Officer
"Among the many misdeeds of the British rule in India, history will look upon the act of depriving a whole nation of
arms,as the blackest." - Mahatma Gandhi

Re: Tablespaces

From

Tom Lane

Date:

26 February 2004, 18:43:05

James Rogers <jrogers@neopolitan.com> writes:
> With the implementation of much smarter and more adaptive cache
> replacement algorithm i.e. ARC, I would expect the benefit of using the
> kernel file system cache to diminish significantly.  It appears to me,
> and I could be wrong, that the reason Postgres has depended on the
> kernel file system cache isn't that this is obviously better in some
> absolute sense (though it might be depending on the deployment
> scenario), but that the original cache replacement algorithm in Postgres
> was sufficiently poor that the better cache replacement algorithms in
> the kernel cache more than offset any sub-optimality that might result
> from doing so.

The question of optimality of replacement algorithm is only one of the
arguments for using a small buffer cache.  IMHO a considerably stronger
argument is that the kernel's memory management is more flexible: it can
use that memory for either disk cache or program workspace, and it can
change the allocation on-the-fly as load demands.  If you dedicate most
of RAM to Postgres buffers then you are likely to be wasting RAM or
swapping heavily.  Possibly both :-(

Another gotcha is that unless the OS allows you to lock shared memory
into RAM, the shared buffers themselves could get swapped out, which is
a no-win scenario by any measure.  Keeping the shared buffer arena small
helps prevent that by ensuring all the buffers are "hot".

Of course, this is all speculation until we get some real-world
experience with ARC.  But I don't expect it to be a magic bullet.
        regards, tom lane

Re: Tablespaces

From

Greg Stark

Date:

27 February 2004, 01:26:12

> > I am expecting to hear some bleating about this from people whose
> > preferred platforms don't support symlinks ;-).  However, if we don't

Well, one option would be to have the low level filesystem storage (md.c?)
routines implement a kind of symlink themselves. Just a file with a special
magic number followed by a path.

I'm normally against reimplementing OS services but symlinks are really a very
simple concept and simple to implement. Especially if you can make a few
simplifying assumptions: they only ever need to appear as the final path
element not as parent directories and tablespaces don't need symlinks pointing
to symlinks. Ideally postgres also doesn't need to implement relative links
either.

-- 
greg

Re: Tablespaces

From

"Zeugswetter Andreas SB SD"

Date:

27 February 2004, 05:00:40

> > Ahh. I forgot to detail my ideas on this. It seems to me that we cannot
> > drop a table space until the directory is empty.

Agreed.

>
> How would it get to be empty?  Are you thinking of some sort of "connect
> database to tablespace" and "disconnect database from tablespace"
> commands that would respectively create and delete the per-database
> subdirectory?  That seems moderately reasonable to me.  We could then

I would only allow the drop if the directory only contains empty db oid
directories.

Andreas

Re: Tablespaces

From

"Zeugswetter Andreas SB SD"

Date:

27 February 2004, 07:39:17

> I do not intend to undertake raw disk tablespaces for 7.5. I'd be
> interested if anyone could provide some real world benchmarking of file
> system vs. raw disk. Postgres benefits a lot from kernel file system cache
> at the moment.

Yes, and don't forget that pg also relys on the OS for grouping and
sorting the physical writes and doing readahead where appropriate.

The use of raw disks is usually paired with the use of kernel aio.
The difference is said to be up to 30% on Solaris. I can assert, that
it made the difference between a bogged down system and a much better behaved
DB on Sun here.

My experience with kaio on AIX Informix is, that kaio is faster as long as IO
is not the bottleneck (disk 100% busy is the metric to watch, not Mb/s), while
for an IO bound system the Informix builtin IO threads that can be used instead
win. (Since they obviously do better at grouping, sorting and readahead
than the AIX kernel does for kaio)

Overall I think the price and komplexity is too high, especially since there are
enough platforms where the kernel does a pretty good job at grouping, sorting and
readahead. Additionally the kernel takes non PostgreSQL IO into account.

Andreas

Re: Tablespaces

From

"scott.marlowe"

Date:

27 February 2004, 11:37:39

On Fri, 27 Feb 2004, Zeugswetter Andreas SB SD wrote:

> 
> > > Ahh. I forgot to detail my ideas on this. It seems to me that we cannot
> > > drop a table space until the directory is empty.
> 
> Agreed.
> 
> > 
> > How would it get to be empty?  Are you thinking of some sort of "connect
> > database to tablespace" and "disconnect database from tablespace"
> > commands that would respectively create and delete the per-database
> > subdirectory?  That seems moderately reasonable to me.  We could then
> 
> I would only allow the drop if the directory only contains empty db oid 
> directories.

Wouldn't this be better tracked in the dependency tracking that's already 
built into postgresql?  Checking to see if the directory is empty is open 
to race conditions, but locking the dependency tracking while dropping a 
tablespace isn't.

Re: Tablespaces

From

"scott.marlowe"

Date:

27 February 2004, 11:38:11

On Fri, 27 Feb 2004, Gavin Sherry wrote:

> On Thu, 26 Feb 2004, Alex J. Avriette wrote:
> 
> > On Thu, Feb 26, 2004 at 11:22:28PM +1100, Gavin Sherry wrote:
> >
> > > Certainly, table spaces are used in many ways in oracle, db2, etc. You can
> > > mirror data across them, have different buffer sizes for example.
> > > In some implementations, they can be raw disk partitions (no file system).
> > > I don't intend going this far, however.
> >
> > Perhaps now would be a good time to bring up my directio on Solaris question
> > from a year or so back? Is there any interest in the ability to use raw
> > disk?
> 
> I do not intend to undertake raw disk tablespaces for 7.5. I'd be
> interested if anyone could provide some real world benchmarking of file
> system vs. raw disk. Postgres benefits a lot from kernel file system cache
> at the moment. Also, I believe that database designers have traditionally
> made bad file system designers. Raw database partitions often lack the
> tools essential to a scalable environment. For example, the ability to
> resize partitions.

Is possible / reasonable / smart and or dumb to look at implementing the 
tablespaces as riding atop the initlocation handled stuff.  I.e. 
postgresql can only create tablespaces in areas that are created by 
initlocation, thus keeping it in its box, so to speak?

Re: Tablespaces

From

Tom Lane

Date:

27 February 2004, 12:44:21

"scott.marlowe" <scott.marlowe@ihs.com> writes:
> On Fri, 27 Feb 2004, Zeugswetter Andreas SB SD wrote:
>>> How would it get to be empty?  Are you thinking of some sort of "connect
>>> database to tablespace" and "disconnect database from tablespace"
>>> commands that would respectively create and delete the per-database
>>> subdirectory?  That seems moderately reasonable to me.  We could then
>> 
>> I would only allow the drop if the directory only contains empty db oid 
>> directories.

That's subject to race conditions (ie, someone creating a table about
the same time you are deciding it's okay to drop the tablespace).  There
needs to be some interlock, and I think that associating that lock with
infrequently executed connect/disconnect operations would be good from a
performance standpoint.

> Wouldn't this be better tracked in the dependency tracking that's already 
> built into postgresql?

No, because dependencies are local to individual databases.
        regards, tom lane

Re: Tablespaces

From

Tom Lane

Date:

27 February 2004, 13:42:33

"scott.marlowe" <scott.marlowe@ihs.com> writes:
> Is possible / reasonable / smart and or dumb to look at implementing the 
> tablespaces as riding atop the initlocation handled stuff.

In my mind, one of the main benefits of this work will be that we'll be
able to get *rid* of the initlocation stuff.  It's a crock.
        regards, tom lane

Re: Tablespaces

From

"scott.marlowe"

Date:

27 February 2004, 14:07:26

On Fri, 27 Feb 2004, Tom Lane wrote:

> "scott.marlowe" <scott.marlowe@ihs.com> writes:
> > Is possible / reasonable / smart and or dumb to look at implementing the 
> > tablespaces as riding atop the initlocation handled stuff.
> 
> In my mind, one of the main benefits of this work will be that we'll be
> able to get *rid* of the initlocation stuff.  It's a crock.

OK, that's fine, but I keep thinking that a superuser should have to 
create the tablespace itself, and then tables can be assigned by users 
based on the rights assigned by the dba / superuser.  Is that how we're 
looking at doing it, or will any user be able to create a tablespace 
anywhere postgresql has write permission, or will only dbas be able to 
create AND use table spaces. I'm just not sure how that's gonna be 
handled, and haven't seen it addressed.

Re: Tablespaces

From

Tom Lane

Date:

27 February 2004, 14:24:56

"scott.marlowe" <scott.marlowe@ihs.com> writes:
> On Fri, 27 Feb 2004, Tom Lane wrote:
>> In my mind, one of the main benefits of this work will be that we'll be
>> able to get *rid* of the initlocation stuff.  It's a crock.

> OK, that's fine, but I keep thinking that a superuser should have to 
> create the tablespace itself, and then tables can be assigned by users 
> based on the rights assigned by the dba / superuser.

Yeah, we haven't yet gotten to the issue of permissions, but certainly
creating or deleting a tablespace has to be a superuser-only operation,
if only because you probably have also got some manual filesystem work
to do to set up the associated directory; and that has to be done as
root or postgres.

It might be a good idea to restrict connect/disconnect (if we use those
operations) to superusers as well.  
        regards, tom lane

Re: Tablespaces

From

tswan@idigx.com

Date:

27 February 2004, 15:35:06

>
>> I do not intend to undertake raw disk tablespaces for 7.5. I'd be
>> interested if anyone could provide some real world benchmarking of file
>> system vs. raw disk. Postgres benefits a lot from kernel file system
>> cache
>> at the moment.
>
> Yes, and don't forget that pg also relys on the OS for grouping and
> sorting the physical writes and doing readahead where appropriate.
>
>

Most people I know want tablespaces in order to limit or preallocate the
disk space used by a table or database in addition to controlling the
physical location of a table or database.

I know on linux, there is the option of creating an empty file or a
specific size using dd, mounting it through loopback, formatting it,
symlinking the appropriate OID/TID (or mounting the lpb device in the
appropriate directory) and then you control how much space that
directory/mount point can contain.

Of course, with MVCC you would have to vacuum frequently, as you could
miss some updates if there weren't enough tuples marked as free.  If there
were "in-place" updates, the preallocation and limitation much easier, but
that's not how PG works.

If the tablespace disk space allocation is exceeded there would need to be
some graceful reporting condition back to the client.  "UPDATE/INSERT
failed (tablespace size exceeded)", "(tablespace full)", "(disk full)" or
some other error may need to be handled/reported.

Re: Tablespaces

From

Gavin Sherry

Date:

27 February 2004, 20:04:06

On Fri, 27 Feb 2004 tswan@idigx.com wrote:

> >
> >> I do not intend to undertake raw disk tablespaces for 7.5. I'd be
> >> interested if anyone could provide some real world benchmarking of file
> >> system vs. raw disk. Postgres benefits a lot from kernel file system
> >> cache
> >> at the moment.
> >
> > Yes, and don't forget that pg also relys on the OS for grouping and
> > sorting the physical writes and doing readahead where appropriate.
> >
> >
>
> Most people I know want tablespaces in order to limit or preallocate the
> disk space used by a table or database in addition to controlling the
> physical location of a table or database.
>
> I know on linux, there is the option of creating an empty file or a
> specific size using dd, mounting it through loopback, formatting it,
> symlinking the appropriate OID/TID (or mounting the lpb device in the
> appropriate directory) and then you control how much space that
> directory/mount point can contain.
>
> Of course, with MVCC you would have to vacuum frequently, as you could
> miss some updates if there weren't enough tuples marked as free.  If there
> were "in-place" updates, the preallocation and limitation much easier, but
> that's not how PG works.

I do not intend to work on such a system for the initial introduction of
table spaces. The problem is, of course, knowing when you're actually out
of space in a table space in any given transaction. Given that WAL is on a
different partition (at least for the moment) the table space will not
have transaction X's data written to it until after transaction X is
finished. And we cannot error out a transaction which is already commited.

The solution is to keep track of free space and error out at some
percentage of free space remaining. But I don't want to complicate
tablespaces too much in 7.5.

Thanks,

Gavin

Re: Tablespaces

From

Tom Lane

Date:

27 February 2004, 23:49:45

Gavin Sherry <swm@linuxworld.com.au> writes:
> I do not intend to work on such a system for the initial introduction of
> table spaces. The problem is, of course, knowing when you're actually out
> of space in a table space in any given transaction.

It should not be that hard, at least not on local filesystems.  When PG
realizes that a new page must be added to a table, it does a write() to
append a page of zeroes to the physical table.  This happens
immediately.  It's true that actual data may not be written into that
section of the file till long after commit, but the kernel should do
space allocation checking upon the first write.

I have heard tell that this may not happen when you are dealing with NFS
(yet another reason not to run databases across NFS) but on all local
filesystems I know of, out-of-space should result in a failure before
transaction commit.

I say "should" because I suspect this isn't a very heavily tested code
path in Postgres.  But in theory it should work.  Feel free to submit
bug reports if you find it doesn't.
        regards, tom lane

Re: Tablespaces

From

Josh Berkus

Date:

29 February 2004, 18:28:10

Gavin,

#1:  I really think that we should have a way to set a "default tablespace"
for any database in a cluster.    This property would be vitally important
for anyone wishing to use tablespaces to impose quotas.   First, the
superuser would:
ALTER DATABASE db1 ALTER DEFAULT_TABLESPACE partition2;
then any regular users creating tables in that database would, by default,
have TABLESPACE partition2 automatically appended to them by the parser
unless overridden in the creation statement by specifying another, specific,
tablespace.

Alternately, the default tablespace could be set through a GUC.   In my mind,
this would be inferior on 2 counts:
1) It would require adding Yet Another Miscellaneos GUC Variable.
2) It would preclude large, multisuer installations from seamlessly using
tablespaces for quotas, becuase there would be no way to transparently set
the GUC differently for each user or database.


#2: Permissions:
I see the permissions issue as quite transparent.   First, I agree that only
the superuser should have the right to create, alter, or drop tablespaces.
'nuff said.
Second, as far as I can see, there is only one relevant permission for regular
users:  USE.   Either the user is permitted to create objects in that
tablespace, or he/she is not.  Other permissions, such as read access, should
NOT be set by tablespace, as such permissions are already governed by
database, table, and schema; to add a SELECT restriction to tablespaces would
frequently result in paralytic snarls of conflicting permissions on complex
installations.
Thus, by my proposal, the only GRANT for tablespaces (executed by a superuser)
would be:
GRANT USE ON tablespace1 TO user;
This permission would ONLY be accessed for CREATE/ALTER TABLE, and CREATE
INDEX statements.
Easy, neh?

#3: ALTER TABLE .... CHANGE TABLESPACE:
This is strictly in the class of "would be a very nice & useful feature if
it's not too difficult".  

Given how painful it is to drop & replace a table with multiple dependencies
(on some databases, only possible by droping & re-loading the entire
database) it would be nice to have an ALTER TABLE command that moved the
table to another tablespace.    It doesn't *seem* to me that this would be a
very challenging bit of programming, as the operation would be very similar
to REINDEX in the manipulation of files.   (But what I know, really?)

Once tablespaces are a feature and some users start using them for quota
management, there will quickly develop situations where the original
tablespace for a db runs out of room and can't be resized.   Being able to
move the table "in situ" then becomes vital, especially on very large
databases ... and when someday combined with partitioned tables, will become
essential.

Further, we will get an *immediate* flurry of requests from users who just
upgraded to 7.5 and want to make use of the tablespaces feature on an
existing production database.

ALTER INDEX ... CHANGE TABLESPACE is *not* needed, though, as there are no
issues other than time which I know of with dropping & re-creating an index.

If ALTER TABLE CHANGE TABLESPACE has some major technical hurdles, then I
think it's one of those things that could be put off until the next version
of tablespaces, or even held until Partition Tables is developed for a
combined solution.    But it would be nice to have.

--
Josh Berkus
Aglio Database Solutions
San Francisco

Re: Tablespaces

From

Andrew Sullivan

Date:

01 March 2004, 10:57:25

On Thu, Feb 26, 2004 at 05:28:41PM -0500, Alex J. Avriette wrote:

> The only reason I mentioned it to begin with was the recommendation of
> directio for databases in the Sun Blueprint, _Tuning Databases on the
> Solaris Platform_ (and being a Solaris geek, I asked, but apparently
> nobody else is worried enough about performance or not using Solaris
> enough to care).

That recommendation itself is a few years old.  While it may still be
true that directio is still fastest for Oracle on Solaris, I'd sure
like to see some recent evidence.  I've a funny feeling that this is
an old rule of thumb which is now true in the sense that everyone
believes it, but maybe not in the sense that a test would reveal it
to be a sensible rule.

> like to see some of the features of Oracle and DB2 available to users
> of postgres.  Some of these features are raw disk, tablespaces, and
> replication. We're getting there, and making really terrific progress

I don't think we want features for their own sake, though, and I'm
not convinced that raw filesystems are actually useful.  Course, it's
not my itch, and PostgreSQL _is_ free software.

A

-- 
Andrew Sullivan

Re: Tablespaces

From

"Simon Riggs"

Date:

01 March 2004, 20:28:15

>Gavin Sherry
> On Fri, 27 Feb 2004 tswan@idigx.com wrote:
> > Most people I know want tablespaces in order to limit or preallocate
the
> > disk space used by a table or database in addition to controlling
the
> > physical location of a table or database.

> I do not intend to work on such a system for the initial introduction
of
> table spaces. The problem is, of course, knowing when you're actually
out
> of space in a table space in any given transaction. Given that WAL is
on a
> different partition (at least for the moment) the table space will not
> have transaction X's data written to it until after transaction X is
> finished. And we cannot error out a transaction which is already
commited.
> 
> The solution is to keep track of free space and error out at some
> percentage of free space remaining. But I don't want to complicate
> tablespaces too much in 7.5.

You're absolutely right about the not-knowing when you're out of space
issue. However, if the xlog has been written then it is not desirable,
but at least acceptable that the checkpoint/bgwriter cannot complete on
an already committed txn. It's not the txn which is getting the error,
that's all.

Hmmm...I'm not sure that we'll be able or should avoid the out of space
situation completely. The question is...what will we do when we hit it?
It doesn't matter whether you stop at 100% or 90% or whatever, you still
have to stop and then what? Stay up as long as possible hopefully: If
there wasn't enough space to write to the tablespace, going into
recovery won't help the situation either; youre still out of space until
you fix that. We now have the option not to crash, since it might be
perfectly viable to keep on chugging away on one Tablespace even though
all txn work on the out-of-space tablespace is frozen/barred etc. Sounds
like a refinement, but something to keep in mind at the design stage if
we can.

The problem is that tablespaces do complicate space management (that's
what people want though, so that's OK). That complicates admin and so pg
will hit many more out of space errors than we've seen previously.
Trying to work out how to spot these ahead of time, accept user defined
limits on each tablespace etc sounds like extra complexity for the
initial drop. I guess my own suggested approach is to start by handling
the error cases, then go back and try to avoid some of them.

All of this exposes for me the complication that doing PITR and
tablespaces at the same time is likely to be more complex for us both
than either had envisaged. The reduced complexity for PITR was what I
was shooting for, also! I'm happy to work together on any issues that
arise.

For PITR, I think we would need:
- a very accessible list of tablespace locations, so taking a full
physical database backup can be easily accomplished using OS utilities.
Hopefully a list maintained external to the database? We have the
equivalent now with env variables.
- decisions about what occurs when for-whatever-reason one or more
tablespaces are not recoverable from backup?
- it might be desirable to allow recovery with less than all of the
original tablespces
- it might also be desirable to allow recovery when the tablespaces txn
Ids don't match (though that is forbidden on many other dbms)

Best Regards, Simon Riggs

Re: Tablespaces

From

Tom Lane

Date:

01 March 2004, 21:22:44

"Simon Riggs" <simon@2ndquadrant.com> writes:
> Gavin Sherry wrote:
>> I do not intend to work on such a system for the initial introduction of
>> table spaces. The problem is, of course, knowing when you're actually out
>> of space in a table space in any given transaction. Given that WAL is on a
>> different partition (at least for the moment) the table space will not
>> have transaction X's data written to it until after transaction X is
>> finished. And we cannot error out a transaction which is already
>> commited.

As long as the kernel doesn't lie about file extension, we will not
commit any transaction that requires a disallowed increase in the
allocated size of data files, because allocation of another table page
is checked with the kernel during the transaction.  So on most
filesystems (maybe not NFS) the problem Gavin is worried about doesn't
exist.

> You're absolutely right about the not-knowing when you're out of space
> issue. However, if the xlog has been written then it is not desirable,
> but at least acceptable that the checkpoint/bgwriter cannot complete on
> an already committed txn. It's not the txn which is getting the error,
> that's all.

Right.  This is in fact not a fatal situation, as long as you don't run
out of preallocated WAL space.  For a recent practical example of our
behavior under zero-free-space conditions, see this thread:
http://archives.postgresql.org/pgsql-hackers/2004-01/msg00530.php
particularly the post-mortem here:
http://archives.postgresql.org/pgsql-hackers/2004-01/msg00606.php
Barring one small bug, the database would likely have stayed up, and
continued to service at least the read-only transactions, until Chris
got around to freeing some disk space.

I think it is sufficient (at least in the near term) to expect people to
use partition size limits if they want to control database size --- that
is, make a partition of the desired size and put the database directory
in there.  Tablespaces as per the design we are discussing would make it
easier to apply such a policy to a sub-area of a database cluster than
it is today, but they needn't in themselves implement the restriction.
        regards, tom lane

Re: Out of space situation and WAL log pre-allocation (was Tablespaces)

From

"Simon Riggs"

Date:

02 March 2004, 18:53:20

Tom Lane [mailto:tgl@sss.pgh.pa.us]
> "Simon Riggs" <simon@2ndquadrant.com> writes:
> > You're absolutely right about the not-knowing when you're out of
space
> > issue. However, if the xlog has been written then it is not
desirable,
> > but at least acceptable that the checkpoint/bgwriter cannot complete
on
> > an already committed txn. It's not the txn which is getting the
error,
> > that's all.
> 
> Right.  This is in fact not a fatal situation, as long as you don't
run
> out of preallocated WAL space.  

...following on also from thoughts on [PERFORM] list...

Clearly running out of pre-allocated WAL space is likely to be the next
issue. Running out of space in the first place is likely to be because
of an intense workload, which is exactly the thing which also makes you
run out of pre-allocated WAL space. Does that make sense?

Best regards, Simon Riggs

Re: Tablespaces

From

Christopher Kings-Lynne

Date:

02 March 2004, 23:30:46

> A table space parameter will be added to DDL commands which create
> physical database objects (CREATE DATABASE/INDEX/TABLE/SEQUENCE) and to
> CREATE SCHEMA. The associated routines, as well as the corresponding DROP
> commands will need to be updated. Adding the ability to ALTER <object>
> TABLESPACE <name> seems a little painful. Would people use it? Comments?

How about allowing the specification on schemas and databases of 
different default tablespaces for TEMP, TABLE and INDEX??  Is there any 
point to that?

Chris

Re: Tablespaces

From

Christopher Kings-Lynne

Date:

02 March 2004, 23:32:27

> I've been looking at implementing table spaces for 7.5. Some notes and
> implementation details follow.

Ah sorry, other things you might need to consider:

Privileges on tablespaces:

GRANT USAGE ON TABLESPACE tbsp TO ...;

Different disk settings for different tablespaces (since they will 
likely be on different disks):

ALTER TABLESPACE tbsp SET random_page_cost TO 2.5;

Chris

Re: Tablespaces

From

Tom Lane

Date:

02 March 2004, 23:53:01

Christopher Kings-Lynne <chriskl@familyhealth.com.au> writes:
> How about allowing the specification on schemas and databases of 
> different default tablespaces for TEMP, TABLE and INDEX??  Is there any 
> point to that?

TEMP tables are not local to any particular schema, so it wouldn't make
sense to have a schema-level default for their placement.

The other five combinations are at least theoretically sensible, but
do we need 'em all?  It seems to me that a reasonable compromise is to
offer database-level default tablespaces for TEMP, TABLE, and INDEX,
ignoring the schema level.  This is simple and understandable, and if
you don't like it, you're probably the kind of guy who will want to
override it per-table anyway ...

BTW, another dimension to think about is where TOAST tables and their
indexes will get placed.
        regards, tom lane

Re: Out of space situation and WAL log pre-allocation (was Tablespaces)

From

Tom Lane

Date:

03 March 2004, 00:12:37

"Simon Riggs" <simon@2ndquadrant.com> writes:
> Tom Lane wrote:
>> Right.  This is in fact not a fatal situation, as long as you don't
>> run out of preallocated WAL space.  

> Clearly running out of pre-allocated WAL space is likely to be the next
> issue. Running out of space in the first place is likely to be because
> of an intense workload, which is exactly the thing which also makes you
> run out of pre-allocated WAL space. Does that make sense?

I think one of the first things people would do with tablespaces is
stick the data files onto a separate partition from the WAL and clog
files.  (Actually you can do this today with a simple symlink hack, but
tablespaces will make it easier and clearer.)  The space usage for WAL
is really pretty predictable, because of the checkpoint-at-least-
every-N-segments setting.  clog is not exactly a space hog either.
Once you have that separation established, out-of-disk-space can kill
individual transactions but never the database as a whole.

One of the things that bothers me about the present PITR design is that
it presumes that individual WAL log segments can be kept until the
external archiver process feels like writing them somewhere.  If there's
no guarantee that that happens within X amount of time, then you can't
bound the amount of space needed on the WAL drive, and so you are back
facing the possibility of an out-of-WAL-space panic.  I suspect that we
cannot really do anything about that, but it's annoying.  Any bright
ideas out there?
        regards, tom lane

Re: Tablespaces

From

Bruce Momjian

Date:

03 March 2004, 00:24:07

Gavin Sherry wrote:
> The actual creation of the table space will be done with:
> 
>     CREATE TABLE SPACE <name> LOCATION </path/to/tblspc>;

Seems you should use CREATE TABLESPACE (no space) so it is more distinct
from CREATE TABLE.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073

Re: Tablespaces

From

Bruce Momjian

Date:

03 March 2004, 00:24:38

Richard Huxton wrote:
> On Thursday 26 February 2004 10:07, Gavin Sherry wrote:
> >
> > CREATE TABLESPACE tbl1 LOCATION '/var/'
> >
> > which will result in something like '/var/123443' is a bad idea. Then
> > again, the user should know better. Comments?
> 
> The LOCATION should have the same owner and permissions as $PGDATA - that 
> should catch mistyping.
> 
> Unless you're running as root, of course. In which case you clearly know 
> better than everyone else, so off you go!

FYI, you can't run the postmaster as root.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073

Re: Tablespaces

From

Bruce Momjian

Date:

03 March 2004, 00:25:24

Joshua D. Drake wrote:
> > Is it possible to put WALs and CLOGs into different tablespaces? (maybe 
> > different RAID systems). Some companies want that ...
> 
> You can do this now, but it would be nice to be able to have it actually 
> configurable versus the hacked linked method.

Agreed, but because the system has to be down to move pg_xlog, I think
we should write a command-line utility to assist with this, perhaps.  It
could check permissions and stuff.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073

Re: Tablespaces

From

Bruce Momjian

Date:

03 March 2004, 00:26:32

Tom Lane wrote:
> Gavin Sherry <swm@linuxworld.com.au> writes:
> > A table space is a directory structure. The directory structure is as
> > follows:
> > [swm@dev /path/to/tblspc]$ ls
> > OID1/    OID2/
> > OID1 and OID2 are the OIDs of databases which have created a table space
> > against this file system location. In this respect, a table space
> > resembles $PGDATA/base. I thought it useful to keep this kind of
> > namespace mechanism in place ...
> 
> Actually, this is *necessary* AFAICT.  The case that forces it is DROP
> DATABASE.  Since you have to execute that from another database, there's
> no reasonable way to look into the target database's catalogs.  That
> means that the OID of the database has to be sufficient information to
> get rid of all its files.  You can do this fairly easily if in each
> tablespace (whose locations you know from the shared pg_tablespace
> table) you can look for a subdirectory matching the target database's
> OID.  If we tried to put the database's files just "loose" in each
> tablespace directory then we'd be in trouble.
> 

Gavin, let us know if you want us to create the global pg_tablespace for
you.  Some of us have done a lot of system catalog work.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073

Re: Tablespaces

From

Gavin Sherry

Date:

03 March 2004, 00:26:38

On Tue, 2 Mar 2004, Bruce Momjian wrote:

> Gavin Sherry wrote:
> > The actual creation of the table space will be done with:
> >
> >     CREATE TABLE SPACE <name> LOCATION </path/to/tblspc>;
>
> Seems you should use CREATE TABLESPACE (no space) so it is more distinct
> from CREATE TABLE.

Oops. Typo.

>
> --
>   Bruce Momjian                        |  http://candle.pha.pa.us
>   pgman@candle.pha.pa.us               |  (610) 359-1001
>   +  If your life is a hard drive,     |  13 Roberts Road
>   +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
>
>
> !DSPAM:40455de0297537578347468!
>
>

Re: Tablespaces

From

Bruce Momjian

Date:

03 March 2004, 00:27:58

Tom Lane wrote:
> Gavin Sherry <swm@linuxworld.com.au> writes:
> >> Speaking of locking, can we do anything to prevent people from shooting
> >> themselves in the foot by changing active tablespaces?  Are we even
> >> going to have a DROP TABLESPACE command, and if so what would it do?
> 
> > Ahh. I forgot to detail my ideas on this. It seems to me that we cannot
> > drop a table space until the directory is empty.
> 
> How would it get to be empty?  Are you thinking of some sort of "connect
> database to tablespace" and "disconnect database from tablespace"
> commands that would respectively create and delete the per-database
> subdirectory?  That seems moderately reasonable to me.  We could then
> invent a locking protocol that requires backends to lock a tablespace
> before they can execute either of these operations (or delete the
> tablespace of course).

One crude solution would be to remove the tablespace oid directory only
when the database is dropped, and require an empty tablespace directory
to drop the tablespace.  This allows a lock only on tablespace creation,
and not a lock on object creation in each tablespace.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073

Re: Out of space situation and WAL log pre-allocation (was

From

Joe Conway

Date:

03 March 2004, 00:33:53

Tom Lane wrote:
> One of the things that bothers me about the present PITR design is that
> it presumes that individual WAL log segments can be kept until the
> external archiver process feels like writing them somewhere.  If there's
> no guarantee that that happens within X amount of time, then you can't
> bound the amount of space needed on the WAL drive, and so you are back
> facing the possibility of an out-of-WAL-space panic.  I suspect that we
> cannot really do anything about that, but it's annoying.  Any bright
> ideas out there?

Maybe specify an archive location (that of course could be on a separate 
partition) that the external archiver should check in addition to the 
normal WAL location. At some predetermined interval, push WAL log 
segments no longer needed to the archive location.

Joe

Re: Tablespaces

From

Bruce Momjian

Date:

03 March 2004, 00:34:18

Greg Stark wrote:
>
> > > I am expecting to hear some bleating about this from people whose
> > > preferred platforms don't support symlinks ;-).  However, if we don't
>
> Well, one option would be to have the low level filesystem storage (md.c?)
> routines implement a kind of symlink themselves. Just a file with a special
> magic number followed by a path.
>
> I'm normally against reimplementing OS services but symlinks are really a very
> simple concept and simple to implement. Especially if you can make a few
> simplifying assumptions: they only ever need to appear as the final path
> element not as parent directories and tablespaces don't need symlinks pointing
> to symlinks. Ideally postgres also doesn't need to implement relative links
> either.

I just checked from the MinGW console and I see:

    # touch a
    # ln -s a b
    # echo test >a
    # cat b
    # l ?
    -rw-r--r--    1 Bruce Mo Administ        5 Mar  2 23:30 a
    -rw-r--r--    1 Bruce Mo Administ        0 Mar  2 23:30 b
    # cat a
    test
    # cat b
    #

It accepts ln -s, but does nothing with it.

For tablespaces on OS's that don't support it, I think we will have to
store the path name in the file and read it via the backend.  Somehow we
should cache those lookups.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: Tablespaces

From

Bruce Momjian

Date:

03 March 2004, 00:35:18

Tom Lane wrote:
> "scott.marlowe" <scott.marlowe@ihs.com> writes:
> > Is possible / reasonable / smart and or dumb to look at implementing the 
> > tablespaces as riding atop the initlocation handled stuff.
> 
> In my mind, one of the main benefits of this work will be that we'll be
> able to get *rid* of the initlocation stuff.  It's a crock.

Agreed. It should be ripped out once we have tablespaces, and if we keep
it for one extra release, there will be confusion over which to use.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073

Re: Tablespaces

From

Bruce Momjian

Date:

03 March 2004, 00:37:24

Josh Berkus wrote:
> #3: ALTER TABLE .... CHANGE TABLESPACE:
> This is strictly in the class of "would be a very nice & useful feature if 
> it's not too difficult". ? 
> 
> Given how painful it is to drop & replace a table with multiple dependencies 
> (on some databases, only possible by droping & re-loading the entire 
> database) it would be nice to have an ALTER TABLE command that moved the 
> table to another tablespace. ? ?It doesn't *seem* to me that this would be a 
> very challenging bit of programming, as the operation would be very similar 
> to REINDEX in the manipulation of files. ? (But what I know, really?)
> 
> Once tablespaces are a feature and some users start using them for quota 
> management, there will quickly develop situations where the original 
> tablespace for a db runs out of room and can't be resized. ? Being able to 
> move the table "in situ" then becomes vital, especially on very large 
> databases ... and when someday combined with partitioned tables, will become 
> essential.
> 
> Further, we will get an *immediate* flurry of requests from users who just 
> upgraded to 7.5 and want to make use of the tablespaces feature on an 
> existing production database.

If we don't implement moving tables between tablespaces, we should add a
stub for it in the grammer and mention it is not implemented yet,
because if we don't, we will get tons of questions.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073

Re: Tablespaces

From

Bruce Momjian

Date:

03 March 2004, 00:40:42

Gavin Sherry wrote:
> Actually, I think that's a pretty good idea :-). I'd solves a bunch of
> issues in the backend (postmaster start up can recurse through
> $PGDATA/tablespaces looking for postmaster.pid files) and will also assist
> admins with complex configurations (perhaps).

Why are you asking about postmaster.pid files.  That file goes in the
top level /data directory, no?

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073

Re: Tablespaces

From

Gavin Sherry

Date:

03 March 2004, 00:47:16

On Tue, 2 Mar 2004, Bruce Momjian wrote:

> Gavin Sherry wrote:
> > Actually, I think that's a pretty good idea :-). I'd solves a bunch of
> > issues in the backend (postmaster start up can recurse through
> > $PGDATA/tablespaces looking for postmaster.pid files) and will also assist
> > admins with complex configurations (perhaps).
>
> Why are you asking about postmaster.pid files.  That file goes in the
> top level /data directory, no?

I was trying to be paranoid about users who have multiple postmasters on
the same machine and want to share a table space while both systems are
live. There'd be no mechanism to test for that situation if we didn't have
something like a postmaster.pid file. Is this being a little too paranoid?

Gavin

Re: Out of space situation and WAL log pre-allocation (was Tablespaces)

From

Tom Lane

Date:

03 March 2004, 00:55:14

Joe Conway <mail@joeconway.com> writes:
> Tom Lane wrote:
>> facing the possibility of an out-of-WAL-space panic.  I suspect that we
>> cannot really do anything about that, but it's annoying.  Any bright
>> ideas out there?

> Maybe specify an archive location (that of course could be on a separate 
> partition) that the external archiver should check in addition to the 
> normal WAL location. At some predetermined interval, push WAL log 
> segments no longer needed to the archive location.

Does that really help?  The panic happens when you fill the "normal" and
"archive" partitions, how's that different from one partition?
        regards, tom lane

Re: Tablespaces

From

Tom Lane

Date:

03 March 2004, 01:00:17

Gavin Sherry <swm@linuxworld.com.au> writes:
> I was trying to be paranoid about users who have multiple postmasters on
> the same machine and want to share a table space while both systems are
> live. There'd be no mechanism to test for that situation if we didn't have
> something like a postmaster.pid file. Is this being a little too paranoid?

Hm.  AFAICS there is no safe situation in which a tablespace directory
could be shared by two different installations (== toplevel $PGDATA
directories).  I don't think we need a dynamic postmaster.pid-type lock
to protect them.  What might make sense is some sort of marker file in a
tablespace directory that links back to the owning $PGDATA directory.
CREATE TABLESPACE should create this, or reject if it already exists.
        regards, tom lane

Re: Tablespaces

From

Bruce Momjian

Date:

03 March 2004, 01:00:58

Gavin Sherry wrote:
> On Tue, 2 Mar 2004, Bruce Momjian wrote:
> 
> > Gavin Sherry wrote:
> > > Actually, I think that's a pretty good idea :-). I'd solves a bunch of
> > > issues in the backend (postmaster start up can recurse through
> > > $PGDATA/tablespaces looking for postmaster.pid files) and will also assist
> > > admins with complex configurations (perhaps).
> >
> > Why are you asking about postmaster.pid files.  That file goes in the
> > top level /data directory, no?
> 
> I was trying to be paranoid about users who have multiple postmasters on
> the same machine and want to share a table space while both systems are
> live. There'd be no mechanism to test for that situation if we didn't have
> something like a postmaster.pid file. Is this being a little too paranoid?

Oh, yikes, I see.  Right now we have the interlock on the /data
directory, but once you start moving stuff out from under /data using
tablespaces, we do perhaps loose the interlock.  However, I assume the
CREATE TABLESPACE is going to create the tablespace directory, so I
don't see how two postmasters could both create the directory.

For example, if you say 
CREATE TABLESPACE tb IN '/var/tb1'

I assume you have to create:
/var/tb1/pgsql_tablespace

and then
/var/tb1/pgsql_tablespace/oid1/var/tb1/pgsql_tablespace/oid2

or something like that, and set the proper permissions on
pgsql_tablespace.  We will have write permission on the directory they
pass to us, but we might not have permissions to change the mode of the
directory they pass, so we have to create a subdirectory anyway, and
that is our interlock.

For example:# run as root$ chmod a+w .$ ls -ld .drwxrwxrwx  2 root  wheel  512 Mar  2 23:51 .# run as the postmaster$
mkdirnew$ ls -ld newdrwxr-xr-x  2 postgres  wheel  512 Mar  2 23:52 new$ chmod 700 new$ chmod 700 .chmod: .: Operation
notpermittedchmod: .: Operation not permitted

As you can see, I have permission to create the /new directory, but no
ability to set its mode, so we have to create a directory that matches
the permissions of /data:
drwx------  6 postgres  postgres  512 Mar  2 12:48 /u/pg/data/

We could require the admin to create a directory that we own instead of
just one that we have write permission in, but why bother when we can
use the new directory as an interlock from multiple postmasters anyway.

Right now we do require the directory used as /data be one where we can
create a /data subdirectory, so this seems similar.  We don't put the
data directly in the passed directory, but in /data under that.  In
fact, we could just call it /var/tb1/data instead of
/var/tb1/pgsql_tablespace.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073

Re: Out of space situation and WAL log pre-allocation (was

From

Joe Conway

Date:

03 March 2004, 01:13:12

Tom Lane wrote:
> Joe Conway <mail@joeconway.com> writes:
>>Maybe specify an archive location (that of course could be on a separate 
>>partition) that the external archiver should check in addition to the 
>>normal WAL location. At some predetermined interval, push WAL log 
>>segments no longer needed to the archive location.
> 
> Does that really help?  The panic happens when you fill the "normal" and
> "archive" partitions, how's that different from one partition?

I see your point. But it would allow you to use a relatively modest 
local partition for WAL segments, while you might be using a 1TB netapp 
tray over NFS for the archive segments. I guess if the archive partition 
fills up, I would err on the side of dropping archive segments on the 
floor. That would mean a new full backup would be needed, but at least 
it wouldn't result in a corrupt, or shut down, database.

Joe

Re: [pgsql-hackers-win32] Tablespaces

From

Tom Lane

Date:

03 March 2004, 01:18:53

Bruce Momjian <pgman@candle.pha.pa.us> writes:
> For tablespaces on OS's that don't support it, I think we will have to
> store the path name in the file and read it via the backend.  Somehow we
> should cache those lookups.

My feeling is that we need not support tablespaces on OS's without
symlinks.

            regards, tom lane

Re: [pgsql-hackers-win32] Tablespaces

From

Bruce Momjian

Date:

03 March 2004, 01:20:15

Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > For tablespaces on OS's that don't support it, I think we will have to
> > store the path name in the file and read it via the backend.  Somehow we
> > should cache those lookups.
>
> My feeling is that we need not support tablespaces on OS's without
> symlinks.

Agreed, but are we going to support non-tablespace installs?  I wasn't
sure that was an option.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: [pgsql-hackers-win32] Tablespaces

From

Gavin Sherry

Date:

03 March 2004, 01:28:34

On Wed, 3 Mar 2004, Tom Lane wrote:

> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > For tablespaces on OS's that don't support it, I think we will have to
> > store the path name in the file and read it via the backend.  Somehow we
> > should cache those lookups.
>
> My feeling is that we need not support tablespaces on OS's without
> symlinks.

I'm going to focus on implementing this on the system(s) I'm used to
developing on (ie, those which support symlinks). Once that is done, I'll
talk with the Win32 guys about what, if anything, we can do about getting
this to work on Win32 (and possibly other non-symlink supporting OSs).

Thanks,

Gavin

Re: Out of space situation and WAL log pre-allocation (was Tablespaces)

From

Tom Lane

Date:

03 March 2004, 01:34:01

Joe Conway <mail@joeconway.com> writes:
> Tom Lane wrote:
>> Joe Conway <mail@joeconway.com> writes:
>>> Maybe specify an archive location (that of course could be on a separate 
>>> partition) that the external archiver should check in addition to the 
>>> normal WAL location. At some predetermined interval, push WAL log 
>>> segments no longer needed to the archive location.
>> 
>> Does that really help?  The panic happens when you fill the "normal" and
>> "archive" partitions, how's that different from one partition?

> I see your point. But it would allow you to use a relatively modest 
> local partition for WAL segments, while you might be using a 1TB netapp 
> tray over NFS for the archive segments.

Fair enough, but it seems to me that that sort of setup really falls in
the category of a user-defined archiving process --- that is, the hook
that Postgres calls will push WAL segments from the local partition to
the NFS server, and then pushing them off NFS to tape is the
responsibility of some other user-defined subprocess.  Database panic
happens if and only if the local partition overflows.  I don't see that
making Postgres explicitly aware of the secondary NFS arrangement will
buy anything.

> I guess if the archive partition fills up, I would err on the side of
> dropping archive segments on the floor.

That should be user-scriptable policy, in my worldview.

We haven't yet talked much about what the WAL-segment-archiving API
should look like, but if it cannot support implementing the above kind
of arrangement outside the database, then we've dropped the ball.
IMHO anyway.
        regards, tom lane

Re: [pgsql-hackers-win32] Tablespaces

From

Tom Lane

Date:

03 March 2004, 01:36:51

Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Tom Lane wrote:
>> My feeling is that we need not support tablespaces on OS's without
>> symlinks.

> Agreed, but are we going to support non-tablespace installs?  I wasn't
> sure that was an option.

A setup containing only the default tablespace cannot use any symlinks.
That doesn't seem hard though.

            regards, tom lane

Re: Tablespaces

From

Claudio Natoli

Date:

03 March 2004, 06:13:01

Bruce Momjian writes:
> I just checked from the MinGW console and I see:
> [snip]
> It accepts ln -s, but does nothing with it.

And even if it had worked, it wouldn't really matter, since we don't
actually want to *run* the system under MinGW/msys, just build it.

I think the idea of implementing in symlinks for non-compliant platforms in
md.c has some merit. FWIW, looks like that is how cygwin implements
symlinks...

Cheers,
Claudio

---
Certain disclaimers and policies apply to all email sent from Memetrics.
For the full text of these disclaimers and policies see
<a
href="http://www.memetrics.com/emailpolicy.html">http://www.memetrics.com/em
ailpolicy.html</a>

Re: Tablespaces

From

Oliver Elphick

Date:

03 March 2004, 07:59:03

On Wed, 2004-03-03 at 04:59, Tom Lane wrote:
>    What might make sense is some sort of marker file in a
> tablespace directory that links back to the owning $PGDATA directory.
> CREATE TABLESPACE should create this, or reject if it already exists.

It will not be enough for the marker to list the path of the parent
$PGDATA, since that path might get changed by system administration
action.  The marker should contain some sort of unique string which
would match the same string somewhere in $PGDATA.  Then, if either
tablespace or $PGDATA were moved, it would be possible to tie the two
back together.  It wouldn't be an issue on most normal systems, but
might be of crucial importance for an ISP running numerous separate
clusters.
-- 
Oliver Elphick <olly@lfix.co.uk>
LFIX Ltd

Re: [pgsql-hackers-win32] Tablespaces

From

Bruce Momjian

Date:

03 March 2004, 09:31:17

Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Tom Lane wrote:
> >> My feeling is that we need not support tablespaces on OS's without
> >> symlinks.
>
> > Agreed, but are we going to support non-tablespace installs?  I wasn't
> > sure that was an option.
>
> A setup containing only the default tablespace cannot use any symlinks.
> That doesn't seem hard though.

Yea, I think you are right.  We just disable CREATE TABLESPACE and the
rest should just work.  Basically, pg_tablespace will only have one
entry on those platforms.  The initdb directory structure will have a
single tablespace, but that doesn't use symlinks.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: [pgsql-hackers-win32] Tablespaces

From

Bruce Momjian

Date:

03 March 2004, 09:32:26

Claudio Natoli wrote:
>
> Bruce Momjian writes:
> > I just checked from the MinGW console and I see:
> > [snip]
> > It accepts ln -s, but does nothing with it.
>
> And even if it had worked, it wouldn't really matter, since we don't
> actually want to *run* the system under MinGW/msys, just build it.
>
> I think the idea of implementing in symlinks for non-compliant platforms in
> md.c has some merit. FWIW, looks like that is how cygwin implements
> symlinks...

Why can't we use MS Win32 shortcut files to simulate symlinks?  MinGW
doesn't do it, so I suppose it isn't possible.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: Tablespaces

From

Tom Lane

Date:

03 March 2004, 10:37:25

Oliver Elphick <olly@lfix.co.uk> writes:
> It will not be enough for the marker to list the path of the parent
> $PGDATA, since that path might get changed by system administration
> action.  The marker should contain some sort of unique string which
> would match the same string somewhere in $PGDATA.

We have already added a notion of a "unique installation identifier"
for PITR purposes (look in pg_control).  So we could use that for this
purpose if we wanted to.

But I'm not sure how important it really is.  AFAICS the behavior of
CREATE TABLESPACE will be "create marker file, if it already exists
then abort".  It has no need to actually look in the file and so there's
no need for the contents to be unique.
        regards, tom lane

Re: Tablespaces

From

Greg Stark

Date:

03 March 2004, 12:36:20

Bruce Momjian <pgman@candle.pha.pa.us> writes:

> Greg Stark wrote:
> > 
> > > > I am expecting to hear some bleating about this from people whose
> > > > preferred platforms don't support symlinks ;-).  However, if we don't
> > 
> > Well, one option would be to have the low level filesystem storage (md.c?)
> > routines implement a kind of symlink themselves. Just a file with a special
> > magic number followed by a path.

On further contemplation it doesn't seem like using symlinks really ought to
be necessary. It should be possible to drive everything off the catalog tables
while avoidin having the low level filesystem code know anything about them.

Instead of having the low level code fetch the pg_* records themselves, some
piece of higher level code would do the query and call down to storage layer
to inform it of the locations for everything. It would have to do this on
database initialization and on any subsequent object creation.

Basically maintain an in-memory hash table of oid -> path, and call down to
the low level code whenever that hash changes. (Or more likely oid->ts_id and
a separate list of ts_id -> path.)

-- 
greg

Re: Tablespaces

From

Barry Lind

Date:

03 March 2004, 13:18:23

Oliver Elphick wrote:
> On Wed, 2004-03-03 at 04:59, Tom Lane wrote:
> 
>>   What might make sense is some sort of marker file in a
>>tablespace directory that links back to the owning $PGDATA directory.
>>CREATE TABLESPACE should create this, or reject if it already exists.
> 
> 
> It will not be enough for the marker to list the path of the parent
> $PGDATA, since that path might get changed by system administration
> action.  The marker should contain some sort of unique string which
> would match the same string somewhere in $PGDATA.  Then, if either
> tablespace or $PGDATA were moved, it would be possible to tie the two
> back together.  It wouldn't be an issue on most normal systems, but
> might be of crucial importance for an ISP running numerous separate
> clusters.

Taking this one step further would be to do something like Oracle does.  Every datafile in Oracle (because the Oracle
storagemanager stores 

multiple objects inside datafiles, one could say there is some 
similarity between Oracle datafiles and the proposed pg tablespaces), 
has meta info that tells it which database instance it belongs to and 
the last checkpoint that occured (It might actually be more granular 
than checkpoint, such that on a clean shutdown you can tell that all 
datafiles are consistent with each other and form a consistent database 
instance).  So Oracle on every checkpoint updates all datafiles with an 
identifier.  Now you might ask why is this useful.  Well in normal day 
to day operation it isn't, but it can be usefull in disaster recovery. 
If you loose a disk and need to restore the entire database from backups 
it can be difficult to make sure you have done it all correctly (do I 
have all the necessary files/directories? did I get the right ones from 
the right tapes?)  Especially if you have directories spread across 
various different disks that might be backed up to different tapes.  So 
by having additional information stored in each datafile Oracle can 
provide additional checks that the set of files that are being used when 
the database starts up are consistent and all belong together.  Oracle 
also ensures that all the datafiles that are suposed to exist actually 
do as well.

So what might this mean for postgres and tablespaces?  It could mean 
that on startup the database checks to verify that all the tablespaces 
that are registered actually exist.  And that the data in each 
tablespace is consistent with the current WAL status.  (i.e. someone 
didn't restore a tablespace from backup while the database was down that  is old and needs recovery.

A lot of what I am talking about here become PITR issues.  But since 
PITR and tablespaces are both potential features for 7.5, how they 
interact probably should be thought about in the designs for each.

thanks,
--Barry

Re: Tablespaces

From

Bruce Momjian

Date:

03 March 2004, 13:52:48

Greg Stark wrote:
> 
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> 
> > Greg Stark wrote:
> > > 
> > > > > I am expecting to hear some bleating about this from people whose
> > > > > preferred platforms don't support symlinks ;-).  However, if we don't
> > > 
> > > Well, one option would be to have the low level filesystem storage (md.c?)
> > > routines implement a kind of symlink themselves. Just a file with a special
> > > magic number followed by a path.
> 
> On further contemplation it doesn't seem like using symlinks really ought to
> be necessary. It should be possible to drive everything off the catalog tables
> while avoidin having the low level filesystem code know anything about them.
> 
> Instead of having the low level code fetch the pg_* records themselves, some
> piece of higher level code would do the query and call down to storage layer
> to inform it of the locations for everything. It would have to do this on
> database initialization and on any subsequent object creation.
> 
> Basically maintain an in-memory hash table of oid -> path, and call down to
> the low level code whenever that hash changes. (Or more likely oid->ts_id and
> a separate list of ts_id -> path.)

The advantage of symlinks is that an administrator could see how things
are laid out from the command line.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073

Re: Out of space situation and WAL log pre-allocation (was Tablespaces)

From

"Simon Riggs"

Date:

03 March 2004, 17:55:26

>Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Joe Conway <mail@joeconway.com> writes:
> > Tom Lane wrote:
> >> Joe Conway <mail@joeconway.com> writes:
> >>> Maybe specify an archive location (that of course could be on a
> separate
> >>> partition) that the external archiver should check in addition to
the
> >>> normal WAL location. At some predetermined interval, push WAL log
> >>> segments no longer needed to the archive location.
> >>
> >> Does that really help?  The panic happens when you fill the
"normal"
> and
> >> "archive" partitions, how's that different from one partition?
> 
> > I see your point. But it would allow you to use a relatively modest
> > local partition for WAL segments, while you might be using a 1TB
netapp
> > tray over NFS for the archive segments.
> 
> Fair enough, but it seems to me that that sort of setup really falls
in
> the category of a user-defined archiving process --- that is, the hook
> that Postgres calls will push WAL segments from the local partition to
> the NFS server, and then pushing them off NFS to tape is the
> responsibility of some other user-defined subprocess.  Database panic
> happens if and only if the local partition overflows.  I don't see
that
> making Postgres explicitly aware of the secondary NFS arrangement will
> buy anything.

Tom's last sentence there summarises the design I was working with. I
had considered Joe's suggested approach (which was Oracle's also).

However, the PITR design will come with a usable low-function program
which can easily copy logs from pg_xlog to another archive directory.
That's needed as a test harness anyway, so it may as well be part of the
package. You'd be able to use that in production to copy xlogs to
another larger directory as a staging area to tape/failover on another
system: effectively Joe's idea is catered for in the basic package.

Anyway I'm answering questions before publishing the design as
stands...though people do keep spurring me to refine it as I'm writing
it down! That's why its good to document it I guess.
> > I guess if the archive partition fills up, I would err on the side
of
> > dropping archive segments on the floor.
> 
> That should be user-scriptable policy, in my worldview.

Hmmm. Very difficult that one.

My experience is in commercial systems. Dropping archive segments on the
floor is just absolutely NOT GOOD, if that is the only behaviour. The
whole purpose of having a dbms is so that you can protect your business
data, while using it. Such behaviour would most likely be a barrier to
wider commercial adoption. [Oracle and other dbms will freeze when this
situation is hit, rather than continue and drop archive logs.]
User-selectable behaviour? OK. That's how we deal with fsync; I can
relate to that. That hadn't been part of my thinking because of the
importance I'd attached to the log files themselves, but I can go with
that, if that's what was meant.

So, if we had a parameter called Wal_archive_policy that has 3 settings:
None = no archiving
Optimistic = archive, but if for some reason log space runs out then
make space by dropping the oldest archive logs 
Strict = if log space runs out, stop further write transactions from
committing, by whatever means, even if this takes down dbms.

That way, we've got something akin to transaction isolation level with
various levels of protection.

Best Regards, Simon Riggs

Re: Out of space situation and WAL log pre-allocation (was Tablespaces)

From

Tom Lane

Date:

03 March 2004, 18:14:07

Joe Conway <mail@joeconway.com> writes:
> Simon Riggs wrote:
>> O... and other dbms will freeze when this situation is hit, rather
>> than continue and drop archive logs.]

> Been there, done that, don't see how it's any better. I hesitate to be
> real specific here, but let's just say the end result was restore from 
> backup :-(

It's hard for me to imagine a situation in which killing the database
would be considered a more attractive option than dropping old log
data.  You may or may not ever need the old log data, but you darn well
do need a functioning database.  (If you don't, you wouldn't be going to
all this work.)

I think also that Simon completely misunderstood my intent in saying
that this could be "user-scriptable policy".  By that I meant that the
*user* could write the code to behave whichever way he liked.  Not that
we were going to go into a mad rush of feature invention and try to
support every combination we could think of.  I repeat: code that pushes
logs into a secondary area is not ours to write.  We should concentrate
on providing an API that lets users write it.  We have only limited
manpower for this project and we need to spend it on getting the core
functionality done right, not on inventing frammishes.
        regards, tom lane

Re: Out of space situation and WAL log pre-allocation (was

From

Bruce Momjian

Date:

03 March 2004, 18:34:44

Simon Riggs wrote:
> User-selectable behaviour? OK. That's how we deal with fsync; I can
> relate to that. That hadn't been part of my thinking because of the
> importance I'd attached to the log files themselves, but I can go with
> that, if that's what was meant.
> 
> So, if we had a parameter called Wal_archive_policy that has 3 settings:
> None = no archiving
> Optimistic = archive, but if for some reason log space runs out then
> make space by dropping the oldest archive logs 
> Strict = if log space runs out, stop further write transactions from
> committing, by whatever means, even if this takes down dbms.
> 
> That way, we've got something akin to transaction isolation level with
> various levels of protection.

Yep, we will definately need something like that.  Basically whenever
the logs are being archived, you have to stop the database if you can't
archive, no?

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073

Re: Out of space situation and WAL log pre-allocation (was

From

Joe Conway

Date:

03 March 2004, 18:41:05

Simon Riggs wrote:
>> Tom Lane [mailto:tgl@sss.pgh.pa.us] That should be user-scriptable
>> policy, in my worldview.

> O... and other dbms will freeze when this situation is hit, rather
> than continue and drop archive logs.]

Been there, done that, don't see how it's any better. I hesitate to be
real specific here, but let's just say the end result was restore from 
backup :-(

> So, if we had a parameter called Wal_archive_policy that has 3
> settings: None = no archiving Optimistic = archive, but if for some
> reason log space runs out then make space by dropping the oldest
> archive logs Strict = if log space runs out, stop further write
> transactions from committing, by whatever means, even if this takes
> down dbms.

That sounds good to me. For the "Optimistic" case, we need to yell 
loudly if we do find ourselves needing to drop segments. For the 
"Strict" case, we just need to be sure it works correctly ;-)

Joe

Re: Tablespaces

From

Thomas Swan

Date:

03 March 2004, 22:31:28

Bruce Momjian wrote:

>Greg Stark wrote:
>  
>
>>Bruce Momjian <pgman@candle.pha.pa.us> writes:
>>
>>    
>>
>>>Greg Stark wrote:
>>>      
>>>
>>>>>>I am expecting to hear some bleating about this from people whose
>>>>>>preferred platforms don't support symlinks ;-).  However, if we don't
>>>>>>            
>>>>>>
>>>>Well, one option would be to have the low level filesystem storage (md.c?)
>>>>routines implement a kind of symlink themselves. Just a file with a special
>>>>magic number followed by a path.
>>>>        
>>>>
>>On further contemplation it doesn't seem like using symlinks really ought to
>>be necessary. It should be possible to drive everything off the catalog tables
>>while avoidin having the low level filesystem code know anything about them.
>>
>>Instead of having the low level code fetch the pg_* records themselves, some
>>piece of higher level code would do the query and call down to storage layer
>>to inform it of the locations for everything. It would have to do this on
>>database initialization and on any subsequent object creation.
>>
>>Basically maintain an in-memory hash table of oid -> path, and call down to
>>the low level code whenever that hash changes. (Or more likely oid->ts_id and
>>a separate list of ts_id -> path.)
>>    
>>
>
>The advantage of symlinks is that an administrator could see how things
>are laid out from the command line.
>
>  
>
That's a poor reason to require symlinks.  The administrator can just as
easily open up psql and query pg_tablespace to see that same
information.  Besides, the postgres doesn't know where to look on the
filesystem for the /path/to/oid without a system catalog lookup.  There
doesn't seem to be any sensible reason to force a filesystem requirement
when the core operations are diffferent to begin with. 

If a more global view of all databases is necessary, perhaps there ought
to be a system wide view which could display all of that information at
once: dbname, relation name, and physical location.

Re: Tablespaces

From

Bruce Momjian

Date:

03 March 2004, 22:41:21

Thomas Swan wrote:
> >The advantage of symlinks is that an administrator could see how things
> >are laid out from the command line.
> >
> >  
> >
> That's a poor reason to require symlinks.  The administrator can just as
> easily open up psql and query pg_tablespace to see that same
> information.  Besides, the postgres doesn't know where to look on the
> filesystem for the /path/to/oid without a system catalog lookup.  There
> doesn't seem to be any sensible reason to force a filesystem requirement
> when the core operations are diffferent to begin with. 
> 
> If a more global view of all databases is necessary, perhaps there ought
> to be a system wide view which could display all of that information at
> once: dbname, relation name, and physical location.

Who doesn't have symlinks these days, and is going to be using
tablespaces?  Even Win32 has them.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073

Re: Tablespaces

From

Tom Lane

Date:

04 March 2004, 00:14:59

Thomas Swan <tswan@idigx.com> writes:
> Bruce Momjian wrote:
>> The advantage of symlinks is that an administrator could see how things
>> are laid out from the command line.
>> 
> That's a poor reason to require symlinks.  The administrator can just as
> easily open up psql and query pg_tablespace to see that same
> information.

Something to keep in mind here is that one of the times you would most
likely need that information is when the database is broken and you
*can't* simply "open up psql" and inspect system catalogs.  I like the
fact that a symlink implementation can be inspected without depending on
a working database.

If we were going to build a non-symlink implementation, I'd want the
highlevel-to-lowlevel data transfer to take the form of a flat ASCII
file that could be inspected by hand, rather than some hidden in-memory
datastructure.  But given the previous discussion in this thread,
I cannot see any strong reason not to rely on symlinks for the purpose.
We are not in the business of building replacements for OS features.
        regards, tom lane

Re: Tablespaces

From

Thomas Swan

Date:

04 March 2004, 02:39:56

Tom Lane wrote:

>Thomas Swan <tswan@idigx.com> writes:
>  
>
>>Bruce Momjian wrote:
>>    
>>
>>>The advantage of symlinks is that an administrator could see how things
>>>are laid out from the command line.
>>>
>>>      
>>>
>>That's a poor reason to require symlinks.  The administrator can just as
>>easily open up psql and query pg_tablespace to see that same
>>information.
>>    
>>
>
>Something to keep in mind here is that one of the times you would most
>likely need that information is when the database is broken and you
>*can't* simply "open up psql" and inspect system catalogs.  I like the
>fact that a symlink implementation can be inspected without depending on
>a working database.
>
>  
>
That's a sufficient argument, to allow for it.  Recoverability would be
one reason.

>If we were going to build a non-symlink implementation, I'd want the
>highlevel-to-lowlevel data transfer to take the form of a flat ASCII
>file that could be inspected by hand, rather than some hidden in-memory
>datastructure.  But given the previous discussion in this thread,
>I cannot see any strong reason not to rely on symlinks for the purpose.
>We are not in the business of building replacements for OS features.
>
>  
>
I do like the flat file output at least for a record of what went
where.  Regardless of whether or not symlinks are used, the admin would
need to know what directories/files/filesystems are to be backed up.

I am concerned as to what extent different filesystems do when you back
the directories up.    Would NTFS containing symlinks be able to be
backed up with a tar/zip command, or is something more elaborate needed? In the past, before upgrading, I have had to
tarthe pgdata directory

with the postmaster shutdown to insure a quick restoration of the
database in case an upgrade didn't proceed uneventfully.  Also, in the
event of a major version upgrade the restored information may or may not
proceed uneventfully.    I just wanted to point out something I thought
might be an issue further down the road.  

Perhaps the system catalog / flat file approach would be a more solid
approach, both of which would not involve replacing or duplicating OS
features.

Re: Out of space situation and WAL log pre-allocation (was

From

Shridhar Daithankar

Date:

04 March 2004, 03:09:43

Tom Lane wrote:
> I think also that Simon completely misunderstood my intent in saying
> that this could be "user-scriptable policy".  By that I meant that the
> *user* could write the code to behave whichever way he liked.  Not that
> we were going to go into a mad rush of feature invention and try to
> support every combination we could think of.  I repeat: code that pushes
> logs into a secondary area is not ours to write.  We should concentrate
> on providing an API that lets users write it.  We have only limited
> manpower for this project and we need to spend it on getting the core
> functionality done right, not on inventing frammishes.

Hmm... I totally agree. I think the backend could just offer a shared memory 
segment and a marker message to another process to allow copy from it. then it 
is the applications business to do the things.

Of course there has to be a two way agreement about it but an API is a real nice 
thing rather than an application.
 Shridhar

Re: Tablespaces

From

Marko Karppinen

Date:

07 March 2004, 20:07:57

On 3 March 2004, at 19:52, Bruce Momjian wrote:
> The advantage of symlinks is that an administrator could see how things
> are laid out from the command line.

One thing to keep in mind is that system administrators don't see
symlinks as being informational -- they see them as the actual UI
for the redirection in question. So their expectation is that they'll
be able to move the actual directory around at will (as long as they
update the symlink to match).

If symlinks are used, the rule of least surprise would mean that
no information whatsoever about the physical location of a tablespace
should be stored in the system catalogs. Otherwise their relationship
with the information stored in the symlink is ambiguous.

mk

Re: Tablespaces

From

Andrew Sullivan

Date:

08 March 2004, 09:32:51

On Mon, Mar 08, 2004 at 02:07:35AM +0200, Marko Karppinen wrote:
> One thing to keep in mind is that system administrators don't see
> symlinks as being informational -- they see them as the actual UI
> for the redirection in question. So their expectation is that they'll
> be able to move the actual directory around at will (as long as they
> update the symlink to match).

This is a good point.  It's worth keeping in mind, too, that in large
shops, the DBAs and the sysadmins often are in separate departments
with separate management, precisely because the database system has
traditionally been somewhat divorced from the OS (as an aside, I
suspect that this sort of separation is part of the reason for the
popularity of raw filesystems among DBAs.  Even if they didn't
provide better speed, it's just preferable not to have to involve
another department).  System administrators in such places have been
known to decide to "reorganise the disks", assuming that the database
just has its own home.  For such a sysadmin, a pile of symlinks would
be fair game for reorganisation.

A

-- 
Andrew Sullivan  | ajs@crankycanuck.ca

Re: Tablespaces

From

Andreas Pflug

Date:

08 March 2004, 09:56:26

Andrew Sullivan wrote:

>eorganise the disks", assuming that the database
>just has its own home.  For such a sysadmin, a pile of symlinks would
>be fair game for reorganisation.
>  
>

Please take into consideration that symlinks might be every day work for 
*nix admins, but for win admins it's very uncommon. Additionally, win 
admins are accustomed to gui tools, and many of them will stumble if 
forced to use a command line. For worse, junctions are not 
distinguishable in explorer; only the shell's dir command knows about 
junctions. This obfuscation makes junctions quite a bad choice for admin 
purposes.

IMHO there are only two viable options:
- no tablespaces for win32, i.e. recommend *ix for bigger installations
- a different tablespace storage approach., e.g. simple desktop links 
(or alike) redirecting to a directory.

Regards,
Andreas

Re: Tablespaces

From

Bruce Momjian

Date:

08 March 2004, 13:30:55

Andrew Sullivan wrote:
> On Mon, Mar 08, 2004 at 02:07:35AM +0200, Marko Karppinen wrote:
> > One thing to keep in mind is that system administrators don't see
> > symlinks as being informational -- they see them as the actual UI
> > for the redirection in question. So their expectation is that they'll
> > be able to move the actual directory around at will (as long as they
> > update the symlink to match).
> 
> This is a good point.  It's worth keeping in mind, too, that in large
> shops, the DBAs and the sysadmins often are in separate departments
> with separate management, precisely because the database system has
> traditionally been somewhat divorced from the OS (as an aside, I
> suspect that this sort of separation is part of the reason for the
> popularity of raw filesystems among DBAs.  Even if they didn't
> provide better speed, it's just preferable not to have to involve
> another department).  System administrators in such places have been
> known to decide to "reorganise the disks", assuming that the database
> just has its own home.  For such a sysadmin, a pile of symlinks would
> be fair game for reorganisation.

Agreed.  I think the idea is to use lstat to query the symlink, rather
than storing that information in the database.  My idea was to create an
lstat server-side function that could be used by pg_dump and friends.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073

Re: Out of space situation and WAL log pre-allocation (was Tablespaces)

From

"Simon Riggs"

Date:

08 March 2004, 19:28:43

>Joe Conway [mailto:mail@joeconway.com]
> Simon Riggs wrote:
> >> Tom Lane [mailto:tgl@sss.pgh.pa.us] That should be user-scriptable
> >> policy, in my worldview.
> 
> > O... and other dbms will freeze when this situation is hit, rather
> > than continue and drop archive logs.]
> 
> Been there, done that, don't see how it's any better. I hesitate to be
> real specific here, but let's just say the end result was restore from
> backup :-(
> 
> > So, if we had a parameter called Wal_archive_policy that has 3
> > settings: None = no archiving Optimistic = archive, but if for some
> > reason log space runs out then make space by dropping the oldest
> > archive logs Strict = if log space runs out, stop further write
> > transactions from committing, by whatever means, even if this takes
> > down dbms.
> 
> That sounds good to me. For the "Optimistic" case, we need to yell
> loudly if we do find ourselves needing to drop segments. For the
> "Strict" case, we just need to be sure it works correctly ;-)

Good.

Yell loudly really needs to happen sometime earlier, which is as Gavin
originally thought something to do with tablespaces.

Strict behaviour is fairly straightforward, you just PANIC!

I'd think we could rename these to
Fail Operational rather than Optimistic
Fail Safe rather than Strict
...the other names were a bit like "I'm right" and "but I'll do yours
too" ;}

Best Regards, Simon Riggs

Re: Out of space situation and WAL log pre-allocation (was

From

"Simon Riggs"

Date:

08 March 2004, 19:28:45

>Bruce Momjian
> Simon Riggs wrote:
> > User-selectable behaviour? OK. That's how we deal with fsync; I can
> > relate to that. That hadn't been part of my thinking because of the
> > importance I'd attached to the log files themselves, but I can go
with
> > that, if that's what was meant.
> >
> > So, if we had a parameter called Wal_archive_policy that has 3
settings:
> > None = no archiving
> > Optimistic = archive, but if for some reason log space runs out then
> > make space by dropping the oldest archive logs
> > Strict = if log space runs out, stop further write transactions from
> > committing, by whatever means, even if this takes down dbms.
> >
> > That way, we've got something akin to transaction isolation level
with
> > various levels of protection.
> 
> Yep, we will definately need something like that.  Basically whenever
> the logs are being archived, you have to stop the database if you
can't
> archive, no?

That certainly was my initial feeling, though I believe it is possible
to accommodate both viewpoints. I would not want to have only the
alternative viewpoint, I must confess.

Best Regards, Simon Riggs

Re: Out of space situation and WAL log pre-allocation (was Tablespaces)

From

"Simon Riggs"

Date:

08 March 2004, 19:28:58

Please excuse the delay in replying..

>Tom Lane
> Joe Conway <mail@joeconway.com> writes:
> > Simon Riggs wrote:
> >> O... and other dbms will freeze when this situation is hit, rather
> >> than continue and drop archive logs.]
> 
> > Been there, done that, don't see how it's any better. I hesitate to
be
> > real specific here, but let's just say the end result was restore
from
> > backup :-(

Myself also. I accept your experience and insight, I apologise if my own
seemed overblown. My take on that is that if you're in a situation that
has a high probability of going bad, the last thing you would want is to
drop xlogs. Same technical experience, different viewpoint on what to
learn from it. 

> It's hard for me to imagine a situation in which killing the database
> would be considered a more attractive option than dropping old log
> data.  You may or may not ever need the old log data, but you darn
well
> do need a functioning database.  (If you don't, you wouldn't be going
to
> all this work.)

The main point here for me is that the choice of keeping archived (not
old) log files against keeping the database up isn't actually mine to
make; that choice belongs to the owner of the database, not me as
developer or administrator, consultant or whatever. 

Although I admit I did not at first comprehend that such a view was
possible, I did flex to allow yours and Joe's perspective when that was
voiced.

The point is one of risk: does the owner wish to risk the possibility
that a transaction may be lost in order to keep the database up? The
possibility of lost rows must be balanced against the probably higher
possibility of being unable to write new data. But which is worse? Who
can say?

In some environments where I have worked, (again forgive any seeming
personal arrogance or posturing), such as banks or finance generally, it
has been desirable to stop the system rather than risk losing even a
single row. In other situations, lost rows must be balanced against the
money lost through downtime. Guess it depends whether you've got a
contract for uptime or for data integrity?? ;)

> I repeat: code that pushes
> logs into a secondary area is not ours to write.  We should
concentrate
> on providing an API that lets users write it.  

Agreed.

> We have only limited
> manpower for this project and we need to spend it on getting the core
> functionality done right, not on inventing frammishes.

Love that word "frammish"...seriously, I understand and agree.

My understanding is that existing logic will cause a PANIC if the xlog
directory cannot be written to. Helping the database stay up by dropping
logs would require extra code...

This was an edge case anyhow...

Best Regards, Simon Riggs

Re: Out of space situation and WAL log pre-allocation (was

From

Bruce Momjian

Date:

08 March 2004, 19:51:12

Simon Riggs wrote:
> >Bruce Momjian
> > Simon Riggs wrote:
> > > User-selectable behaviour? OK. That's how we deal with fsync; I can
> > > relate to that. That hadn't been part of my thinking because of the
> > > importance I'd attached to the log files themselves, but I can go
> with
> > > that, if that's what was meant.
> > >
> > > So, if we had a parameter called Wal_archive_policy that has 3
> settings:
> > > None = no archiving
> > > Optimistic = archive, but if for some reason log space runs out then
> > > make space by dropping the oldest archive logs
> > > Strict = if log space runs out, stop further write transactions from
> > > committing, by whatever means, even if this takes down dbms.
> > >
> > > That way, we've got something akin to transaction isolation level
> with
> > > various levels of protection.
> > 
> > Yep, we will definately need something like that.  Basically whenever
> > the logs are being archived, you have to stop the database if you
> can't
> > archive, no?
> 
> That certainly was my initial feeling, though I believe it is possible
> to accommodate both viewpoints. I would not want to have only the
> alternative viewpoint, I must confess.
> 

Added to PITR TODO list.  Anything else to add:
http://momjian.postgresql.org/main/writings/pgsql/project


--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073

Re: Out of space situation and WAL log pre-allocation (was Tablespaces)

From

Greg Stark

Date:

09 March 2004, 23:28:14

"Simon Riggs" <simon@2ndquadrant.com> writes:

> Strict behaviour is fairly straightforward, you just PANIC!

There is another mode possible as well. Oracle for example neither panics nor
continues, it just freezes. It keeps retrying the transaction until it finds
it has space.  

The sysadmin or dba just has to somehow create additional space by removing
old files or however and the database will continue where it left off. That
seems a bit nicer than panicing.

When I first heard that I was shocked. It means implementing archive logs
*created* a new failure mode where there was none before. I thought that was
the dumbest idea in the world: who needed a backup process that increased the
chances of an outage? Now I can see the logic, but I'm still not sure which
mode I would pick if it was up to me. As others have said, I guess it would
depend on the situation.

-- 
greg

Re: Tablespaces

From

Greg Stark

Date:

09 June 2004, 13:28:49

Dennis Bjorklund <db@zigo.dhs.org> writes:

> On Thu, 26 Feb 2004, Gavin Sherry wrote:
> 
> > Comments? Questions? Suggestions?
> 
> Is that plan that in the future one can split a single table into 
> different table spaces? Like storing all rows with year < 1999 in one 
> tablespace and the rest in another?

That's a separate orthogonal feature called "partitioned tables". There's some
amount of resistance to the idea amongst postgres people, and there's nobody
who has spoken up interested in implementing it, but there's also lots of
interest from users. A good patch would probably go a long way to convincing
people :)

Table spaces are being able to store different tables in different physical
locations on disk. A first version of this has actually been implemented for
7.5 using symlinks.

Partitioned tables and tablespaces do indeed have a certain amount of synergy.
But even in a single tablespace your example makes sense.

> With the rule system and two underlying tables one could make it work by 
> hand I think.

The rule system could be used to do this, but there was some discussion of
using inherited tables to handle it. However neither handles the really hard
part of detecting queries that use only a part of the table and taking that
into account in generating the plan.

-- 
greg

Re: Tablespaces

From

"Zeugswetter Andreas SB SD"

Date:

11 June 2004, 07:02:54

> > With the rule system and two underlying tables one could make it work by
> > hand I think.
>
> The rule system could be used to do this, but there was some discussion of
> using inherited tables to handle it. However neither handles the really hard
> part of detecting queries that use only a part of the table and taking that
> into account in generating the plan.

I think the consensus should be to add smarts to the planner to include
static constraint information to reduce table access.

e.g if you have a constraint "acol integer, check acol < 5"
and you have a query with a "where acol = 10" you could reduce that
to "where false". This would help in all sorts of situations not only
partitioned/inherited tables. I am not sure what the runtime cost of
such an inclusion would be, so maybe it needs smarts to only try in certain
cases ?

Andreas

Re: Tablespaces

From

Greg Stark

Date:

11 June 2004, 12:33:41

"Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes:

> e.g if you have a constraint "acol integer, check acol < 5"
> and you have a query with a "where acol = 10" you could reduce that
> to "where false". 

I think part of the question is how much work do you put into checking this.
Checking constant known values like above is probably not too expensive.
Checking for ranges like "where acol between 5 and 10" is probably doable. And
that might be enough for partitioned tables. I think that's about all Oracle
bothers to check, for example. More complex where clauses and check
expressions might be hard to prove are true or false.

But then the work's still not done, you still have to add an optimization that
prunes members of a UNION ALL (or equivalent if it's done using inherited
tables or some other infrastructure) if they are known to provably produce
zero rows.

And then there are more subtle cases. Like if the query is "where acol = ?".
Then you know it only has to read one partition, but you don't know which one
at compile time. And it's important to handle that case because that might be
the only clause. So knowing that you only need one partition might be the
difference between a sequential scan of one partition, or an index scan of
many thousands of records because they're only a small percentage of the
entire table.

-- 
greg

Re: Tablespaces

From

pgsql@mohawksoft.com

Date:

11 June 2004, 13:10:24

> I don't think we want features for their own sake, though, and I'm
> not convinced that raw filesystems are actually useful.  Course, it's
> not my itch, and PostgreSQL _is_ free software.
>

I agree that raw file systems are seldom useful with one caveat, more
advanced file systems are sometimes detrimental to database access.

Conceptually, a file system and a database are redundant, both are doing
their best to preserve data integrity. This is especially true with
journalling file systems. Not to mention technologies like reiserfs which
attempts to do sub-block allocation.

What I think would go a long way to improving database performance on
non-raw partitions would be a simplified file system -- SFS anyone? The
simplified file system would not track access time. It would not overly
try to manage disk space. The target applications are going to allocate
disk space on a block level, rather than quibble about 4K here or 8K here,
have a user defined standard allocation unit of 64K, 128K, or so on.
Reduction on allocation overhead also reduces meta-data updating I/O. I
can almost imagine 32BIT FAT with large clusers, only with real inodes.
The idea would be that a database, like PostgreSQL, would be managing the
data not the file system. The file systems job would only to be the most
minimalist interface to the OS.

The benefts would be awesome, near-raw partition access and standard OS
tools for maintainence.