Multiple Storage per Tablespace, or Volumes - Mailing list pgsql-hackers

From Dimitri Fontaine
Subject Multiple Storage per Tablespace, or Volumes
Msg-id 200702191125.45772.dim@dalibo.com
Responses Re: Multiple Storage per Tablespace, or Volumes  (Martijn van Oosterhout <kleptog@svana.org>)
List pgsql-hackers
Hi list,

Here's a proposal for an idea which stole a good part of my night.
I'll present the idea first, then 2 use cases which give some rationale and a
few details. Please note I won't be able to participate in any development
effort associated with this idea, should such a thing happen!

The bare idea is to provide a way to 'attach' multiple storage facilities (say
volumes) to a given tablespace. Each volume may be attached in READ ONLY,
READ WRITE or WRITE ONLY mode.
You can mix RW and WO volumes in the same tablespace, but can't have RO with
any W form, or so I think.
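
To make the mixing rule concrete, here is a minimal Python sketch. It is not
PostgreSQL code; the names Mode and can_attach are made up for illustration,
and the rule encoded is only the tentative one stated above (RO volumes cannot
share a tablespace with RW or WO volumes).

```python
from enum import Enum


class Mode(Enum):
    """Hypothetical attach modes for a tablespace volume."""
    RO = "READ ONLY"
    RW = "READ WRITE"
    WO = "WRITE ONLY"


def can_attach(existing, new):
    """Return True if a volume in mode `new` may join volumes in `existing` modes.

    Assumed rule: RW and WO may be mixed freely, but a READ ONLY volume
    cannot coexist with any writable (RW or WO) volume.
    """
    modes = set(existing) | {new}
    has_read_only = Mode.RO in modes
    has_writer = bool(modes & {Mode.RW, Mode.WO})
    return not (has_read_only and has_writer)
```

With this rule, can_attach([Mode.RW], Mode.WO) is accepted while
can_attach([Mode.RO], Mode.RW) is rejected.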

It would be pretty handy to be able to add and remove volumes on a live
cluster, and this could be a way to implement moving/extending tablespaces.


Use Case A: better read performance while keeping write reliability

The first application of this multiple volumes per tablespace idea is to keep
a tablespace both in RAM (tmpfs or ramfs) and on disk (both RW).

Then PG should be able to read from both volumes when dealing with read
queries, and would have to fwrite()/fsync() both volumes for each write.
Of course, write speed will be constrained by the slowest volume, but the
quicker one would then be able to take on some of the read queries in the
meantime.
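
The write-everywhere, read-anywhere behaviour can be sketched at file level
in Python. This is only an illustration under simplifying assumptions: each
volume is a plain directory, and write_all and read_any are hypothetical
helpers, not anything PostgreSQL provides.

```python
import os
import tempfile


def write_all(volume_dirs, relpath, data):
    """Write `data` to the same relative path on every volume and fsync each copy.

    The call returns only after the slowest volume has finished, which is why
    write speed is bounded by that volume.
    """
    for vol in volume_dirs:
        fd = os.open(os.path.join(vol, relpath),
                     os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
        try:
            view = memoryview(data)
            while len(view) > 0:          # os.write may write only part of the buffer
                written = os.write(fd, view)
                view = view[written:]
            os.fsync(fd)                  # force this copy to stable storage
        finally:
            os.close(fd)


def read_any(volume_dirs, relpath):
    """Read from the first listed volume; callers list the fastest volume first."""
    with open(os.path.join(volume_dirs[0], relpath), "rb") as f:
        return f.read()


if __name__ == "__main__":
    # Two temporary directories stand in for the RAM and disk volumes.
    with tempfile.TemporaryDirectory() as ram, tempfile.TemporaryDirectory() as disk:
        write_all([ram, disk], "page0", b"hello")
        print(read_any([ram, disk], "page0"))
```

Since every volume holds the same bytes after write_all returns, a reader is
free to pick whichever volume is least busy.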

It would be neat if PG were able to measure the volumes' relative write speed
in order to assign weights to each of the tablespace's volumes, and have the
planner or executor spread read queries among volumes depending on that.
For example, if a single query has a plan containing several full scans (of
indexes and/or tables) in the same tablespace, those could be done on
different volumes.
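
As a rough illustration of the weighting idea, the Python sketch below turns
measured speeds into weights and hands each scan to the volume that would be
least loaded relative to its weight. The function names, the speed figures and
the scan costs are all invented for the example; a real planner would use its
own cost model.

```python
def volume_weights(speeds):
    """Normalize per-volume speeds (any consistent unit) into weights summing to 1."""
    total = sum(speeds.values())
    return {name: speed / total for name, speed in speeds.items()}


def assign_scans(scans, speeds):
    """Greedily map each scan to a volume, largest scan first.

    `scans` maps a scan name to an estimated cost; `speeds` maps a volume
    name to its measured speed. Each scan goes to the volume whose load,
    divided by its weight, would be lowest after taking the scan.
    """
    weights = volume_weights(speeds)
    load = {name: 0.0 for name in speeds}
    plan = {}
    for scan, cost in sorted(scans.items(), key=lambda item: -item[1]):
        best = min(load, key=lambda vol: (load[vol] + cost) / weights[vol])
        plan[scan] = best
        load[best] += cost
    return plan
```

For instance, with speeds {"ram": 3.0, "disk": 1.0} and scans
{"table_scan": 30, "index_scan": 10}, the large table scan lands on the RAM
volume and the smaller index scan on the disk volume, so the two scans of one
query run on different volumes.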

Use Case B: Synchronous Master Slave(s) Replication

By using a Distributed File System capable of being mounted from several nodes
at the same time, we could have a configuration where a master node has
('exports') a WO tablespace volume, and one or more slaves (depending on FS
capability) configure a RO tablespace volume.

PG then has to be able to cope with a RO volume: the data are not written by
PG itself (local node point of view), so some limitations would certainly
apply.
Will it be possible, for example, to add indexes to data on slaves?
I'd use the solution even without this, though...

When the master/slave link is broken, the master can no longer write to the
tablespace, as if it were a local disk failure of some sort, so this should
prevent nasty desync problems: data is written on all W volumes or data is
not written at all.
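
One way to picture "all W volumes or nothing" is a stage-then-publish write,
sketched below in Python. This is an assumption-laden toy: volumes are plain
directories, write_all_or_nothing is a hypothetical helper, and the final
rename step is not atomic across volumes, so a real implementation would need
something stronger (WAL, a commit protocol) to cover a failure at that point.

```python
import os
import tempfile


def write_all_or_nothing(volume_dirs, relpath, data):
    """Stage `data` on every volume, then publish it; undo staging if any volume fails."""
    staged = []
    try:
        for vol in volume_dirs:
            tmp = os.path.join(vol, relpath + ".tmp")
            staged.append((tmp, os.path.join(vol, relpath)))
            with open(tmp, "wb") as f:
                f.write(data)
                f.flush()
                os.fsync(f.fileno())      # staged copy is durable before publishing
    except OSError:
        # One volume is unreachable or failed: discard every staged copy.
        for tmp, _ in staged:
            try:
                os.remove(tmp)
            except OSError:
                pass
        raise
    # Every volume accepted the data, so make it visible everywhere.
    for tmp, final in staged:
        os.replace(tmp, final)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as a, tempfile.TemporaryDirectory() as b:
        write_all_or_nothing([a, b], "page0", b"hello")
        print(sorted(os.listdir(a)), sorted(os.listdir(b)))
```

If the directory standing in for the remote volume is missing, the call raises
OSError and the local volume is left untouched, which mirrors the broken
master/slave link described above.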


I realize this proposal is only a first draft of work to be done, and that I
won't be able to do much more than draft this idea. This mail is sent
to the hackers list in the hope someone there will find it worth
considering and polishing...

Regards, and thanks for the good work ;)
--
Dimitri Fontaine
