Thread: What needs to be done for real Partitioning?

What needs to be done for real Partitioning?

From
Josh Berkus
Date:
Folks,

I may (or may not) soon have funding for implementing full table partitioning
in PostgreSQL.  I thought it would be a good idea to discuss with people here
who are already using pseudo-partitioning what things need to be added to
Postgresql in order to make full paritioning a reality; that is, what do
other databases do that we don't?

Implementations are seperated into phases I and II, II being
harder-and-optional-stuff that may get done later,  I being essential
features.

Ph. I
-- CREATE TABLE ... WITH PARTITION ON {expression}
    ---- should automatically create expression index on {expression}
-- INSERT INTO should automatically create new partitions where necessary
    ---- new tables should automatically inherit all constraints, indexes,
            keys of "parent" table
-- UPDATE should automatically move rows between partitions where applicable
-- Query Planner/Executor should be improved to not always materialize
paritioned tables used in subqueries and joins.

Ph. II
-- Foreign Keys to/from partitioned tables should become possible
-- Query Planner/Executor should be improved to only join partitions which are
compliant with the query's WHERE or JOIN clauses where reasonable
-- DELETE FROM should automatically drop empty partitions
-- setting of WITH PARTITION ON {expression} TABLESPACE should automatically
create a new tablespace for each new partition and its indexes.
-- It should be possible to create new, empty partitions via a CREATE TABLE
PARTITION OF {table} ON {value} expression.

All syntax above is, of course, highly debatable.

--
Josh Berkus
Aglio Database Solutions
San Francisco

Re: What needs to be done for real Partitioning?

From
Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
> -- CREATE TABLE ... WITH PARTITION ON {expression}

I'd rather see the partition control stuff as ALTER TABLE commands,
not decoration on CREATE TABLE.  See the WITH OIDS business we just went
through: adding nonstandard decoration to a standard command isn't good.

> -- INSERT INTO should automatically create new partitions where necessary
> -- DELETE FROM should automatically drop empty partitions

I am not sure I agree with either of those, and the reason is that they
would turn low-lock operations into high-lock operations.  DELETE FROM
would be particularly bad.  Furthermore, who wants to implement DROP
PARTITION as a DELETE FROM?  ISTM the whole point of partitioning is to
be able to load and unload whole partitions quickly, and having to
DELETE all the rows in a partition isn't my idea of quick.

> -- setting of WITH PARTITION ON {expression} TABLESPACE should automatically
> create a new tablespace for each new partition and its indexes.

This is a bad idea.  Where are you going to create these automatic
tablespaces?  What will they be named?  Won't this require superuser
privileges?  And what's the point anyway?

> -- It should be possible to create new, empty partitions via a CREATE TABLE
> PARTITION OF {table} ON {value} expression.

Huh?  ISTM this confuses establishment of a table's partition rule with
the act of pre-creating empty partitions for not-yet-used ranges of
partition keys.  Or are you trying to suggest that a table could be
partitioned more than one way at a time?  If so, how?

            regards, tom lane

Re: What needs to be done for real Partitioning?

From
PFC
Date:
    This is really great !

    Think about altering the partitioning (this is quite complex) : imagine a
table split in several partitions "archive" and "current" where a row is
moved from current to archive when it will not be updated anymore.
Sometimes you can partition on a simple numeric value, or even a boolean
value in this case. Other times you'd have to partition on a date,
(current month, current year, archive...). So, how to move the partition
between the two tables so that the oldest rows in the current month table
are moved to the current year table at the end of each month ?

    Some ideas :
    hidden field (like oid was) to indicate in which partition the tuple is ?


On Sat, 19 Mar 2005 21:02:38 +0100, Josh Berkus <josh@agliodbs.com> wrote:

> Folks,
>
> I may (or may not) soon have funding for implementing full table
> partitioning
> in PostgreSQL.  I thought it would be a good idea to discuss with people
> here
> who are already using pseudo-partitioning what things need to be added to
> Postgresql in order to make full paritioning a reality; that is, what do
> other databases do that we don't?
>
> Implementations are seperated into phases I and II, II being
> harder-and-optional-stuff that may get done later,  I being essential
> features.
>
> Ph. I
> -- CREATE TABLE ... WITH PARTITION ON {expression}
>     ---- should automatically create expression index on {expression}
> -- INSERT INTO should automatically create new partitions where necessary
>     ---- new tables should automatically inherit all constraints,
> indexes,
>             keys of "parent" table
> -- UPDATE should automatically move rows between partitions where
> applicable
> -- Query Planner/Executor should be improved to not always materialize
> paritioned tables used in subqueries and joins.
>
> Ph. II
> -- Foreign Keys to/from partitioned tables should become possible
> -- Query Planner/Executor should be improved to only join partitions
> which are
> compliant with the query's WHERE or JOIN clauses where reasonable
> -- DELETE FROM should automatically drop empty partitions
> -- setting of WITH PARTITION ON {expression} TABLESPACE should
> automatically
> create a new tablespace for each new partition and its indexes.
> -- It should be possible to create new, empty partitions via a CREATE
> TABLE
> PARTITION OF {table} ON {value} expression.
>
> All syntax above is, of course, highly debatable.
>



Re: What needs to be done for real Partitioning?

From
"Stacy White"
Date:
From: "Tom Lane" <tgl@sss.pgh.pa.us>
> Josh Berkus <josh@agliodbs.com> writes:
> > -- INSERT INTO should automatically create new partitions where
necessary
> > -- DELETE FROM should automatically drop empty partitions
>
> I am not sure I agree with either of those, and the reason is that they
> would turn low-lock operations into high-lock operations.

I second this.  We're current using an inheritance based partitioning scheme
with automatic partition creation in the application code, and have seen at
least one case of deadlock due to partition creation.

Other phase II/III items might include:

- Modify the partitioning scheme of a table.  In the above example, adding a
'200504' partition, and moving the '200502' orders into 'ARCHIVE'

- The ability to place a partition in a tablespace.  In the example above,
it would be nice to put the 'ARCHIVE' partition would likely be placed on a
slower set of disks than the most recent month's partition.

- Global indexes (that is to say, an index spanning the the table rather
than an individual partition).  This seems counterintuitive, but they've
dramatically increased performance on one of our Oracle systems and should
at least be worth considering.


Re: What needs to be done for real Partitioning?

From
Alvaro Herrera
Date:
On Sat, Mar 19, 2005 at 11:24:39PM +0100, PFC wrote:

>     Some ideas :
>     hidden field (like oid was) to indicate in which partition the tuple
>     is ?

I think that to make partitioning really possible we need to have
multi-relfilenode tables.

We probably also need multi-table indexes.  Implementing these would be
good for inheritance too.

--
Alvaro Herrera (<alvherre[@]dcc.uchile.cl>)
Dios hizo a Adán, pero fue Eva quien lo hizo hombre.

Re: What needs to be done for real Partitioning?

From
Tom Lane
Date:
PFC <lists@boutiquenumerique.com> writes:
>     Some ideas :
>     hidden field (like oid was) to indicate in which partition the tuple is ?

tableoid would accomplish that already, assuming that the "partitioned
table" is effectively a view on separate physical tables.

            regards, tom lane

Re: What needs to be done for real Partitioning?

From
Josh Berkus
Date:
Tom, Stacy, Alvaro,

> I'd rather see the partition control stuff as ALTER TABLE commands,
> not decoration on CREATE TABLE.  See the WITH OIDS business we just went
> through: adding nonstandard decoration to a standard command isn't good.

OK, sure.

> > -- INSERT INTO should automatically create new partitions where necessary
> > -- DELETE FROM should automatically drop empty partitions
>
> I am not sure I agree with either of those, and the reason is that they
> would turn low-lock operations into high-lock operations.

For INSERT, I think that's a problem we need to work through.   Partitioning
on any scheme where you have to depend on the middleware to create new
partitions could never be more than a halfway implementation.  For one thing,
if we can't have 100% dependence on the idea that Table M, Partition 34
contains index values Y-Z, then that form of advanced query rewriting (which
is a huge performance gain on really large tables) becomes inaccessable.

Or are you proposing, instead, that attempts to insert beyond the range raise
an error?

> DELETE FROM
> would be particularly bad.  Furthermore, who wants to implement DROP
> PARTITION as a DELETE FROM?  ISTM the whole point of partitioning is to
> be able to load and unload whole partitions quickly, and having to
> DELETE all the rows in a partition isn't my idea of quick.

I mostly threw DELETE in for obvious symmetry.   If it's complicated, we can
drop it.

And you're right, I forgot DROP PARTITION.

> This is a bad idea.  Where are you going to create these automatic
> tablespaces?  What will they be named?  Won't this require superuser
> privileges?  And what's the point anyway?

Stacy White suggests the more sensible version of this:
ALTER TABLE {table} CREATE PARTITION WITH VALUE {value} ON TABLESPACE
{tablespacename}.   Manually creating the partitions in the appropriate
location probably makes the most sense.

The point, btw, is that if you have a 2TB table, you probably want to put its
partitions on several seperate disk arrays.

> Huh?  ISTM this confuses establishment of a table's partition rule with
> the act of pre-creating empty partitions for not-yet-used ranges of
> partition keys.

I don't understand why this would be confusing.   If INSERT isn't creating
partitions on new value breakpoint, then CREATE PARTITION needs to.

> Or are you trying to suggest that a table could be
> partitioned more than one way at a time?  If so, how?

No.

> - Modify the partitioning scheme of a table.  In the above example, adding
> a '200504' partition, and moving the '200502' orders into 'ARCHIVE'

Hmmm ... I don't see the point in automating this.   Can you explain?

> - Global indexes (that is to say, an index spanning the the table rather
> than an individual partition).  This seems counterintuitive, but they've
> dramatically increased performance on one of our Oracle systems and should
> at least be worth considering.

Hmmm, again can you detail this?   Maybe some performance examples?   It seems
to me that global indexes might interfere with the maintenance advantages of
partitioning.

> We probably also need multi-table indexes.  Implementing these would be
> good for inheritance too.

They would be nice, but I don't see them as a requirement for making
partitioning work.

--
Josh Berkus
Aglio Database Solutions
San Francisco

Re: What needs to be done for real Partitioning?

From
Steve Atkins
Date:
On Sat, Mar 19, 2005 at 12:02:38PM -0800, Josh Berkus wrote:

> Folks,
>
> I may (or may not) soon have funding for implementing full table partitioning
> in PostgreSQL.  I thought it would be a good idea to discuss with people here
> who are already using pseudo-partitioning what things need to be added to
> Postgresql in order to make full paritioning a reality; that is, what do
> other databases do that we don't?
>
> Implementations are seperated into phases I and II, II being
> harder-and-optional-stuff that may get done later,  I being essential
> features.
>
> Ph. I
> -- CREATE TABLE ... WITH PARTITION ON {expression}
>     ---- should automatically create expression index on {expression}

ALTER TABLE might be cleaner, perhaps?

> -- INSERT INTO should automatically create new partitions where necessary
>     ---- new tables should automatically inherit all constraints, indexes,
>             keys of "parent" table
> -- UPDATE should automatically move rows between partitions where applicable
> -- Query Planner/Executor should be improved to not always materialize
> paritioned tables used in subqueries and joins.

Would the SELECT also look at the parent table, if it weren't empty? I can
think of cases where that'd be useful, especially if an existing table
can be partitioned with an ALTER TABLE.

This covers almost everything I'd want from table partitioning in the
short term.

> Ph. II
> -- Foreign Keys to/from partitioned tables should become possible
> -- Query Planner/Executor should be improved to only join partitions which are
> compliant with the query's WHERE or JOIN clauses where reasonable
> -- DELETE FROM should automatically drop empty partitions
> -- setting of WITH PARTITION ON {expression} TABLESPACE should automatically
> create a new tablespace for each new partition and its indexes.
> -- It should be possible to create new, empty partitions via a CREATE TABLE
> PARTITION OF {table} ON {value} expression.
>
> All syntax above is, of course, highly debatable.

Multi-table indexes would be nice too, though that leads to some problems
when a partition is truncated or dropped, I guess.

Cheers,
  Steve

Re: What needs to be done for real Partitioning?

From
PFC
Date:
> tableoid would accomplish that already, assuming that the "partitioned
> table" is effectively a view on separate physical tables.
>
>             regards, tom lane

    Very good.

    Also note the possibility to mark a partition READ ONLY. Or even a table.
    It does not seem very useful but just think that for instance the "1999",
"2000" ... "2004" partitions of a big archive probably never change.
READLONY means we're sure they never change, thus no need to backup them
every time. Keeping the example of some DB arranged by years / current
year / current month, Just backup the "current month" part every day and
the "current year" every month when you switch partitions.
    This could be achieved also by storing the time of last modification of a
table somewhere.


Re: What needs to be done for real Partitioning?

From
Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
>>> -- INSERT INTO should automatically create new partitions where necessary
>>> -- DELETE FROM should automatically drop empty partitions
>>
>> I am not sure I agree with either of those, and the reason is that they
>> would turn low-lock operations into high-lock operations.

> For INSERT, I think that's a problem we need to work through.

Possibly, but I'm concerned about locking and deadlock issues.  The
reason that this is iffy is you would start the operation with only
an INSERT-grade lock, and then discover that you needed to add a
partition, which is surely something that needs an exclusive-grade
lock (consider two sessions trying to add the same partition at the
same time).  So I don't see how to do it without lock upgrading,
and lock upgrading is always a recipe for deadlocks.

The DELETE case is even worse because you can't physically release
storage until you're sure nothing in it is needed anymore by any open
transaction --- that introduces VACUUM-like issues as well as the
deadlock problem.

> Or are you proposing, instead, that attempts to insert beyond the
> range raise an error?

That was what I had in mind --- then adding partitions would require
a manual operation.  This would certainly be good enough for "phase I"
IMHO.

            regards, tom lane

Re: What needs to be done for real Partitioning?

From
Tom Lane
Date:
Alvaro Herrera <alvherre@dcc.uchile.cl> writes:
> We probably also need multi-table indexes.

As Josh says, that seems antithetical to the main point of partitioning,
which is to be able to rapidly remove (and add) partitions of a table.
If you have to do index cleaning before you can drop a partition, what's
the point of partitioning?

            regards, tom lane

Re: What needs to be done for real Partitioning?

From
"Steinar H. Gunderson"
Date:
On Sat, Mar 19, 2005 at 07:03:19PM -0500, Tom Lane wrote:
> Possibly, but I'm concerned about locking and deadlock issues.  The
> reason that this is iffy is you would start the operation with only
> an INSERT-grade lock, and then discover that you needed to add a
> partition, which is surely something that needs an exclusive-grade
> lock (consider two sessions trying to add the same partition at the
> same time).  So I don't see how to do it without lock upgrading,
> and lock upgrading is always a recipe for deadlocks.

What about letting something periodical (say, vacuum) do this?

/* Steinar */
--
Homepage: http://www.sesse.net/

Re: What needs to be done for real Partitioning?

From
Alvaro Herrera
Date:
On Sat, Mar 19, 2005 at 07:05:53PM -0500, Tom Lane wrote:
> Alvaro Herrera <alvherre@dcc.uchile.cl> writes:
> > We probably also need multi-table indexes.
>
> As Josh says, that seems antithetical to the main point of partitioning,
> which is to be able to rapidly remove (and add) partitions of a table.
> If you have to do index cleaning before you can drop a partition, what's
> the point of partitioning?

Hmm.  You are right, but without that we won't be able to enforce
uniqueness on the partitioned table (we could only enforce it on each
partition, which would mean we can't partition on anything else than
primary keys if the tables have one).  IMHO this is something to
consider.

--
Alvaro Herrera (<alvherre[@]dcc.uchile.cl>)
"El hombre nunca sabe de lo que es capaz hasta que lo intenta" (C. Dickens)

Re: What needs to be done for real Partitioning?

From
Rod Taylor
Date:
On Sun, 2005-03-20 at 00:29 -0400, Alvaro Herrera wrote:
> On Sat, Mar 19, 2005 at 07:05:53PM -0500, Tom Lane wrote:
> > Alvaro Herrera <alvherre@dcc.uchile.cl> writes:
> > > We probably also need multi-table indexes.
> >
> > As Josh says, that seems antithetical to the main point of partitioning,
> > which is to be able to rapidly remove (and add) partitions of a table.
> > If you have to do index cleaning before you can drop a partition, what's
> > the point of partitioning?
>
> Hmm.  You are right, but without that we won't be able to enforce
> uniqueness on the partitioned table (we could only enforce it on each
> partition, which would mean we can't partition on anything else than
> primary keys if the tables have one).  IMHO this is something to
> consider.

Could uniqueness across partitions be checked for using a mechanism
similar to what a deferred unique constraint would use (trigger / index
combination)?


Re: What needs to be done for real Partitioning?

From
Tom Lane
Date:
Alvaro Herrera <alvherre@dcc.uchile.cl> writes:
> Hmm.  You are right, but without that we won't be able to enforce
> uniqueness on the partitioned table (we could only enforce it on each
> partition, which would mean we can't partition on anything else than
> primary keys if the tables have one).  IMHO this is something to
> consider.

Well, partitioning on the primary key would be Good Enough for 95% or
99% of the real problems out there.  I'm not excited about adding a
large chunk of complexity to cover another few percent.

            regards, tom lane

Re: What needs to be done for real Partitioning?

From
Greg Stark
Date:
Josh Berkus <josh@agliodbs.com> writes:

> -- INSERT INTO should automatically create new partitions where necessary
>     ---- new tables should automatically inherit all constraints, indexes,
>             keys of "parent" table

I think you're going about this backwards.

Phase I should be an entirely manual system where you add and remove
partitions manually and create and drop indexes you want manually. You need
these low level interfaces anyways for a complete system, it doesn't make
sense to have everything automatic and then later try to wedge in a low level
interface. Only once you have that do you then start offering options to do
these things automatically.

I also think there are a few other components mixed up in your proposal that
are really not integral to partitioned tables. Tablespaces and expression
indexes may well be useful features to use in combination with partitioned
tables, but they shouldn't be required or automatic.

From my experience with Oracle I think there's one big concept that makes the
whole system make a lot more sense: individual partitions are really tables.
The partitioned tables themselves are just meta-objects like views.

Once you get that concept the whole featureset makes a lot more sense. You can
pull a partition out of a partitioned table and it becomes a normal table. You
can take a normal table and put it into a partitioned table. Creating a new
partition or altering a partition is just the same as creating or altering a
new table (except for the actual data definition part).

Given that understanding it's clear that tablespaces are an entirely
orthogonal feature. One that happens to play well with partitioned tables, but
not one that partitioned tables need any special support for. When you create
a new partition or create a table intending to add it as a partition to a
partitioned table you specify the tablespace just as you would normally do.

It's also clear that the last thing you want is an index on the partition key.
A big part of the advantage of partitioned tables is precisely that you get
the advantage of an index on a column without the extra expense.

It would also be reasonable to allow clustering individual partitions;
creating table or column constraints on some partitions and not others; or
even allow having indexes on some partitions and not others. In general the
only operations that you wouldn't be able to do on an individual partition
would be operations that make the column definitions incompatible with the
parent table.

The $64 question is how to specify the partitioning rules. That is, the rule
for determining which partition an insert should go into and which partitions
to look for records in. Oracle handles this by specifying a list of columns
when creating the partitioned table and then specifying either a range or
specific values for each individual partition. I can imagine other approaches
but none that allow for the planner and optimizer to take as much advantage of
the information.

So I think Phase I should look like:

  An ALTER TABLE command to make an inherited table "abstract" in the object
  oriented sense. That is, no records can be inserted in the parent table. If
  you follow the oracle model this is also where you specify the partition
  key. There's no index associated with this partition key though.

  A command to create a new partition, essentially syntactic sugar for a
  CREATE TABLE with an implied INHERITS clause and a constraint on the
  partition key. If you follow the oracle model then you explicitly specify
  which range or specific value of the partition key this partition holds.

  A command to remove a partition from the partitioned table and turn it into
  a regular table.

  A command to take a regular table and turn it into a partition. Again here
  you specify the range or value of the partition key. There has to be some
  verification that the table really holds the correct data though. Perhaps
  this could be skipped by providing a table with a properly constructed
  constraint in place.

  Magic to make INSERT/UPDATE figure out the correct partition to insert the new
  record. (Normally I would have suggested that UPDATE wasn't really necessary
  but in Postgres it seems like it would fall out naturally from having INSERT.)

Phase II would be planner and executor improvements to take advantage of the
information to speed up queries and allow for individual partitions to be
read-only or otherwise inaccessible without impeding queries that don't need
that partition.

Phase III would be autopilot features like having new partitions automatically
created and destroyed and being able to specify in advance rules for
determining which tablespaces to use for these new partitions.

I'm not sure whether to put global indexes under Phase II or III. Personally I
think there's no point to them at all. They defeat the whole point of
partitioned tables. Once you have global indexes adding and removing
partitions becomes a lot harder and slower. You may as well have kept
everything in one table in the first place. But apparently some people find
them useful.

--
greg

Re: What needs to be done for real Partitioning?

From
PFC
Date:

> It would also be reasonable to allow clustering individual partitions;
> creating table or column constraints on some partitions and not others;

    I have a session mamagement which works like that, using views now.

    sessions.online is a table of the online sessions. It has a UNIQUE on
user_id.
    sessions.archive contains all the closed sessions. Obviously it does not
have a UNIQUE on user_id.





Re: What needs to be done for real Partitioning?

From
Tom Lane
Date:
Greg Stark <gsstark@mit.edu> writes:
> So I think Phase I should look like:

>   An ALTER TABLE command to make an inherited table "abstract" in the object
>   oriented sense. That is, no records can be inserted in the parent table. If
>   you follow the oracle model this is also where you specify the partition
>   key. There's no index associated with this partition key though.

Check.

>   A command to create a new partition, essentially syntactic sugar for a
>   CREATE TABLE with an implied INHERITS clause and a constraint on the
>   partition key. If you follow the oracle model then you explicitly specify
>   which range or specific value of the partition key this partition holds.

Check.

>   A command to remove a partition from the partitioned table and turn it into
>   a regular table.

Ugh.  Why?  You can access the table directly anyway.

>   A command to take a regular table and turn it into a partition.

Double ugh.  Verifying that the table matches the partition scheme seems
like a lot of ugly, bug-prone, unnecessary code.  What's the use case
for this anyway?

Those last two are *certainly* not Phase I requirements, and I don't
think we need them at all ever.

            regards, tom lane

Re: What needs to be done for real Partitioning?

From
"Stacy White"
Date:
From: "Tom Lane" <tgl@sss.pgh.pa.us>
> Alvaro Herrera <alvherre@dcc.uchile.cl> writes:
> > We probably also need multi-table indexes.
> As Josh says, that seems antithetical to the main point of partitioning,
> which is to be able to rapidly remove (and add) partitions of a table.
> If you have to do index cleaning before you can drop a partition, what's
> the point of partitioning?

Global indexes (as opposed to partition local indexes) are useful in cases
where you have a large number of partitions, index columns different than
the partition key, and index values that limit the query to just a subset of
the partitions.

The two domains that I'm most familiar with are warehouse management, and
the film industry. In both these cases it's logical to partition on
day/week/month, it's frequently important to keep a lot of history, and it's
common to have products that only show activity for a few months.  In one of
our production systems we have 800 partitions (by week, with a lot of
history), but a popular product might have only 20 weeks worth of activity.
Selecting records for the product requires at least 800 random-access reads
if you have local indexes on 'product_no', 780 of which just tell the
executor that the partition doesn't include any information on the product.

This is definitely a phase II item, but as I said before it's worth
considering since good DBAs can do a lot with global indexes.

FWIW, we see large benefits from partitioning other than the ability to
easily drop data, for example:

- We can vacuum only the active portions of a table
- Postgres automatically keeps related records clustered together on disk,
which makes it more likely that the blocks used by common queries can be
found in cache
- The query engine uses full table scans on the relevant sections of data,
and quickly skips over the irrelevant sections
- 'CLUSTER'ing a single partition is likely to be significantly more
performant than clustering a large table

In fact, we have yet to drop a partition on any of our Oracle or Postgres
production systems.


Re: What needs to be done for real Partitioning?

From
Josh Berkus
Date:
Alvaro, Greg, Tom,

> Hmm.  You are right, but without that we won't be able to enforce
> uniqueness on the partitioned table (we could only enforce it on each
> partition, which would mean we can't partition on anything else than
> primary keys if the tables have one).  IMHO this is something to
> consider.

Sure.  However, for most partitioned use cases, the partition column will be
part of the real key of the table (for example, for a security log, the real
key might be (timestamp, machine, application, event_type) with the partition
on extract(hour from timestamp)).   As a result, there is no need to enforce
inter-partition uniqueness; the paritioning scheme enforces it already.

The only need for inter-partition uniqueness is on surrogate integer keys.
This can already be enforced de-facto simply by using a sequence.  While it
would be possible to create a uniqueness check that spans partitions, it
would be very expensive to do so, thus elminating some of the advantage of
partitioning in the first place.   I'm not saying that we won't want this
some day as an option, I just see it as a Phase III refinement.

Greg, first of all, thanks for helping clean up my muddy thinking about
implementing partitions.  Comments below:

> Phase I should be an entirely manual system where you add and remove
> partitions manually and create and drop indexes you want manually. You need
> these low level interfaces anyways for a complete system, it doesn't make
> sense to have everything automatic and then later try to wedge in a low
> level interface. Only once you have that do you then start offering options
> to do these things automatically.

This makes sense.  Thanks!

> whole system make a lot more sense: individual partitions are really
> tables. The partitioned tables themselves are just meta-objects like views.

So, like the current pseudo-partitioning implementation, partitions would be
"full tables" just with some special rules for query-rewriting when they are
pulled.  This makes sense, I think I just got carried away in another
direction.

> It's also clear that the last thing you want is an index on the partition
> key. A big part of the advantage of partitioned tables is precisely that
> you get the advantage of an index on a column without the extra expense.

Well, you need it with the current pseudo-partitioning.   What would allow us
to eliminate indexing the partition key is special re-writing rules that only
pull the partitions compliant with the outer query.  Until that step takes
place, the indexes are very much needed.  So maybe the advanced planner
rewriting is a Phase I item, not a Phase II item?

> The $64 question is how to specify the partitioning rules. That is, the
> rule for determining which partition an insert should go into and which
> partitions to look for records in. Oracle handles this by specifying a list
> of columns when creating the partitioned table and then specifying either a
> range or specific values for each individual partition. I can imagine other
> approaches but none that allow for the planner and optimizer to take as
> much advantage of the information.

Well, I would think that specifying an expression that defines a new partition
at each change in value (like EXTRACT(day FROM timestamp) on a time-based
partitioning) would cover 90% of implemenations and be a lot simpler to
administer.   The Oracle approach has the advantage of allowing "custom
paritioning" at the expense of greater complexity.

>   A command to remove a partition from the partitioned table and turn it
> into a regular table.
>
>   A command to take a regular table and turn it into a partition. Again
> here you specify the range or value of the partition key. There has to be
> some verification that the table really holds the correct data though.
> Perhaps this could be skipped by providing a table with a properly
> constructed constraint in place.

Like Tom, I don't see the point in these.  What do they do that CREATE TABLE
AS and/or INSERT INTO do not?

--
Josh Berkus
Aglio Database Solutions
San Francisco

Re: What needs to be done for real Partitioning?

From
Oleg Bartunov
Date:
On Sun, 20 Mar 2005, Josh Berkus wrote:

>
>> whole system make a lot more sense: individual partitions are really
>> tables. The partitioned tables themselves are just meta-objects like views.

If partition is a table, so I could define different indices for them ?
In our prototype of scaled full text search we create another index
which is optimized for "archived" (not changed) data - it's sort of
standard inverted index which is proven to be scaled, while tsearch2's index
is good for "online" data. All interfaces ( dictionaries, parsers, ranking)
are the same, so it's possible to combine search results.
This is rather easy to implement using table inheritance, but I'd like
to do this with partitioning


     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

Re: What needs to be done for real Partitioning?

From
Greg Stark
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> >   A command to remove a partition from the partitioned table and turn it into
> >   a regular table.
>
> Ugh.  Why?  You can access the table directly anyway.
>
> >   A command to take a regular table and turn it into a partition.
>
> Double ugh.  Verifying that the table matches the partition scheme seems
> like a lot of ugly, bug-prone, unnecessary code.  What's the use case
> for this anyway?
>
> Those last two are *certainly* not Phase I requirements, and I don't
> think we need them at all ever.

These are effectively equivalent to "ALTER TABLE RENAME". Without these
commands you would be in pretty much the same position as a DBA without the
ability to rename tables.

The heart of partitioned tables is being able to load and unload entire
partitions quickly. You have to have somewhere to "unload" them too. Most
people aren't happy just watching their data disappear entirely. They want to
move them other tables or even other databases.

Similarly, they have to have somewhere to load them from. They're usually not
happy loading data directly into their production data warehouse tables
without manipulating the data, or doing things like clustering or indexing.

You could argue for some sort of setup where you could take a partition
"offline" during which you could safely do things like export or manipulate
the data. But that's awfully limiting. What if I want to do things like add
columns, or change data types, or any other manipulation that breaks the
symmetry with the production partitioned table.

I don't think it's really hard at all to check that the table matches the
partition scheme. You can just require that there be an existing table
constraint in place that matches the partitioning scheme. I think you can even
be fascist about the exact syntax of the constraint fitting precisely a
specified format.

--
greg

Re: What needs to be done for real Partitioning?

From
Tom Lane
Date:
"Stacy White" <harsh@computer.org> writes:
> FWIW, we see large benefits from partitioning other than the ability to
> easily drop data, for example:

> - We can vacuum only the active portions of a table
> - Postgres automatically keeps related records clustered together on disk,
> which makes it more likely that the blocks used by common queries can be
> found in cache
> - The query engine uses full table scans on the relevant sections of data,
> and quickly skips over the irrelevant sections
> - 'CLUSTER'ing a single partition is likely to be significantly more
> performant than clustering a large table

Global indexes would seriously reduce the performance of both vacuum and
cluster for a single partition, and if you want seq scans you don't need
an index for that at all.  So the above doesn't strike me as a strong
argument for global indexes ...

            regards, tom lane

Re: What needs to be done for real Partitioning?

From
Tom Lane
Date:
Greg Stark <gsstark@mit.edu> writes:
> You could argue for some sort of setup where you could take a partition
> "offline" during which you could safely do things like export or manipulate
> the data. But that's awfully limiting. What if I want to do things like add
> columns, or change data types, or any other manipulation that breaks the
> symmetry with the production partitioned table.

[ scrapes eyebrows off ceiling... ]  You don't really expect to be able
to do that kind of thing to just one partition do you?

            regards, tom lane

Re: What needs to be done for real Partitioning?

From
Greg Stark
Date:
Josh Berkus <josh@agliodbs.com> writes:

> Well, I would think that specifying an expression that defines a new partition
> at each change in value (like EXTRACT(day FROM timestamp) on a time-based
> partitioning) would cover 90% of implemenations and be a lot simpler to
> administer.   The Oracle approach has the advantage of allowing "custom
> paritioning" at the expense of greater complexity.

Hm. This is where I might be less helpful. Once you're submersed in one way of
doing things it can be hard to think outside the box like this.

But I fear this scheme might be harder to actually take advantage of. If I do
a query like

 WHERE timestamp BETWEEN '2005-01-01 11:00' AND '2005-01-01 12:00'

How do you determine which partitions that range will cover?

Also, it seems like it would be inconvenient to try to construct expressions
to handle things like "start a new partition ever 1 million values".

And worse, how would you handle changing schemes with this? Like, say we want
to switch from starting one partition per month to starting one partition per
week?



I think some actual use cases might be helpful for you. I can contribute an
interesting one, though I have to be intentionally vague even though I don't
work on that system any more.

We had a table with a layout like:

txnid serial,
groupid integer,
data...

Each day a cron job created 6 new groups (actually later that was changed to
some other number). It then added a new partition to handle the range of the
new day's groups. Later another cron job exchanged out the partition from a
week earlier and exported that table, transfered it to another machine and
loaded it there.

txnid was a unique identifier but we couldn't have a unique constraint because
that would have required a global index. That didn't cause any problems since
it was a sequence generated column anyways.

We did have a unique index on <groupid,txnid> which is a local index because
groupid was the partition key. In reality nothing in our system ever really
needed a txn without knowing which group it came from anyways, so it was easy
to change our queries to take advantage of this.

We had a lot of jobs, some *extremely* performance sensitive that depended on
being able to scan the entire list of txns for a given day or a given set of
groupids. The partitions meant it could do a full table scan which made these
extremely fast.

This was with Oracle 8i. All partition keys in 8i were ranges. In 9 Oracle
added the ability to make partition reference specific id values. Sort of like
how you're describing having a key expression. We might have considered using
that scheme with groupid but then it would have meant adding a bunch of new
partitions each day and having some queries that would involve scanning
multiple partitions.

--
Greg

Re: What needs to be done for real Partitioning?

From
Greg Stark
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> Global indexes would seriously reduce the performance of both vacuum and
> cluster for a single partition, and if you want seq scans you don't need
> an index for that at all.  So the above doesn't strike me as a strong
> argument for global indexes ...

I think he means some sort of plan for queries like

  select * from invoices where customer_id = 1

where customer 1 only did business with us for two years. One could imagine
some kind of very coarse grained bitmap index that just knows which partitions
customer_id=1 appears in, and then does a sequential scan of those partitions.

But I think you can do nearly as well without using global indexes of any
type. Assuming you had local indexes on customer_id for each partition and
separate histograms for each partition the planner could conclude that it
needs sequential scans for some partitions and a quick index lookup expecting
0 records for other partitions.

Not as good as pruning partitions entirely but if you're doing a sequential
scan the performance hit of a few index lookups isn't a problem.

--
greg

Re: What needs to be done for real Partitioning?

From
Greg Stark
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> Greg Stark <gsstark@mit.edu> writes:
> > You could argue for some sort of setup where you could take a partition
> > "offline" during which you could safely do things like export or manipulate
> > the data. But that's awfully limiting. What if I want to do things like add
> > columns, or change data types, or any other manipulation that breaks the
> > symmetry with the production partitioned table.
>
> [ scrapes eyebrows off ceiling... ]  You don't really expect to be able
> to do that kind of thing to just one partition do you?

Well no. That's exactly why I would want to pull the partition out of the
partitioned table so that I can then do whatever work I need to archive it
without affecting the partitioned table.

Take an analogous situation. I have a huge log file I want to rotate. The
quickest most efficient way to do this would be to move it aside, HUP the
daemon (or whatever else I have to do to get it to open a new file) then gzip
and archive the old log files.

--
greg

Re: What needs to be done for real Partitioning?

From
"Stacy White"
Date:
From: "Tom Lane" <tgl@sss.pgh.pa.us>
> "Stacy White" <harsh@computer.org> writes:
> > FWIW, we see large benefits from partitioning other than the ability to
> > easily drop data, for example:
>
> > - We can vacuum only the active portions of a table
> > - Postgres automatically keeps related records clustered together on
disk,
> > which makes it more likely that the blocks used by common queries can be
> > found in cache
> > - The query engine uses full table scans on the relevant sections of
data,
> > and quickly skips over the irrelevant sections
> > - 'CLUSTER'ing a single partition is likely to be significantly more
> > performant than clustering a large table
> Global indexes would seriously reduce the performance of both vacuum and
> cluster for a single partition, and if you want seq scans you don't need
> an index for that at all.  So the above doesn't strike me as a strong
> argument for global indexes ...

Tom, this list was in response to your question "If you have to do index
cleaning before you can drop a partition, what's the point of
partitioning?".  I was trying to make the point that partioning isn't just
about being able to quickly drop data.  The argument for global indexes came
in the form of my war story and the description of the conditions under
which global indexes will perform better than local indexes (see my original
email for details) .  But, like I said, this would definitely be a phase
II/III item.


Re: What needs to be done for real Partitioning?

From
"Stacy White"
Date:
From: "Greg Stark" <gsstark@mit.edu>
> Tom Lane <tgl@sss.pgh.pa.us> writes:
> Not as good as pruning partitions entirely but if you're doing a
sequential
> scan the performance hit of a few index lookups isn't a problem.

Greg, I think you've got the right idea.  For large databases, though, it
won't be uncommon to have large numbers of partitions, in which case we're
not talking about a few index lookups.  The database I used in my example
wasn't huge, but the table in question had over 800 partitions.  A larger
database could have thousands.  I suppose the importance of global indexes
depends on the sizes of the databases your target audience is running.

Here's some more detail on our real-world experience:  The group made the
decision to partition some of the larger tables for better performance.  The
idea that global indexes aren't useful is pretty common in the database
world, and 2 or 3 good DBAs suggested that the 'product_no' index be local.
But with the local indexes, performance on some queries was bad enough that
the group actually made the decision to switch back to unpartitioned tables.
(The performance problems came about because of the overhead involved in
searching >800 indices to find the relevant rows).

Luckily they that had the chance to work with a truly fantastic DBA (the
author of an Oracle Press performance tuning book even) before they could
switch back.  He convinced them to make some of their indexes global.
Performance dramatically improved (compared with both the unpartitioned
schema, and the partitioned-and-locally-indexed schema), and they've since
stayed with partitioned tables and a mix of local and global indexes.

But once again, I think that global indexes aren't as important as the Phase
I items in any of the Phase I/Phase II breakdowns that have been proposed in
this thread.


Re: What needs to be done for real Partitioning?

From
Josh Berkus
Date:
Stacy,

> Luckily they that had the chance to work with a truly fantastic DBA (the
> author of an Oracle Press performance tuning book even) before they could
> switch back.  He convinced them to make some of their indexes global.
> Performance dramatically improved (compared with both the unpartitioned
> schema, and the partitioned-and-locally-indexed schema), and they've since
> stayed with partitioned tables and a mix of local and global indexes.

Hmmm.  Wouldn't Greg's suggestion of a bitmap index which holds information on
what values are found in what partition also solve this?    Without 1/2 of
the overhead imposed by global indexes?

I can actually see such a bitmap as being universally useful to the
partitioning concept ... for one, it would resolve the whole "partition on
{value}" issue.

--
Josh Berkus
Aglio Database Solutions
San Francisco

Re: What needs to be done for real Partitioning?

From
"Jim C. Nasby"
Date:
On Mon, Mar 21, 2005 at 09:55:03AM -0800, Josh Berkus wrote:
> Stacy,
>
> > Luckily they that had the chance to work with a truly fantastic DBA (the
> > author of an Oracle Press performance tuning book even) before they could
> > switch back.  He convinced them to make some of their indexes global.
> > Performance dramatically improved (compared with both the unpartitioned
> > schema, and the partitioned-and-locally-indexed schema), and they've since
> > stayed with partitioned tables and a mix of local and global indexes.
>
> Hmmm.  Wouldn't Greg's suggestion of a bitmap index which holds information on
> what values are found in what partition also solve this?    Without 1/2 of
> the overhead imposed by global indexes?
>
> I can actually see such a bitmap as being universally useful to the
> partitioning concept ... for one, it would resolve the whole "partition on
> {value}" issue.

I suspect both will have their uses. I've read quite a bit about global
v. local indexs in Oracle, and there are definately cases where global
is much better than local. Granted, there's some things with how Oracle
handles their catalog, etc. that might make local indexes more expensive
for them than they would be for PostgreSQL. It's also not clear how much
a 'partition bitmap' index would help.

As for the 'seqscan individual partitions' argument, that's not going to
work well at all for a case where you need to hit a relatively small
percentage of rows in a relatively large number of partitions. SELECT
... WHERE customer_id = 1 would be a good example of such a query
(assuming the table is partitioned on something like invoice_date).
--
Jim C. Nasby, Database Consultant               decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"

Re: What needs to be done for real Partitioning?

From
"Jim C. Nasby"
Date:
On Sat, Mar 19, 2005 at 07:05:53PM -0500, Tom Lane wrote:
> Alvaro Herrera <alvherre@dcc.uchile.cl> writes:
> > We probably also need multi-table indexes.
>
> As Josh says, that seems antithetical to the main point of partitioning,
> which is to be able to rapidly remove (and add) partitions of a table.
> If you have to do index cleaning before you can drop a partition, what's
> the point of partitioning?

Why would you need to do index cleaning first? Presumably the code that
goes to check a heap tuple that an index pointed at to ensure that it
was visible in the current transaction would be able to recognize if the
partition that tuple was in had been removed, and just ignore that index
entry. Granted, you'd need to clean the index up at some point
(presumably via vacuum), but it doesn't need to occur at partition drop
time.
--
Jim C. Nasby, Database Consultant               decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"

Re: What needs to be done for real Partitioning?

From
"Jim C. Nasby"
Date:
I think Greg's email did a good job of putting this on track. Phase 1
should be manual, low-level type of support. Oracle has had partitioning
for years now, and IF they've added automated partition management, it's
only happened in 10g which is pretty recent.

For inserts that don't currently have a defined partition to fit in, the
Oracle model might be better than tossing an error: a partitioned table
in Oracle also contains a default partition. Any rows that don't match a
defined partition go into the default partition. For many cases you'll
never have anything in the default partition, but sometimes you'll have
some partition values that occur infrequenttly enough in the table so as
not to warrant their own partition.

There's also another partitioning application that I think is often
overlooked. I have a table with about 130M rows that is
'pseudo-partitioned' by project_id. Right now, there are 5 different
project IDs that account for the bulk of those 130M rows. Oracle
provides a means to partition on discreet values. When you do this,
there's not actually any reason to even store the partition field in the
partition tables, since it will be the same for every row in the
partition. In my case, since the smallint project ID is being aligned to
a 4 byte boundary, having this feature would save ~120M rows * 4 bytes =
480MB in the table. Granted, 480MB isn't anything for today's disk
sizes, but it makes a huge difference when you look at it from an I/O
standpoint. Knowing that a partition contains only one value of a field
or set of fields also means you can drop those fields from local indexes
without losing any effectiveness. In my case, I have 2 indexes I could
drop project_id from. Does each node in a B-tree index have the full
index key? If so, then there would be substantial I/O gains to be had
there, as well. Even if each node doesn't store the full key, there
could still be faster to handle a narrower index.

I realize this might be a more difficult case to support. It probably
couldn't be done using inheritance, though I don't know if inheritence
or a union view is better for partitioning. In either case, this case
might not be a good candidate for phase 1, but I think partitioning
should be designed with it in mind.
--
Jim C. Nasby, Database Consultant               decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"

Re: What needs to be done for real Partitioning?

From
Simon Riggs
Date:
On Sun, 2005-03-20 at 01:14 -0500, Greg Stark wrote:
> Josh Berkus <josh@agliodbs.com> writes:
>
> > -- INSERT INTO should automatically create new partitions where necessary
> >     ---- new tables should automatically inherit all constraints, indexes,
> >             keys of "parent" table
>
> I think you're going about this backwards.

Certainly, there are two schools of thought here. I have been in two
minds about which those two designs previously, and indeed here which
one to support.

> Phase I should be an entirely manual system where you add and remove
> partitions manually and create and drop indexes you want manually. You need
> these low level interfaces anyways for a complete system, it doesn't make
> sense to have everything automatic and then later try to wedge in a low level
> interface. Only once you have that do you then start offering options to do
> these things automatically.

Maybe its just me, but ISTM that implementing an automatic system is
actually easier to begin with. No commands, no syntax etc. You're right,
you need the low level interfaces anyway...

> From my experience with Oracle I think there's one big concept that makes the
> whole system make a lot more sense: individual partitions are really tables.
> The partitioned tables themselves are just meta-objects like views.

Hmmm. Oracle provides a very DBA-intensive implementation that as Stacy
points out, many people still do not understand. It does work, well. And
has many of the wrinkles ironed out, even if not all of them are easy to
understand why they exist at first glance.

I think it most likely that Phase I should be a simplified blend of both
ideas, with a clear view towards minimum impact and implementability,
otherwise it may not make the cut for 8.1

Best Regards, Simon Riggs


Re: What needs to be done for real Partitioning?

From
Hannu Krosing
Date:
On E, 2005-03-21 at 09:55 -0800, Josh Berkus wrote:
> Stacy,
>
> > Luckily they that had the chance to work with a truly fantastic DBA (the
> > author of an Oracle Press performance tuning book even) before they could
> > switch back.  He convinced them to make some of their indexes global.
> > Performance dramatically improved (compared with both the unpartitioned
> > schema, and the partitioned-and-locally-indexed schema), and they've since
> > stayed with partitioned tables and a mix of local and global indexes.
>
> Hmmm.  Wouldn't Greg's suggestion of a bitmap index which holds information on
> what values are found in what partition also solve this?    Without 1/2 of
> the overhead imposed by global indexes?
>
> I can actually see such a bitmap as being universally useful to the
> partitioning concept ... for one, it would resolve the whole "partition on
> {value}" issue.

I once (maybe about a year ago) tried to elaborate using bitmap
index(es) with page granularity as a tool for simultaneous clustering
and lookup for data warehousing using postgres. the main idea was to
determine storage location from AND of all "clustered" bitmap indexes
and corresponding fast and clustered lookups.

This could/should/maybe :) possibly be combined with clustering as well.

--
Hannu Krosing <hannu@tm.ee>

Re: What needs to be done for real Partitioning?

From
Hannu Krosing
Date:
On L, 2005-03-19 at 12:02 -0800, Josh Berkus wrote:
> Folks,
>
> I may (or may not) soon have funding for implementing full table partitioning
> in PostgreSQL.

If you don't get it, contact me as there is a small possibility that I
know a company interested enough to fund (some) of it :)

> I thought it would be a good idea to discuss with people here
> who are already using pseudo-partitioning what things need to be added to
> Postgresql in order to make full paritioning a reality; that is, what do
> other databases do that we don't?

As these are already discussed in this thread, I'll try to outline a
method of providing a global index (unique or not) in a way that will
still make it possible to quickly remove (and not-quite-so-quickly add)
a partition.

The structure is inspired by the current way of handling >1Gb tables.

As each tid consists of 32 bit page pointer we have pointerspace of
35184372088832 bytes/index (4G of 8k pages). currently this is directly
partitioned mapped to 1Gbyte/128kpage files, but we can, with minimal
changes to indexes, put a lookup table between index and page lookup.

In case of global index over partitions this table could point to 1G
subtables from different partition tables.

The drop partition table can also be fast - just record the pages in
lookup table as deleted - means one change per 1G of dropped table.
The next vacuum should free pointers to deleted subfiles.

Adding partitions is trickier -

If the added table forms part of partitioning index (say names from C to
E), and there is a matching index on subtable,

Then that part of btree can probably copied into the main btree index as
a tree btanch, which should be relatively fast (compared to building it
one tid at a time).

Else adding the the index could probably also be sped up by some kind of
index merge - faster than building from scratch but slower than above.


To repeat - the global index over partitioned table should have te same
structure as our current b-tree index, only with added map of 128k index
partitions to 1G subfiles of (possibly different) tables. This map will
be quite small - for 1Tb of data it will be only 1k entries - this will
fit in cache on all modern processors and thus should add only tiny
slowdown from current direct tid.page/128k method

--
Hannu Krosing <hannu@tm.ee>

Re: What needs to be done for real Partitioning?

From
Hannu Krosing
Date:
On P, 2005-03-20 at 00:52 +0100, PFC wrote:
> > tableoid would accomplish that already, assuming that the "partitioned
> > table" is effectively a view on separate physical tables.
> >
> >             regards, tom lane
>
>     Very good.
>
>     Also note the possibility to mark a partition READ ONLY. Or even a table.
>     It does not seem very useful but just think that for instance the "1999",
> "2000" ... "2004" partitions of a big archive probably never change.
> READLONY means we're sure they never change, thus no need to backup them
> every time. Keeping the example of some DB arranged by years / current
> year / current month, Just backup the "current month" part every day and
> the "current year" every month when you switch partitions.
>     This could be achieved also by storing the time of last modification of a
> table somewhere.

Would we still need regular VACUUMing of read-only table to avoid
OID-wraparound ?

--
Hannu Krosing <hannu@tm.ee>

Re: What needs to be done for real Partitioning?

From
Hannu Krosing
Date:
On L, 2005-03-19 at 19:03 -0500, Tom Lane wrote:
> Josh Berkus <josh@agliodbs.com> writes:
> >>> -- INSERT INTO should automatically create new partitions where necessary
> >>> -- DELETE FROM should automatically drop empty partitions
> >>
> >> I am not sure I agree with either of those, and the reason is that they
> >> would turn low-lock operations into high-lock operations.
>
> > For INSERT, I think that's a problem we need to work through.
>
> Possibly, but I'm concerned about locking and deadlock issues.  The
> reason that this is iffy is you would start the operation with only
> an INSERT-grade lock, and then discover that you needed to add a
> partition, which is surely something that needs an exclusive-grade
> lock (consider two sessions trying to add the same partition at the
> same time).  So I don't see how to do it without lock upgrading,
> and lock upgrading is always a recipe for deadlocks.
>
> The DELETE case is even worse because you can't physically release
> storage until you're sure nothing in it is needed anymore by any open
> transaction --- that introduces VACUUM-like issues as well as the
> deadlock problem.
>

If we go with my proposal (other post in this thread) of doing most of
the partitioning in the level between logical file and physikal 1Gb
storage files, then adding a partition should be nearly the same as
crossing the 1G boundary is now.

removing the partition would be just plain vacuum (if we can make pg
shring each 1G subfile independently)

> ---------------------------(end of broadcast)---------------------------
> TIP 8: explain analyze is your friend
--
Hannu Krosing <hannu@tm.ee>

Re: What needs to be done for real Partitioning?

From
Hannu Krosing
Date:
On L, 2005-03-19 at 23:47 -0500, Tom Lane wrote:
> Alvaro Herrera <alvherre@dcc.uchile.cl> writes:
> > Hmm.  You are right, but without that we won't be able to enforce
> > uniqueness on the partitioned table (we could only enforce it on each
> > partition, which would mean we can't partition on anything else than
> > primary keys if the tables have one).  IMHO this is something to
> > consider.
>
> Well, partitioning on the primary key would be Good Enough for 95% or
> 99% of the real problems out there.  I'm not excited about adding a
> large chunk of complexity to cover another few percent.

That automatically means that partitioning expression has to be a range
over PK. (you dont want to have every tuple in separate tabel :)

And it also means that you have to automatically create new partitions.

Are you sure that partitioning on anything else than PK would be
significantly harder ?

I have a case where I do manual partitioning over start_time
(timestamp), but the PK is an id from a sequence. They are almost, but
not exactly in the same order. And I don't think that moving the PK to
be (start_time, id) just because of "partitioning on PK only" would be a
good design in any way.

So please don't design the system to partition on PK only.

--
Hannu Krosing <hannu@tm.ee>

Re: What needs to be done for real Partitioning?

From
Alvaro Herrera
Date:
On Mon, Mar 21, 2005 at 08:26:24PM +0200, Hannu Krosing wrote:
> On P, 2005-03-20 at 00:52 +0100, PFC wrote:

> >     Also note the possibility to mark a partition READ ONLY. Or even a table.

> Would we still need regular VACUUMing of read-only table to avoid
> OID-wraparound ?

You could VACUUM FREEZE the table or partition, so you wouldn't need to
vacuum it again.

--
Alvaro Herrera (<alvherre[@]dcc.uchile.cl>)
"Someone said that it is at least an order of magnitude more work to do
production software than a prototype. I think he is wrong by at least
an order of magnitude."                              (Brian Kernighan)

Re: What needs to be done for real Partitioning?

From
Hannu Krosing
Date:
On T, 2005-03-22 at 09:10 -0400, Alvaro Herrera wrote:
> On Mon, Mar 21, 2005 at 08:26:24PM +0200, Hannu Krosing wrote:
> > On P, 2005-03-20 at 00:52 +0100, PFC wrote:
>
> > >     Also note the possibility to mark a partition READ ONLY. Or even a table.
>
> > Would we still need regular VACUUMing of read-only table to avoid
> > OID-wraparound ?
>
> You could VACUUM FREEZE the table or partition, so you wouldn't need to
> vacuum it again.

But when I do just VACUUM; will this know to avoid vacuuming VACUUM
FREEZE'd partitions ?

Or could this be somehow liked to READ ONLY + VACUUM FREEZE state ?

--
Hannu Krosing <hannu@tm.ee>

Re: What needs to be done for real Partitioning?

From
Josh Berkus
Date:
Hannu,

> If you don't get it, contact me as there is a small possibility that I
> know a company interested enough to fund (some) of it :)

Enough people have been interested in this that if we get our acts together,
we may do it as multi-funded.   Easier on our budget ...

> As these are already discussed in this thread, I'll try to outline a
> method of providing a global index (unique or not) in a way that will
> still make it possible to quickly remove (and not-quite-so-quickly add)
> a partition.
<snip>
> To repeat - the global index over partitioned table should have te same
> structure as our current b-tree index, only with added map of 128k index
> partitions to 1G subfiles of (possibly different) tables. This map will
> be quite small - for 1Tb of data it will be only 1k entries - this will
> fit in cache on all modern processors and thus should add only tiny
> slowdown from current direct tid.page/128k method

I think this is a cool idea.  It would need to be linked to clustering, so
that each partition can be an iteration of the clustered index instead of a
specifc # of bytes.  But it would give us the "fully automated partitioning"
which is one fork of the two we want.

Plus I'm keen on any idea that presents an alternative to aping Oracle.

How difficult would your proposal be to code?

--
Josh Berkus
Aglio Database Solutions
San Francisco

Re: What needs to be done for real Partitioning?

From
Bruce Momjian
Date:
Added to TODO:

* Support table partitioning that allows a single table to be stored
  in subtables that are partitioned based on the primary key or a WHERE
  clause


---------------------------------------------------------------------------

Josh Berkus wrote:
> Hannu,
>
> > If you don't get it, contact me as there is a small possibility that I
> > know a company interested enough to fund (some) of it :)
>
> Enough people have been interested in this that if we get our acts together,
> we may do it as multi-funded.   Easier on our budget ...
>
> > As these are already discussed in this thread, I'll try to outline a
> > method of providing a global index (unique or not) in a way that will
> > still make it possible to quickly remove (and not-quite-so-quickly add)
> > a partition.
> <snip>
> > To repeat - the global index over partitioned table should have te same
> > structure as our current b-tree index, only with added map of 128k index
> > partitions to 1G subfiles of (possibly different) tables. This map will
> > be quite small - for 1Tb of data it will be only 1k entries - this will
> > fit in cache on all modern processors and thus should add only tiny
> > slowdown from current direct tid.page/128k method
>
> I think this is a cool idea.  It would need to be linked to clustering, so
> that each partition can be an iteration of the clustered index instead of a
> specifc # of bytes.  But it would give us the "fully automated partitioning"
> which is one fork of the two we want.
>
> Plus I'm keen on any idea that presents an alternative to aping Oracle.
>
> How difficult would your proposal be to code?
>
> --
> Josh Berkus
> Aglio Database Solutions
> San Francisco
>
> ---------------------------(end of broadcast)---------------------------
> TIP 7: don't forget to increase your free space map settings
>

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: What needs to be done for real Partitioning?

From
"Roger Hand"
Date:
On March 21, 2005 8:07 AM, Hannu Krosing wrote:
> On L, 2005-03-19 at 23:47 -0500, Tom Lane wrote:
> > Well, partitioning on the primary key would be Good Enough for 95% or
> > 99% of the real problems out there.  I'm not excited about adding a
> > large chunk of complexity to cover another few percent.
>
> Are you sure that partitioning on anything else than PK would be
> significantly harder ?
>
> I have a case where I do manual partitioning over start_time
> (timestamp), but the PK is an id from a sequence. They are almost, but
> not exactly in the same order. And I don't think that moving the PK to
> be (start_time, id) just because of "partitioning on PK only" would be a
> good design in any way.
>
> So please don't design the system to partition on PK only.

I agree. I have used table partitioning to implement pseudo-partitioning, and I am very pleased with the results so
far.Real partitioning would be even better, but I am partitioning by timestamp, and this is not the PK, and I don't
wishto make it one. 

-Roger

Re: What needs to be done for real Partitioning?

From
Yann Michel
Date:
Hi,

On Sun, Mar 20, 2005 at 06:01:49PM -0500, Tom Lane wrote:
> Global indexes would seriously reduce the performance of both vacuum and
> cluster for a single partition, and if you want seq scans you don't need
> an index for that at all.  So the above doesn't strike me as a strong
> argument for global indexes ...

I'd like to describe a usecase where a global index is usefull.

We have a datawarehouse with invoices for a rolling window of a few
years. Each invoice has several positions so a uk is
(invoice,position). Dur to the fact that most of the queries are only on
a few months or some quarters of a year, our pk starts with the
time-attribute (followed by the dimension ids) which is the partition
key (range). During the nightly update, we receive each updated invoice
so we have to update that special (global unique) row which is resolved
very fast by using the uk.

So you can see, that there is a usefull case for providing a global
index while using partitining and local indexes as well.

Regards,
Yann