Thread: Syntax for partitioning

Syntax for partitioning

From

Itagaki Takahiro

Date:

28 October 2009, 23:15:08

I'd like to improve partitioning feature in 8.5.
Kedar-san's previous work is wonderful, but I cannot see any updated patch.
http://archives.postgresql.org/message-id/bd8134a40906080702s96c90a9q3bbb581b9bd0d5d7@mail.gmail.com

So, I'll take over the work if there are no ones to do it.
I'm thinking to add syntax support first. Table partitioning was
proposed many times, but it is still not applied into core.
The reason is it is too difficult to make perfect partitioning
feature at once. I think syntax support is a good start.

First, I will add syntax for CREATE TABLE, ALTER TABLE ADD/DROP PARTITION.
The syntax is borrowed from from Oracle and MySQL. Their characteristics
are using "LESS THAN" in range partitioning. The keyword "PARTITION" is
added to the full-reserved keyword list to support ADD/DROP PARTITION.

Those syntax is merely a syntax sugar for INHERITS with CHECK. Declarations
are translated into CHECK constraints. I have a plan to adjust pg_dump to
dump definitions of partitioning in the correct format, but the actual
implementation will be still based on constraint exclusion. In addition,
hash partitioning is not implemented; syntax is parsed but "not implemented"
error are raised for now.

Here is syntax I propose:
----
ALTER TABLE table_name ADD PARTITION name ...;
ALTER TABLE table_name DROP PARTITION [IF EXISTS] name [CASCADE | RESTRICT];

Range partitioning: CREATE TABLE table_name ( columns )   PARTITION BY RANGE ( a_expr )   (     PARTITION name VALUES
LESSTHAN [(] const [)],     PARTITION name VALUES LESS THAN [(] MAXVALUE [)] -- overflow partition   );
 

List partitioning: CREATE TABLE table_name ( columns )   PARTITION BY LIST ( a_expr )   (     PARTITION name VALUES
[IN]( const [, ...] ),     PARTITION name VALUES [IN] [(] DEFAULT [)]       -- overflow partition   );
 

Hash partitioning: CREATE TABLE table_name ( columns )   PARTITION BY HASH ( a_expr )   PARTITIONS num_partitions;
 CREATE TABLE table_name ( columns )   PARTITION BY HASH ( a_expr )   (     PARTITION name,     ...   );

Note: * Each partition can have optional WITH (...) and TABLESPACE clauses. * '(' and ')' are optional to support both
Oracleand MySQL syntax.
 
----

Comments welcome.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

Re: Syntax for partitioning

From

Pavel Stehule

Date:

29 October 2009, 01:40:11

2009/10/29 Itagaki Takahiro <itagaki.takahiro@oss.ntt.co.jp>:
> I'd like to improve partitioning feature in 8.5.
> Kedar-san's previous work is wonderful, but I cannot see any updated patch.
> http://archives.postgresql.org/message-id/bd8134a40906080702s96c90a9q3bbb581b9bd0d5d7@mail.gmail.com
>
> So, I'll take over the work if there are no ones to do it.
> I'm thinking to add syntax support first. Table partitioning was
> proposed many times, but it is still not applied into core.
> The reason is it is too difficult to make perfect partitioning
> feature at once. I think syntax support is a good start.
>
> First, I will add syntax for CREATE TABLE, ALTER TABLE ADD/DROP PARTITION.
> The syntax is borrowed from from Oracle and MySQL. Their characteristics
> are using "LESS THAN" in range partitioning. The keyword "PARTITION" is
> added to the full-reserved keyword list to support ADD/DROP PARTITION.
>
> Those syntax is merely a syntax sugar for INHERITS with CHECK. Declarations
> are translated into CHECK constraints. I have a plan to adjust pg_dump to
> dump definitions of partitioning in the correct format, but the actual
> implementation will be still based on constraint exclusion. In addition,
> hash partitioning is not implemented; syntax is parsed but "not implemented"
> error are raised for now.
>
> Here is syntax I propose:
> ----
> ALTER TABLE table_name ADD PARTITION name ...;
> ALTER TABLE table_name DROP PARTITION [IF EXISTS] name [CASCADE | RESTRICT];
>
> Range partitioning:
>  CREATE TABLE table_name ( columns )
>    PARTITION BY RANGE ( a_expr )
>    (
>      PARTITION name VALUES LESS THAN [(] const [)],
>      PARTITION name VALUES LESS THAN [(] MAXVALUE [)] -- overflow partition
>    );
>
> List partitioning:
>  CREATE TABLE table_name ( columns )
>    PARTITION BY LIST ( a_expr )
>    (
>      PARTITION name VALUES [IN] ( const [, ...] ),
>      PARTITION name VALUES [IN] [(] DEFAULT [)]       -- overflow partition
>    );
>
> Hash partitioning:
>  CREATE TABLE table_name ( columns )
>    PARTITION BY HASH ( a_expr )
>    PARTITIONS num_partitions;
>
>  CREATE TABLE table_name ( columns )
>    PARTITION BY HASH ( a_expr )
>    (
>      PARTITION name,
>      ...
>    );
>
> Note:
>  * Each partition can have optional WITH (...) and TABLESPACE clauses.
>  * '(' and ')' are optional to support both Oracle and MySQL syntax.
> ----
>
> Comments welcome.

+1

Pavel

>
> Regards,
> ---
> ITAGAKI Takahiro
> NTT Open Source Software Center
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>

Re: Syntax for partitioning

From

Nikhil Sontakke

Date:

29 October 2009, 07:40:27

Hi,

> So, I'll take over the work if there are no ones to do it.
> I'm thinking to add syntax support first. Table partitioning was
> proposed many times, but it is still not applied into core.
> The reason is it is too difficult to make perfect partitioning
> feature at once. I think syntax support is a good start.

Guess we are back to square one again on Partitioning :), but as long
as someone is willing to walk the whole nine yards with it, that would
be just great!

I had proposed Oracle style syntax a while back and had also submitted
a WIP patch then. Again then my motive was to move forward in a
piece-meal fashion on this feature. First solidify the syntax, keep
using the existing inheritance mechanism and go one step at a time. I
think a feature like Partitioning needs this kind of an approach,
because it might turn out to be a lot of work with a lot of very many
sub items.

So +1 on solidifying the syntax first and then sorting out the other
minute, intricate details later..

Regards,
Nikhils

>
> First, I will add syntax for CREATE TABLE, ALTER TABLE ADD/DROP PARTITION.
> The syntax is borrowed from from Oracle and MySQL. Their characteristics
> are using "LESS THAN" in range partitioning. The keyword "PARTITION" is
> added to the full-reserved keyword list to support ADD/DROP PARTITION.
>
> Those syntax is merely a syntax sugar for INHERITS with CHECK. Declarations
> are translated into CHECK constraints. I have a plan to adjust pg_dump to
> dump definitions of partitioning in the correct format, but the actual
> implementation will be still based on constraint exclusion. In addition,
> hash partitioning is not implemented; syntax is parsed but "not implemented"
> error are raised for now.
>
> Here is syntax I propose:
> ----
> ALTER TABLE table_name ADD PARTITION name ...;
> ALTER TABLE table_name DROP PARTITION [IF EXISTS] name [CASCADE | RESTRICT];
>
> Range partitioning:
>  CREATE TABLE table_name ( columns )
>    PARTITION BY RANGE ( a_expr )
>    (
>      PARTITION name VALUES LESS THAN [(] const [)],
>      PARTITION name VALUES LESS THAN [(] MAXVALUE [)] -- overflow partition
>    );
>
> List partitioning:
>  CREATE TABLE table_name ( columns )
>    PARTITION BY LIST ( a_expr )
>    (
>      PARTITION name VALUES [IN] ( const [, ...] ),
>      PARTITION name VALUES [IN] [(] DEFAULT [)]       -- overflow partition
>    );
>
> Hash partitioning:
>  CREATE TABLE table_name ( columns )
>    PARTITION BY HASH ( a_expr )
>    PARTITIONS num_partitions;
>
>  CREATE TABLE table_name ( columns )
>    PARTITION BY HASH ( a_expr )
>    (
>      PARTITION name,
>      ...
>    );
>
> Note:
>  * Each partition can have optional WITH (...) and TABLESPACE clauses.
>  * '(' and ')' are optional to support both Oracle and MySQL syntax.
> ----
>
> Comments welcome.
>
> Regards,
> ---
> ITAGAKI Takahiro
> NTT Open Source Software Center
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>



--
http://www.enterprisedb.com

Re: Syntax for partitioning

From

Grzegorz Jaskiewicz

Date:

29 October 2009, 10:56:50

On 29 Oct 2009, at 02:15, Itagaki Takahiro wrote:

> I'd like to improve partitioning feature in 8.5.
> Kedar-san's previous work is wonderful, but I cannot see any updated  
> patch.
> http://archives.postgresql.org/message-id/bd8134a40906080702s96c90a9q3bbb581b9bd0d5d7@mail.gmail.com
>
> So, I'll take over the work if there are no ones to do it.
> I'm thinking to add syntax support first. Table partitioning was
> proposed many times, but it is still not applied into core.
> The reason is it is too difficult to make perfect partitioning
> feature at once. I think syntax support is a good start.
>
> First, I will add syntax for CREATE TABLE, ALTER TABLE ADD/DROP  
> PARTITION.
> The syntax is borrowed from from Oracle and MySQL. Their  
> characteristics
> are using "LESS THAN" in range partitioning. The keyword "PARTITION"  
> is
> added to the full-reserved keyword list to support ADD/DROP PARTITION.
>
> Those syntax is merely a syntax sugar for INHERITS with CHECK.  
> Declarations
> are translated into CHECK constraints. I have a plan to adjust  
> pg_dump to
> dump definitions of partitioning in the correct format, but the actual
> implementation will be still based on constraint exclusion. In  
> addition,
> hash partitioning is not implemented; syntax is parsed but "not  
> implemented"
> error are raised for now.
>
> Here is syntax I propose:
> ----
> ALTER TABLE table_name ADD PARTITION name ...;
> ALTER TABLE table_name DROP PARTITION [IF EXISTS] name [CASCADE |  
> RESTRICT];
>
> Range partitioning:
>  CREATE TABLE table_name ( columns )
>    PARTITION BY RANGE ( a_expr )
>    (
>      PARTITION name VALUES LESS THAN [(] const [)],
>      PARTITION name VALUES LESS THAN [(] MAXVALUE [)] -- overflow  
> partition
>    );
>
> List partitioning:
>  CREATE TABLE table_name ( columns )
>    PARTITION BY LIST ( a_expr )
>    (
>      PARTITION name VALUES [IN] ( const [, ...] ),
>      PARTITION name VALUES [IN] [(] DEFAULT [)]       -- overflow  
> partition
>    );
>
> Hash partitioning:
>  CREATE TABLE table_name ( columns )
>    PARTITION BY HASH ( a_expr )
>    PARTITIONS num_partitions;
>
>  CREATE TABLE table_name ( columns )
>    PARTITION BY HASH ( a_expr )
>    (
>      PARTITION name,
>      ...
>    );
>
> Note:
>  * Each partition can have optional WITH (...) and TABLESPACE clauses.
>  * '(' and ')' are optional to support both Oracle and MySQL syntax.
> ----
>
> Comments welcome.

+1000

Thanks !

(most anticipated feature for 8.5, here, next to replication [well, I  
am interested in multi master, but that's not going to happen :P ] )

Re: Syntax for partitioning

From

Heikki Linnakangas

Date:

29 October 2009, 13:58:30

Itagaki Takahiro wrote:
> The keyword "PARTITION" is
> added to the full-reserved keyword list to support ADD/DROP PARTITION.

Any chance to avoid that? PARTITION seems like something people might
well use as a column or variable name. OTOH, it is reserved in SQL2008
and SQL2003.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com

Re: Syntax for partitioning

From

Greg Stark

Date:

29 October 2009, 14:33:52

On Thu, Oct 29, 2009 at 3:35 AM, Nikhil Sontakke
<nikhil.sontakke@enterprisedb.com> wrote:
> So +1 on solidifying the syntax first and then sorting out the other
> minute, intricate details later..

I like that idea as well but I have a concern. What will we do with
pg_dump. If the PARTITION commands are just syntactic sugar for
creating constraints and inherited tables then pg_dump will have to
generate the more generic commands for those objects. When we
eventually have real partitioning then restoring such a dump will not
create real partitions, just inherited tables. Perhaps we need some
kind of option to reverse-engineer partitioning commands from the
inheritance structure,  but I fear having pg_dump reverse engineer
inherited tables to produce partitioning commands will be too hard and
error-prone. Hopefully that's too pessimistic though, if they were
produced by PARTITION commands they should be pretty regular.

-- 
greg

Re: Syntax for partitioning

From

Andres Freund

Date:

29 October 2009, 14:58:06

On Thursday 29 October 2009 18:33:22 Greg Stark wrote:
> On Thu, Oct 29, 2009 at 3:35 AM, Nikhil Sontakke
> 
> <nikhil.sontakke@enterprisedb.com> wrote:
> > So +1 on solidifying the syntax first and then sorting out the other
> > minute, intricate details later..
> 
> I like that idea as well but I have a concern. What will we do with
> pg_dump. If the PARTITION commands are just syntactic sugar for
> creating constraints and inherited tables then pg_dump will have to
> generate the more generic commands for those objects. When we
> eventually have real partitioning then restoring such a dump will not
> create real partitions, just inherited tables. Perhaps we need some
> kind of option to reverse-engineer partitioning commands from the
> inheritance structure,  but I fear having pg_dump reverse engineer
> inherited tables to produce partitioning commands will be too hard and
> error-prone. Hopefully that's too pessimistic though, if they were
> produced by PARTITION commands they should be pretty regular.
One could have a system catalog containing the partitioning information and 
generate the constraints et al. from that and mark them in pg_depend...


Andres

Re: Syntax for partitioning

From

Peter Eisentraut

Date:

29 October 2009, 19:11:46

On tor, 2009-10-29 at 11:15 +0900, Itagaki Takahiro wrote:
> Range partitioning:
>   CREATE TABLE table_name ( columns )
>     PARTITION BY RANGE ( a_expr )
>     (
>       PARTITION name VALUES LESS THAN [(] const [)],
>       PARTITION name VALUES LESS THAN [(] MAXVALUE [)] -- overflow partition
>     );

Maybe this needs to mention the actual operator name instead of LESS
THAN, in case the operator is not named < or the user wants to use a
different one.

> Hash partitioning:
>   CREATE TABLE table_name ( columns )
>     PARTITION BY HASH ( a_expr )
>     PARTITIONS num_partitions;
> 
>   CREATE TABLE table_name ( columns )
>     PARTITION BY HASH ( a_expr )
>     (
>       PARTITION name,
>       ...
>     );

Unless someone comes up with a maintenance plan for stable hash
functions, we should probably not dare look into this yet.

Re: Syntax for partitioning

From

Jeff Davis

Date:

29 October 2009, 19:19:55

On Fri, 2009-10-30 at 00:10 +0200, Peter Eisentraut wrote:
> On tor, 2009-10-29 at 11:15 +0900, Itagaki Takahiro wrote:
> > Range partitioning:
> >   CREATE TABLE table_name ( columns )
> >     PARTITION BY RANGE ( a_expr )
> >     (
> >       PARTITION name VALUES LESS THAN [(] const [)],
> >       PARTITION name VALUES LESS THAN [(] MAXVALUE [)] -- overflow partition
> >     );
> 
> Maybe this needs to mention the actual operator name instead of LESS
> THAN, in case the operator is not named < or the user wants to use a
> different one.

I can't help but wonder if the PERIOD type might be better for
representing a partition range. It would make it easier to express and
enforce the constraint that no two partition ranges overlap ;)

Regards,Jeff Davis

Re: Syntax for partitioning

From

Itagaki Takahiro

Date:

29 October 2009, 22:50:49

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:

> > The keyword "PARTITION" is
> > added to the full-reserved keyword list to support ADD/DROP PARTITION.
> 
> Any chance to avoid that? PARTITION seems like something people might
> well use as a column or variable name. OTOH, it is reserved in SQL2008
> and SQL2003.

CREATE TABLE does not require PARTITION to be a reserved keyword,
but there are conflicts in ALTER TABLE ADD/DROP PARTITION:
 * ALTER TABLE ... DROP [COLUMN]  name [CASCADE | RESTRICT] * ALTER TABLE ... DROP PARTITION name [CASCADE | RESTRICT]

There are some solutions:
1. Change COLUMN not to an optional word (unlikely)2. Change syntax of DROP PARTITION to DROP TABLE PARITION or so3.
ChangeALTER TABLE ADD/DROP PARTITION to top level      => CREATE/DROP PARTITION name ON table_name

Any better ideas?

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

Re: Syntax for partitioning

From

Robert Haas

Date:

29 October 2009, 23:14:25

On Thu, Oct 29, 2009 at 9:51 PM, Itagaki Takahiro
<itagaki.takahiro@oss.ntt.co.jp> wrote:
>
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
>
>> > The keyword "PARTITION" is
>> > added to the full-reserved keyword list to support ADD/DROP PARTITION.
>>
>> Any chance to avoid that? PARTITION seems like something people might
>> well use as a column or variable name. OTOH, it is reserved in SQL2008
>> and SQL2003.
>
> CREATE TABLE does not require PARTITION to be a reserved keyword,
> but there are conflicts in ALTER TABLE ADD/DROP PARTITION:
>
>  * ALTER TABLE ... DROP [COLUMN]  name [CASCADE | RESTRICT]
>  * ALTER TABLE ... DROP PARTITION name [CASCADE | RESTRICT]
>
> There are some solutions:
>
>  1. Change COLUMN not to an optional word (unlikely)
>  2. Change syntax of DROP PARTITION to DROP TABLE PARITION or so
>  3. Change ALTER TABLE ADD/DROP PARTITION to top level
>      => CREATE/DROP PARTITION name ON table_name
>
> Any better ideas?

I'm not sure if this is better, but what about:

CREATE PARTITION name ON TABLE name
DROP PARTITION name

Since partitions will live in pg_class and are in some sense "top
level" objects, it seems like it would make sense to use a syntax that
is similar to the one we use for indices...  we can't say "DROP COLUMN
name", because the table must be specified.  But a partition name must
be unambiguous, so making the user write it out explicitly doesn't
seem friendly.

...Robert

Re: Syntax for partitioning

From

Itagaki Takahiro

Date:

30 October 2009, 00:03:01

Peter Eisentraut <peter_e@gmx.net> wrote:

> On tor, 2009-10-29 at 11:15 +0900, Itagaki Takahiro wrote:
> > Range partitioning:
> >   CREATE TABLE table_name ( columns )
> >     PARTITION BY RANGE ( a_expr )
> >     (
> >       PARTITION name VALUES LESS THAN [(] const [)],
> >       PARTITION name VALUES LESS THAN [(] MAXVALUE [)] -- overflow partition
> >     );
> 
> Maybe this needs to mention the actual operator name instead of LESS
> THAN, in case the operator is not named < or the user wants to use a
> different one.

How about to use "sortby" or "index_elem" here?
   PARTITION BY RANGE '(' sortby-or-index_elem ')' '(' RangePartitions ')'
   sortby:       a_expr USING qual_all_Op opt_nulls_order     | a_expr opt_asc_desc      opt_nulls_order
   index_elem:       ColId          opt_class opt_asc_desc opt_nulls_order     | func_expr      opt_class opt_asc_desc
opt_nulls_order    | '(' a_expr ')' opt_class opt_asc_desc opt_nulls_order

We should allow only btree operator class here because we need to
extract GREATER-THAN-OR-EQUAL operator from LESS THAN. In addition,
we will be able to optimize parition search in the future if we
restrict a range partition key should be comparable scalar value.

Multidimensional partitioning will be implemented with another
approach, something like "PARTITION BY GIST", because it would
require different oprimization from range partitioning.
BTW, "PARTITION BY <pg_am.amname>" crossed my mind here,
but it is not well-investigated yet.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

Re: Syntax for partitioning

From

Greg Stark

Date:

30 October 2009, 01:20:53

On Thu, Oct 29, 2009 at 7:14 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> CREATE TABLE does not require PARTITION to be a reserved keyword,
>> but there are conflicts in ALTER TABLE ADD/DROP PARTITION:
>>
>>  * ALTER TABLE ... DROP [COLUMN]  name [CASCADE | RESTRICT]
>>  * ALTER TABLE ... DROP PARTITION name [CASCADE | RESTRICT]
>>
>> There are some solutions:

Do we need a DROP PARTITION command at all? What would it even do?
Drop the partition from the parent table and throw it away in one
step? I think in actual practice people usually remove the partition
from the parent table first, then do things like archive it before
actually throwing it away.

--
greg

Re: Syntax for partitioning

From

Itagaki Takahiro

Date:

30 October 2009, 02:07:05

Greg Stark <gsstark@mit.edu> wrote:

> >> * ALTER TABLE ... DROP [COLUMN]  name [CASCADE | RESTRICT]
> >> * ALTER TABLE ... DROP PARTITION name [CASCADE | RESTRICT]
> 
> Do we need a DROP PARTITION command at all? What would it even do?

Currently no. So, it would be good to treat PARTITION as
just a synonym of TABLE. Not only DROP PARTITION but also
ALTER PARTITION will work.

> CREATE PARTITION name ON table_name
> DROP PARTITION name ALTER PARTITION name ...

We might need to specify partition keys with another syntax.
ALTER TABLE will have only one new command "PARTITION BY".
and we reuse TABLE command for PARTITION in other operations.
 ALTER TABLE table_name PARTITION BY RANGE (expr) (...)

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

Re: Syntax for partitioning

From

Peter Eisentraut

Date:

30 October 2009, 05:06:17

On Thu, 2009-10-29 at 15:19 -0700, Jeff Davis wrote:
> I can't help but wonder if the PERIOD type might be better for
> representing a partition range. It would make it easier to express and
> enforce the constraint that no two partition ranges overlap ;)

I can't help but wonder if the period type might better be a generic
container for pairs of scalar, totally-ordered types.

Re: Syntax for partitioning

From

Devrim GÜNDÜZ

Date:

30 October 2009, 11:17:47

On Thu, 2009-10-29 at 11:15 +0900, Itagaki Takahiro wrote:
> I'd like to improve partitioning feature in 8.5.

Nice.

> Here is syntax I propose:

<snip>

Is this the same as / similar to Oracle's syntax?

IIRC Nikhil's patch was Oracle's syntax, and I prefer having that one
instead of inventing our own wheel.

Regards,
--
Devrim GÜNDÜZ, RHCE
Command Prompt - http://www.CommandPrompt.com
devrim~gunduz.org, devrim~PostgreSQL.org, devrim.gunduz~linux.org.tr
http://www.gunduz.org  Twitter: http://twitter.com/devrimgunduz

Re: Syntax for partitioning

From

Jeff Davis

Date:

30 October 2009, 13:18:20

On Fri, 2009-10-30 at 10:03 +0200, Peter Eisentraut wrote:
> I can't help but wonder if the period type might better be a generic
> container for pairs of scalar, totally-ordered types.

That would be ideal. However, it doesn't really look like our type
system was built to handle that kind of thing.

We could use typmod, I suppose, but even that won't hold a full Oid. Any
ideas/suggestions?

Regards,Jeff Davis

Re: Syntax for partitioning

From

pg@thetdh.com

Date:

30 October 2009, 13:43:09

>> PARTITION BY RANGE ( a_expr )
>> ...
>> PARTITION BY HASH ( a_expr )
>> PARTITIONS num_partitions;

> Unless someone comes up with a maintenance plan for stable hash functions, we should probably not dare look into this yet.

What would cover the common use case of per-day quals and drops over an extended history period, say six or nine months? You don't get quite the same locality of reference, generally, with an unpartitioned table, due to slop in the arrival of rows. Ideally, you don't want to depend on an administrator, or even an administrative script, to continually intervene in the structure of a table, as would be the case with partitioning by range, and you don't want to coalesce multiple dates, as an arbitrary hash might do. What the administrator would want would be to decide what rows were too old to keep, then process (e.g. archive, summarize, filter) and delete them.

Suppose that the number of partitions were taken as a hint rather than as a naming modulus, and that any quasi-hash function had to be specified explicitly (although storage assignment could be based on a hash of the quasi-hash output). If a_expr were allowed to include a to-date conversion of a timestamp, day-by-day partitioning would fall out naturally. If, in addition, single-parameter (?) functions were characterized as range-preserving and order-preserving, plan generation could be improved for time ranges on quasi-hash-partitioned tables, without a formal indexing requirement.

There are cases where additional partition dimensions would be useful, for eventual parallelized operation on large databases, and randomizing quasi-hash functions would help. IMHO stability is not needed, except to the extent that hash functions have properties that lend themselves to plan generation and/or table maintenance.

It is not clear to me what purpose there would be in dropping a partition. This would be tantamount to deleting all of the rows in a partition, if it were analogous to dropping a table, and would require some sort of compensatory aggregation of existing partitions (in effect, a second partitioning dimension), if it were merely structural.

Perhaps I'm missing something here.

David Hudson

Re: Syntax for partitioning

From

Heikki Linnakangas

Date:

30 October 2009, 14:12:28

Jeff Davis wrote:
> On Fri, 2009-10-30 at 10:03 +0200, Peter Eisentraut wrote:
>> I can't help but wonder if the period type might better be a generic
>> container for pairs of scalar, totally-ordered types.
> 
> That would be ideal. However, it doesn't really look like our type
> system was built to handle that kind of thing.
> 
> We could use typmod, I suppose, but even that won't hold a full Oid. Any
> ideas/suggestions?

Wait, it doesn't? A typmod is a 32-bit integer, like Oids. Am I missing
something?

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com

Re: Syntax for partitioning

From

Jeff Davis

Date:

30 October 2009, 15:14:38

On Fri, 2009-10-30 at 19:12 +0200, Heikki Linnakangas wrote:
> Wait, it doesn't? A typmod is a 32-bit integer, like Oids. Am I missing
> something?

Oid is unsigned, typmod is signed. We might be able to get away with it,
but -1 is treated specially in some places outside of the type-specific
functions, e.g. exprTypmod().

I haven't looked at all of these places yet, so maybe a few simple
changes would allow us to treat typmod as a full 32 bits. Or perhaps it
could just be expanded to a signed 64-bit int. What do you think?

Regards,Jeff Davis

Re: Syntax for partitioning

From

Stephen Frost

Date:

30 October 2009, 17:52:01

* Jeff Davis (pgsql@j-davis.com) wrote:
> Oid is unsigned, typmod is signed. We might be able to get away with it,
> but -1 is treated specially in some places outside of the type-specific
> functions, e.g. exprTypmod().

Any negative result returned from the input handling function is
considered an error, as I recall.  It's more than just '-1'..

> I haven't looked at all of these places yet, so maybe a few simple
> changes would allow us to treat typmod as a full 32 bits. Or perhaps it
> could just be expanded to a signed 64-bit int. What do you think?

That was shot down previously due to the way typmods are passed around
currently..  Not that it wouldn't be really nice..
Thanks,
    Stephen

Re: Syntax for partitioning

From

Tom Lane

Date:

30 October 2009, 18:16:33

Jeff Davis <pgsql@j-davis.com> writes:
> I haven't looked at all of these places yet, so maybe a few simple
> changes would allow us to treat typmod as a full 32 bits. Or perhaps it
> could just be expanded to a signed 64-bit int. What do you think?

Neither is likely to happen, and even disregarding that, I doubt people
would be real happy with a design like this.  Where are you going to
put the typmod for the contained type?
        regards, tom lane

Re: Syntax for partitioning

From

Robert Haas

Date:

30 October 2009, 18:40:05

On Fri, Oct 30, 2009 at 5:16 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jeff Davis <pgsql@j-davis.com> writes:
>> I haven't looked at all of these places yet, so maybe a few simple
>> changes would allow us to treat typmod as a full 32 bits. Or perhaps it
>> could just be expanded to a signed 64-bit int. What do you think?
>
> Neither is likely to happen, and even disregarding that, I doubt people
> would be real happy with a design like this.  Where are you going to
> put the typmod for the contained type?

IMO, the real problem is that the type interface is poorly
encapsulated.  There's way too much code that knows about the internal
details of a type - namely, that it's a 32-bit integer modified by a
second 32-bit integer.  I think there are still places where the code
doesn't even know about typmod.  If we're going to go to the trouble
of changing anything, I think it should probably involve inserting an
abstraction layer that will make future extensions easier.  But I have
a feeling that's going to be a tough sell.

...Robert

Re: Syntax for partitioning

From

Jeff Davis

Date:

30 October 2009, 19:28:42

On Fri, 2009-10-30 at 17:39 -0400, Robert Haas wrote:
> IMO, the real problem is that the type interface is poorly
> encapsulated.  There's way too much code that knows about the internal
> details of a type - namely, that it's a 32-bit integer modified by a
> second 32-bit integer.  I think there are still places where the code
> doesn't even know about typmod.  If we're going to go to the trouble
> of changing anything, I think it should probably involve inserting an
> abstraction layer that will make future extensions easier.  But I have
> a feeling that's going to be a tough sell.

Yeah. We're way off topic for partitioning, so I think it's best to just
table this discussion until someone comes up with a good idea.

It's not the end of the world to write some generic C code, and have
multiple types make use of it, e.g. PERIOD, PERIODTZ, INT4RANGE,
FLOAT8RANGE, etc. It's a little redundant and creates some catalog
bloat, but I'm not too concerned about it right now. Certainly not
enough to rewrite the type system.

Regards,Jeff Davis

Re: Syntax for partitioning

From

Tom Lane

Date:

30 October 2009, 19:44:49

Robert Haas <robertmhaas@gmail.com> writes:
> IMO, the real problem is that the type interface is poorly
> encapsulated.  There's way too much code that knows about the internal
> details of a type - namely, that it's a 32-bit integer modified by a
> second 32-bit integer.  I think there are still places where the code
> doesn't even know about typmod.  If we're going to go to the trouble
> of changing anything, I think it should probably involve inserting an
> abstraction layer that will make future extensions easier.  But I have
> a feeling that's going to be a tough sell.

Yup, you're right.  It would be an enormous amount of work and break a
lot of third-party code, for largely hypothetical future benefits.
We've got better places to invest our limited manpower.
        regards, tom lane

Re: Syntax for partitioning

From

Itagaki Takahiro

Date:

01 November 2009, 20:12:17

Devrim GNDZ <devrim@gunduz.org> wrote:

> Is this the same as / similar to Oracle's syntax? 

Yes.

> IIRC Nikhil's patch was Oracle's syntax

No. See:
http://download.oracle.com/docs/cd/B28359_01/server.111/b28286/statements_7002.htm#i2125922

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

Re: Syntax for partitioning

From

Nikhil Sontakke

Date:

02 November 2009, 02:02:21

Hi,

>> Is this the same as / similar to Oracle's syntax?
>
> Yes.
>
>> IIRC Nikhil's patch was Oracle's syntax
>
> No. See:
> http://download.oracle.com/docs/cd/B28359_01/server.111/b28286/statements_7002.htm#i2125922
>

Any decent prevalent partitioning syntax should be ok IMHO. IIRC,
MySQL paritioning syntax is also pretty similar to Oracle's.

Regards,
Nikhils
-- 
http://www.enterprisedb.com

Re: Syntax for partitioning

From

Itagaki Takahiro

Date:

02 November 2009, 02:50:32

Here are details of partitioning syntax.

-----------------
 Syntax overview
-----------------
Partitions are defined with 3 steps:
 1. Create a plain table as parent.
 2. Set a partition key to a table.
 3. Add a partition to a table which has a partition key.

i.e.,
    CREATE TABLE table (...) PARTITION BY { RANGE | LIST } ( key ) (...)
is just an abbreviated form of:
 1. CREATE TABLE table (...);
 2. ALTER TABLE table PARTITION BY { RANGE | LIST } ( key );
 3. CREATE PARTITION name ON table VALUES ...;

Currently RANGE and LIST partitions are supported.
No reserved keywords are required by the syntax, and that's why
ALTER TABLE ADD PARTITION cannot be used here instead of CREATE PARTITION.

HASH partitions are not supported, but we can use LIST paritions with
an expression key as incomplete HASH partitions:
    CREATE TABLE table (...) PARTITION BY LIST ( hashtext(attr) ) (...);
    SELECT * FROM table WHERE hashtext(attr) = hashtext('search_key');

------------------------------------------
 Features *NOT* included in this proposal
------------------------------------------
To simplify patch, the following features are not included:
 - Partition triggers to dispatch rows inserted into parent table
 - Expanding some commands for a parent to partitions (ex. VACUUM)
 - ALTER commands except RENAME (ex. MERGE, SPLIT, UPDATE)
 - Ability to add an existing table to a parent as a partition
    (ex. ALTER TABLE table INHERIT parent AS PARTITION)
 - Planner and Executor improvements

I have plans to implement some of them in separated patches, but not now.

-----------------
 Catalog changes
-----------------
A new system catalog "pg_partition" represents partition keys for each
table. A parent table of partitions has only one pg_partition row.
I think separated pg_partition table is better than adding these columns
to pg_class, but it might be debatable.

CREATE TABLE pg_partition (
    partrelid oid REFERENCES oid ON pg_class,    -- partitioned table oid
    partopr   oid REFERENCES oid ON pg_operator, -- operator to comapre keys
    partkind  "char", -- kind of partition: 'R' (range) or 'L' (list)
    partkey   text,   -- expression tree of partition key
    PRIMARY KEY (partrelid)
) WITHOUT OIDS;

In addition, we would need to store threshold values of child tables
somewhere, but under consideration. I'm thinking to extract upper
and lower bounds from CHECK constraint, but it might be unreliable.
Comments and ideas welcome.

----------------
 Syntax details
----------------
CREATE TABLE table (...) PARTITION BY RANGE ( expr [USING operator] )
[ (
    PARTITION name VALUES LESS THAN [(] upper [)],
    PARTITION name VALUES LESS THAN [(] MAXVALUE [)] -- overflow partition
  ) ] ;
-- default operator is '<' for range partitions

CREATE TABLE table (...) PARTITION BY LIST ( expr [USING operator] )
[ (
    PARTITION name VALUES [IN] ( values ),
    PARTITION name VALUES [IN] [(] DEFAULT [)] -- overflow partition
  ) ] ;
-- default operator is '=' for list partitions

ALTER TABLE table PARTITION BY { RANGE | LIST } ... ;
ALTER TABLE table NO PARTITION; -- drop partition key

CREATE PARTITION partition ON table VALUES LESS THAN ...; -- range
CREATE PARTITION partition ON table VALUES IN (...);      -- list

DROP PARTITION partition;                 -- synonym for DROP TABLE
ALTER PARTITION partition RENAME TO name; -- synonym for ALTER TABLE RENAME

Note:
  * Each partition can have optional WITH (...) and TABLESPACE clauses.
  * '(' and ')' are optional to support both Oracle and MySQL syntax.

-----------
 WIP patch
-----------
The attached partitioning_20091102.patch is a WIP patch. There are
still not implemented features marked with TODO tags, but I'll use
this design -- especially Node manipulations.
Please notice me if I'm missing something.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

Attachment

partitioning_20091102.patch

Re: Syntax for partitioning

From

Itagaki Takahiro

Date:

05 November 2009, 07:43:15

Here is a WIP partitioning patch. The new syntax are:
  1. CREATE TABLE parent (...);
  2. ALTER TABLE parent PARTITION BY { RANGE | LIST } ( key );
  3. CREATE TABLE child (...);
  4. ALTER TABLE child INHERIT parent AS PARTITION VALUES ...;

We can also use "CREATE TABLE PARTITION BY" as 1+2+3+4 and
"CREATE PARTITION" as 3+4. I think "INHERIT AS PARTITION" is rarely
used typically, but such orthogonality seems to be cleaner.

The most complex logic of the patch is in ATExecAddInherit(). It scans
existing partitions and generate CHECK constraint for the new partition.

Any comments to the design?  If no objections, I'd like to stop adding
features in this CommitFest and go for remaining auxiliary works
-- pg_dump, object dependency checking, documentation, etc.

> -----------------
>  Catalog changes
> -----------------
In addition to pg_partition, I added pg_inherits.inhvalues field.
The type of field is "anyarray" and store partition values.
For range partition, an upper bound value is stored in the array.
For list partition, list values are stored in it. These separated
value fields will be useful to implement partition triggers in the
future. In contrast, reverse engineering of check constraints is messy.

CATALOG(pg_inherits,2611) BKI_WITHOUT_OIDS
{
    Oid            inhrelid;
    Oid            inhparent;
    int4        inhseqno;
    anyarray    inhvalues;    /* values for partition */
} FormData_pg_inherits;

> CREATE TABLE pg_partition (
>     partrelid oid REFERENCES oid ON pg_class,    -- partitioned table oid
>     partopr   oid REFERENCES oid ON pg_operator, -- operator to compare keys
>     partkind  "char", -- kind of partition: 'R' (range) or 'L' (list)
>     partkey   text,   -- expression tree of partition key
>     PRIMARY KEY (partrelid)
> ) WITHOUT OIDS;

------------------------------
 Limitations and Restrictions
------------------------------
* We can create a new partition as long as partitioning keys
  are not conflicted with existing partitions. Especially,
  we cannot add any partitions if we have overflow partitions
  because a new partition always split the overflow partition.

* We cannot reuse an existing check constraint as a partition
  constraint. ALTER TABLE INHERIT AS PARTITION brings on
  a table scan to add a new CHECK constraint.

* No partition triggers nor planner and executor improvements.
  It would come in the future development.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

Attachment

partitioning_20091105.patch

Re: Syntax for partitioning

From

Itagaki Takahiro

Date:

12 November 2009, 06:54:25

I added psql and pg_dump support to Partitioning Syntax patch.
Paritioning information can be retrieved with a new system function
pg_get_partitiondef(parentRelid). Both psql and pg_dump use it.

There are some changes from the last patch.
Some of them seem to be a bit ugly. Ideas welcome.

  * If a table with the same name already exists when a partition
    is created, the table is re-used as partition. This behavior
    is required for pg_dump to be simple.

  * Don't create a new check constraint when a table is attached as
    partition to a parent table if the child table has constraints
    with the same definition. This behavior is required for pg_dump
    not to add duplicated check constraints in repeated dump and restore.

  * Inheritance is used for partitions, but pg_dump doesn't dump them as
    inheritance; It dump a child table without inheritance first, and
    re-add inheritance with ALTER TABLE PARTITION BY. PartitionInfo is
    added as a DumpableObject in pg_dump.

  * Dependencies of objects are managed with existing depencency manager
    except a check constraint to partition values. Partition constraints
    can be dropped even if the table is still in the partitioning set.

A patch attached, and I'll summarize it:

==== Syntax ====
CREATE TABLE parent (...)
    PARTITION BY { RANGE | LIST } ( key [ USING oprator ] )
    ( <partitions> );
ALTER TABLE parent PARTITION BY { RANGE | LIST } ...;
CREATE PARTITION partition ON parent VALUES ...;
<partitions> :=
      PARTITION name VALUES LESS THAN { range_upper | MAXVALUE }
    | PARTITION name VALUES IN ( list_value [,...] | DEFAULT )

==== System Catalog ====
CREATE TABLE pg_partition (
    partrelid oid UNIQUE REFARENCES pg_class(oid),
    partopr   oid        REFARENCES pg_operatoroid),
    partkind  "char",    -- 'R':RANGE or 'L':LIST
    partkey   text       -- node dump of the partition key
) WITHOUT OIDS;

CREATE TABLE pg_inherits (
    inhrelid  oid,
    inhparent oid,
    inhseqno  integer,
+   inhvalues anyarray -- Non-null if the inheritance is for partitioning.
) WITHOUT OIDS;

==== Sample output from psql ====
=# \d sales_range
               Table "public.sales_range"
    Column     |            Type             | Modifiers
---------------+-----------------------------+-----------
 salesman_id   | numeric(5,0)                |
 salesman_name | character varying(30)       |
 sales_state   | character varying(20)       |
 sales_date    | timestamp without time zone |
Partitions: PARTITION BY RANGE ( sales_date USING < )
(
    PARTITION sales_2006 VALUES LESS THAN '2007-01-01 00:00:00',
    ...
)

==== Sample output from pg_dump ====
CREATE TABLE sales_range (...);
CREATE TABLE sales_2006 (...); -- without inheritance
ALTER TABLE public.sales_range PARTITION BY RANGE ( sales_date USING < )
(
    PARTITION sales_2006 VALUES LESS THAN '2007-01-01 00:00:00',
    ...
);

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

Attachment

partitioning_20091112.patch

Re: Syntax for partitioning

From

Marko Tiikkaja

Date:

17 November 2009, 15:33:36

Hi,

I'm reviewing your patch.  The patch applies without problems and the
feature works as advertised.  I have yet to look at the code in detail,
but it looks sane and seems to work.  However, this looks like a mistake:

partinfo = (PartitionInfo *) malloc(ntups * sizeof(PartitionInfo));

or am I missing something?


The syntax itself seems a bit weird in some cases.  Say you have:
PARTITION BY RANGE ( foo USING > )
(PARTITION bar VALUES LESS THAN 0
);

which translates to CHECK (bar > 0).  That doesn't sound at all like
LESS THAN to me.  This syntax seems to be the same Oracle uses, and I
think it's nice for the general case, but I think the reversed operator
weirdness is a bit too much.  Maybe we should use something like

PARTITION bar VALUES OPERATOR 0

when the user specifies the operator?


Regards,
Marko Tiikkaja

Re: Syntax for partitioning

From

Jaime Casanova

Date:

17 November 2009, 15:46:32

On Thu, Nov 12, 2009 at 5:54 AM, Itagaki Takahiro
<itagaki.takahiro@oss.ntt.co.jp> wrote:
> I added psql and pg_dump support to Partitioning Syntax patch.
> Paritioning information can be retrieved with a new system function
> pg_get_partitiondef(parentRelid). Both psql and pg_dump use it.
>

i haven't seen the patch but:

>
>  * If a table with the same name already exists when a partition
>    is created, the table is re-used as partition. This behavior
>    is required for pg_dump to be simple.
>

i guess the table must be empty, if not we should be throw an error...
and i actually prefer some more explicit syntax for this not just
reusing a table

>      PARTITION name VALUES LESS THAN { range_upper | MAXVALUE }
>    | PARTITION name VALUES IN ( list_value [,...] | DEFAULT )
>

i remember someone making a comment about actually using operators
instead of LESS THEN and family

--
Atentamente,
Jaime Casanova
Soporte y capacitación de PostgreSQL
Asesoría y desarrollo de sistemas
Guayaquil - Ecuador
Cel. +59387171157

Re: Syntax for partitioning

From

Marko Tiikkaja

Date:

17 November 2009, 16:09:54

Jaime Casanova wrote:
>>  * If a table with the same name already exists when a partition
>>    is created, the table is re-used as partition. This behavior
>>    is required for pg_dump to be simple.
>>
> 
> i guess the table must be empty, if not we should be throw an error...
> and i actually prefer some more explicit syntax for this not just
> reusing a table

I'd be OK with only a notification - even if the table wasn't empty -,
similar to how inheritance combines rows currently.  The patch currently
silently reuses the table unless it has rows which don't satisfy the
CHECK constraint, in which case it gives you the default CHECK
constraint error.

>>      PARTITION name VALUES LESS THAN { range_upper | MAXVALUE }
>>    | PARTITION name VALUES IN ( list_value [,...] | DEFAULT )
>>
> 
> i remember someone making a comment about actually using operators
> instead of LESS THEN and family

That doesn't sound like a bad idea..

Regards,
Marko Tiikkaja

Re: Syntax for partitioning

From

Simon Riggs

Date:

17 November 2009, 18:35:35

On Thu, 2009-10-29 at 11:15 +0900, Itagaki Takahiro wrote:

> I think syntax support is a good start.

I don't see a syntax-only patch as being any use at all to this
community.

We go to enormous lengths in other areas to never allow patches with
restrictions. Why would we allow a patch that is essentially 100%
restriction? i.e. It does nothing at all. Worse than that, it will
encourage people to believe it exists in full, when that isn't the case.

The syntax has never really been in question, so it doesn't really move
us forwards in any direction. This is exactly the kind of shallow
feature we have always shied away from and that other databases have
encouraged.

The only reason I can see is that it allows people to develop non-open
source code that matches how Postgres will work when we get our act
together. That seems likely to discourage, rather than encourage the
funding of this work for open source. It may even muddy the water for
people that don't understand that the real magic happens in the
internals, not in the syntax.

Why not just wait until we have a whole patch and then apply?

-- Simon Riggs           www.2ndQuadrant.com

Re: Syntax for partitioning

From

Robert Haas

Date:

17 November 2009, 21:12:58

On Tue, Nov 17, 2009 at 4:31 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Thu, 2009-10-29 at 11:15 +0900, Itagaki Takahiro wrote:
>
>> I think syntax support is a good start.
>
> I don't see a syntax-only patch as being any use at all to this
> community.
>
> We go to enormous lengths in other areas to never allow patches with
> restrictions. Why would we allow a patch that is essentially 100%
> restriction? i.e. It does nothing at all. Worse than that, it will
> encourage people to believe it exists in full, when that isn't the case.
>
> The syntax has never really been in question, so it doesn't really move
> us forwards in any direction. This is exactly the kind of shallow
> feature we have always shied away from and that other databases have
> encouraged.
>
> The only reason I can see is that it allows people to develop non-open
> source code that matches how Postgres will work when we get our act
> together. That seems likely to discourage, rather than encourage the
> funding of this work for open source. It may even muddy the water for
> people that don't understand that the real magic happens in the
> internals, not in the syntax.
>
> Why not just wait until we have a whole patch and then apply?

Because big patches are really hard to get applied.  Personally, I
think a syntax-only patch makes a lot of sense, as long as the design
is carefully thought about so that it can serve as a foundation for
future work in this area.  I don't think "the whole patch" is even
necessarily a well-defined concept in this instance: different people
could have very different ideas about what would constitute a complete
solution, or which aspects of a complete solution are most important
or should be pursued first.  Settling on a syntax, and an internal
representation for that syntax, seems like it will make subsequent
discussions about those projects considerably more straightforward,
and it has some value in and of itself since similar notation is used
by other databases.

At least, that's MHO.

...Robert

Re: Syntax for partitioning

From

Itagaki Takahiro

Date:

18 November 2009, 00:17:25

Marko Tiikkaja <marko.tiikkaja@cs.helsinki.fi> wrote:

> Jaime Casanova wrote:
> >>      PARTITION name VALUES LESS THAN { range_upper | MAXVALUE }
> >>    | PARTITION name VALUES IN ( list_value [,...] | DEFAULT )
> > 
> > i remember someone making a comment about actually using operators
> > instead of LESS THEN and family
> 
> That doesn't sound like a bad idea..

I prefer to use widely-used syntax instead of postgres original one.
Oracle and MySQL already use "LESS THAN" and "IN" for partitioning.
I assume almost all user only use the default operators.
I don't want to break de facto standard for small utilization area.

I think truly what we want is a new partition "kind" in addition to
RANGE and LIST. If we want to split geometric data into paritions,
we need to treat the the partition key with gist-list operation.
I agree with a plan to add some additional parition kinds,
but want to keep RANGE and LIST partitions in the current syntax.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

Re: Syntax for partitioning

From

Itagaki Takahiro

Date:

18 November 2009, 00:24:21

Simon Riggs <simon@2ndQuadrant.com> wrote:

> Why not just wait until we have a whole patch and then apply?

"A whole patch" can be written by many contributers instead of only
one person, no?  I think we need to split works for partitioning
into serveral parts to encourage developing it. I just did one of
the parts, "syntax". Anothe patch "Partitioning option for COPY"
will do a good job in the field of "INSERT". 

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

Re: Syntax for partitioning

From

Itagaki Takahiro

Date:

18 November 2009, 00:35:49

Jaime Casanova <jcasanov@systemguards.com.ec> wrote:

> > * If a table with the same name already exists when a partition
> >   is created, the table is re-used as partition. This behavior
> >   is required for pg_dump to be simple.
> 
> i guess the table must be empty, if not we should be throw an error...
> and i actually prefer some more explicit syntax for this not just
> reusing a table

Yeah, an explicit syntax is better.
I've researched other syntax, but I cannot find any good ones.
* ALTER TABLE child INHERIT parent AS PARTITION   => implemenation "PARTITION is an INHERIT" is revealed to user.*
ALTERPARTITION child ATTACH TO parent   => child is not a partition yet at that point.* ALTER TABLE parent ADD
PARTITIONchild   => "partition" need to be a full-reserved word.

Are there better idea?

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

Re: Syntax for partitioning

From

Itagaki Takahiro

Date:

18 November 2009, 00:52:30

Marko Tiikkaja <marko.tiikkaja@cs.helsinki.fi> wrote:

> this looks like a mistake:
> partinfo = (PartitionInfo *) malloc(ntups * sizeof(PartitionInfo));

Oops, it should be "p"alloc. Thanks.

> Maybe we should use something like
> PARTITION bar VALUES OPERATOR 0
> when the user specifies the operator?

I think we could have reasonable restrictions to the operator
for future optimization. Is the VALUES OPERATOR syntax too freedom?

For the same reason, USING operator also might be too freedom.
RANGE (and maybe also LIST) partition keys should be sortable,
operator class name might be better to the option instead of
any operators. i.e., PARTITION BY RANGE ( foo [ USING operator ] )
should be:  PARTITION BY RANGE ( foo [ btree_ops_name ] )

If we do so, there will be no inconsistency in LESS THAN syntax
because btree_ops always have < operator.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

Re: Syntax for partitioning

From

Simon Riggs

Date:

18 November 2009, 05:04:12

On Wed, 2009-11-18 at 13:24 +0900, Itagaki Takahiro wrote:
> Simon Riggs <simon@2ndQuadrant.com> wrote:
> 
> > Why not just wait until we have a whole patch and then apply?
> 
> "A whole patch" can be written by many contributers instead of only
> one person, no?  I think we need to split works for partitioning
> into serveral parts to encourage developing it. I just did one of
> the parts, "syntax". Anothe patch "Partitioning option for COPY"
> will do a good job in the field of "INSERT". 

If we can agree the parts that are required, I would at least be
confident that we have understood this enough to allow one part to
proceed ahead of the others.

For partitioning the parts are these

1. Syntax for explicit partitioning
2. Internal data representations
3. Optimizations  many and various
4. Data Routing  a) Data routing on INSERT/COPY  b) UPDATE handling when the UPDATE causes partition migration

If this patch puts forward a solution for (2) also, then it is
potentially worthwhile. That is the real blocking point here. Once we
have that other people will quickly fill in the later parts.

I foresee a data structure that is a sorted list of boundary-values,
cached on the parent-relation. This should be accessible to allow
bsearch of particular values during both planning and execution. Same
rules apply as btree operator classes. For multi-level hierarchies the
parent level should have the union of all sub-hierarchies. I think we
need an index on pg_inherits also.

So please do (1) and (2), not just (1) in isolation.

-- Simon Riggs           www.2ndQuadrant.com

Re: Syntax for partitioning

From

Markus Wanner

Date:

19 November 2009, 10:59:16

Hi,

Robert Haas wrote:
> Settling on a syntax, and an internal representation for that syntax, 

I've been under the impression that this was only about syntax. What are 
the internal additions?

Generally speaking, I'd agree with Simon or even vote for doing the 
internals first and add the syntactic sugar only later on.

> seems like it will make subsequent
> discussions about those projects considerably more straightforward,

..or subsequent implementations more complicated, because you have to 
support an awkward syntax.

> and it has some value in and of itself since similar notation is used
> by other databases.

That point is well taken, but it would be more compelling if it were the 
same or at least a compatible syntax.

Regards

Markus Wanner

Re: Syntax for partitioning

From

Robert Haas

Date:

19 November 2009, 11:54:10

On Thu, Nov 19, 2009 at 9:58 AM, Markus Wanner <markus@bluegap.ch> wrote:
> Hi,
>
> Robert Haas wrote:
>>
>> Settling on a syntax, and an internal representation for that syntax,
>
> I've been under the impression that this was only about syntax. What are the
> internal additions?

I haven't looked at it in detail, but it adds a new pg_partition
table.  Whether that table is suitably structured for use by the
optimizer is not clear to me.

> Generally speaking, I'd agree with Simon or even vote for doing the
> internals first and add the syntactic sugar only later on.

That's not really possible in this case.  The internals consist of
taking advantage of the fact that we have explicit knowledge of how
the partitions are defined vs. just relying on the (slow) constraint
exclusion logic.  We can't do that unless, in fact, we have that
explicit knowledge, and that requires inventing syntax.

> That point is well taken, but it would be more compelling if it were the
> same or at least a compatible syntax.

There's been an effort to make it close, but I haven't followed it in
enough detail to know how close.

...Robert

Re: Syntax for partitioning

From

Peter Eisentraut

Date:

19 November 2009, 18:35:26

On ons, 2009-11-18 at 13:52 +0900, Itagaki Takahiro wrote:
> > partinfo = (PartitionInfo *) malloc(ntups * sizeof(PartitionInfo));
> 
> Oops, it should be "p"alloc. Thanks.

A very low-level comment:

1) Please stop casting the results of palloc and malloc.  We are not
writing C++ here.

2) I would prefer that you apply sizeof on the variable, not on the
type.  That way, the expression is independent of any type changes of
the variable, and can be reviewed without having to scroll around for
the variable definition.

So how about,

partinfo = palloc(ntups * sizeof(*partinfo));

Re: Syntax for partitioning

From

Nikhil Sontakke

Date:

20 November 2009, 01:45:26

Hi,

>> > partinfo = (PartitionInfo *) malloc(ntups * sizeof(PartitionInfo));
>
> 1) Please stop casting the results of palloc and malloc.  We are not
> writing C++ here.
>

I thought it was/is a good C programming practice to typecast (void *)
always to the returning structure type!!

Regards,
Nikhils

> 2) I would prefer that you apply sizeof on the variable, not on the
> type.  That way, the expression is independent of any type changes of
> the variable, and can be reviewed without having to scroll around for
> the variable definition.
>
> So how about,
>
> partinfo = palloc(ntups * sizeof(*partinfo));
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>



--
http://www.enterprisedb.com

Re: Syntax for partitioning

From

Peter Eisentraut

Date:

20 November 2009, 01:58:07

On fre, 2009-11-20 at 11:14 +0530, Nikhil Sontakke wrote:
> Hi,
> 
> >> > partinfo = (PartitionInfo *) malloc(ntups * sizeof(PartitionInfo));
> >
> > 1) Please stop casting the results of palloc and malloc.  We are not
> > writing C++ here.
> >
> 
> I thought it was/is a good C programming practice to typecast (void *)
> always to the returning structure type!!

This could be preferable if you use sizeof on the type, so that you have
an additional check that the receiving variable actually has that type.
But if you use sizeof on the variable itself, it's unnecessary: You just
declare the variable to be of some type earlier, and then the expression
allocates ntups of it, without having to repeat the type information.

> 
> Regards,
> Nikhils
> 
> > 2) I would prefer that you apply sizeof on the variable, not on the
> > type.  That way, the expression is independent of any type changes of
> > the variable, and can be reviewed without having to scroll around for
> > the variable definition.
> >
> > So how about,
> >
> > partinfo = palloc(ntups * sizeof(*partinfo));
> >
> >
> > --
> > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> > To make changes to your subscription:
> > http://www.postgresql.org/mailpref/pgsql-hackers
> >
> 
> 
> 
> -- 
> http://www.enterprisedb.com
>

Re: Syntax for partitioning

From

Tom Lane

Date:

20 November 2009, 02:19:20

Nikhil Sontakke <nikhil.sontakke@enterprisedb.com> writes:
>>> partinfo = (PartitionInfo *) malloc(ntups * sizeof(PartitionInfo));
>> 
>> 1) Please stop casting the results of palloc and malloc. �We are not
>> writing C++ here.

> I thought it was/is a good C programming practice to typecast (void *)
> always to the returning structure type!!

Yes.  The above is good style because it ensures that the variable
you're assigning the pointer to is the right type to match the sizeof
computation.  In C++ you'd use operator new instead and still have that
type-check without the cast, but indeed we are not writing C++ here.

The *real* bug in the quoted code is that it's using malloc.  There are
a few places in PG where it's appropriate to use malloc not palloc, but
pretty darn few.
        regards, tom lane

Re: Syntax for partitioning

From

Tom Lane

Date:

20 November 2009, 02:52:19

Peter Eisentraut <peter_e@gmx.net> writes:
> 2) I would prefer that you apply sizeof on the variable, not on the
> type.  That way, the expression is independent of any type changes of
> the variable, and can be reviewed without having to scroll around for
> the variable definition.

FWIW, I think the general project style has been the other way.
Yes, it means you write the type name three times not once, but
the other side of that coin is that it makes it more obvious what
is happening (and gives you an extra chance to realize that the
type you wrote is wrong ...)
        regards, tom lane

Re: Syntax for partitioning

From

Simon Riggs

Date:

20 November 2009, 03:08:33

On Thu, 2009-11-19 at 10:53 -0500, Robert Haas wrote:
> On Thu, Nov 19, 2009 at 9:58 AM, Markus Wanner <markus@bluegap.ch> wrote:
> > Hi,
> >
> > Robert Haas wrote:
> >>
> >> Settling on a syntax, and an internal representation for that syntax,
> >
> > I've been under the impression that this was only about syntax. What are the
> > internal additions?
> 
> I haven't looked at it in detail, but it adds a new pg_partition
> table.  Whether that table is suitably structured for use by the
> optimizer is not clear to me.

If it does, then my review comments to Kedar still apply: 

* why do we want another catalog table? what's wrong with pg_inherits?
It might need additional columns, and it certainly needs another index.

* We need an internal data structure (discussed on this thread also).
Leaving stuff in various catalog tables would not be the same thing at
all.

-- Simon Riggs           www.2ndQuadrant.com

Re: Syntax for partitioning

From

Robert Haas

Date:

20 November 2009, 12:00:01

On Fri, Nov 20, 2009 at 2:08 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Thu, 2009-11-19 at 10:53 -0500, Robert Haas wrote:
>> On Thu, Nov 19, 2009 at 9:58 AM, Markus Wanner <markus@bluegap.ch> wrote:
>> > Hi,
>> >
>> > Robert Haas wrote:
>> >>
>> >> Settling on a syntax, and an internal representation for that syntax,
>> >
>> > I've been under the impression that this was only about syntax. What are the
>> > internal additions?
>>
>> I haven't looked at it in detail, but it adds a new pg_partition
>> table.  Whether that table is suitably structured for use by the
>> optimizer is not clear to me.
>
> If it does, then my review comments to Kedar still apply:
>
> * why do we want another catalog table? what's wrong with pg_inherits?
> It might need additional columns, and it certainly needs another index.

That might work, I haven't looked at it enough to be sure one way or the other.

> * We need an internal data structure (discussed on this thread also).
> Leaving stuff in various catalog tables would not be the same thing at
> all.

Ultimately I'm guessing that for query optimization we'll need to
include the relevant info in the relcache.  But I think that can wait
until we're ready to actually make the optimizer changes - not much
point in caching data that is never used.  Right now I think it's
enough to verify (which I haven't) that the schema of the catalog
table is suitable for straightforward construction of the data that
will eventually need to be cached.

...Robert

Re: Syntax for partitioning

From

Emmanuel Cecchet

Date:

24 November 2009, 23:43:58

Hi,

Sorry for commenting only now but I think that we need to be able to 
store the partitions in different tablespaces. Even if originally the 
create table creates all partitions in the same tablespace, individual 
partitions should be allowed to be moved in different tablespaces using 
alter table or alter partition.  I think that other databases allows the 
user to define a tablespace for each partition in the create table 
statement.
In a warehouse, you might want to split your partitions on different 
volumes and over time, move older partitions to storage with higher 
compression if that data is not to be accessed frequently anymore. 
Altering tablespaces for partitions is important in that context.

Are you also planning to provide partitioning extensions to 'create 
table as'?

Thanks
Emmanuel

> Here is a WIP partitioning patch. The new syntax are:
>   1. CREATE TABLE parent (...);
>   2. ALTER TABLE parent PARTITION BY { RANGE | LIST } ( key );
>   3. CREATE TABLE child (...);
>   4. ALTER TABLE child INHERIT parent AS PARTITION VALUES ...;
>
> We can also use "CREATE TABLE PARTITION BY" as 1+2+3+4 and
> "CREATE PARTITION" as 3+4. I think "INHERIT AS PARTITION" is rarely
> used typically, but such orthogonality seems to be cleaner.
>
> The most complex logic of the patch is in ATExecAddInherit(). It scans
> existing partitions and generate CHECK constraint for the new partition.
>
> Any comments to the design?  If no objections, I'd like to stop adding
> features in this CommitFest and go for remaining auxiliary works
> -- pg_dump, object dependency checking, documentation, etc.
>
>   
>> -----------------
>>  Catalog changes
>> -----------------
>>     
> In addition to pg_partition, I added pg_inherits.inhvalues field.
> The type of field is "anyarray" and store partition values.
> For range partition, an upper bound value is stored in the array.
> For list partition, list values are stored in it. These separated
> value fields will be useful to implement partition triggers in the
> future. In contrast, reverse engineering of check constraints is messy.
>
> CATALOG(pg_inherits,2611) BKI_WITHOUT_OIDS
> {
>     Oid            inhrelid;
>     Oid            inhparent;
>     int4        inhseqno;
>     anyarray    inhvalues;    /* values for partition */
> } FormData_pg_inherits;
>
>   
>> CREATE TABLE pg_partition (
>>     partrelid oid REFERENCES oid ON pg_class,    -- partitioned table oid
>>     partopr   oid REFERENCES oid ON pg_operator, -- operator to compare keys
>>     partkind  "char", -- kind of partition: 'R' (range) or 'L' (list)
>>     partkey   text,   -- expression tree of partition key
>>     PRIMARY KEY (partrelid)
>> ) WITHOUT OIDS;
>>     
>
> ------------------------------
>  Limitations and Restrictions
> ------------------------------
> * We can create a new partition as long as partitioning keys
>   are not conflicted with existing partitions. Especially,
>   we cannot add any partitions if we have overflow partitions
>   because a new partition always split the overflow partition.
>
> * We cannot reuse an existing check constraint as a partition
>   constraint. ALTER TABLE INHERIT AS PARTITION brings on
>   a table scan to add a new CHECK constraint.
>
> * No partition triggers nor planner and executor improvements.
>   It would come in the future development.
>
> Regards,
> ---
> ITAGAKI Takahiro
> NTT Open Source Software Center
>
>   
> ------------------------------------------------------------------------
>
>
>   


-- 
Emmanuel Cecchet
Aster Data
Web: http://www.asterdata.com

Re: Syntax for partitioning

From

Itagaki Takahiro

Date:

25 November 2009, 00:00:41

Emmanuel Cecchet <manu@asterdata.com> wrote:

> I think that other databases allows the 
> user to define a tablespace for each partition in the create table 
> statement.

WITH and TABLESPACE clause are supported for each partition.
=# CREATE TABLE parent (...) PARTITION BY (key)   (     PARTITION child_1 VALUES LESS THAN 10 WITH (...) TABLESPACE
tbs_1  );=# CREATE PARTITION child_2 ON parent   VALUES LESS THAN 20 WITH (...) TABLESPACE tbl_2;

> Are you also planning to provide partitioning extensions to 'create 
> table as'?

Ah, I forgot that. It would be possible to have the feature.
There are no syntax issues. But it would be done after we support
automatic INSERT routing. We can create the table will partitions,
but tuples are not divided into child partitions because we have
no insert-triggers at the time of CREATE TABLE AS.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

Re: Syntax for partitioning

From

Greg Smith

Date:

25 November 2009, 00:53:10

I just made a few updates to 
http://wiki.postgresql.org/wiki/Table_partitioning , merging in the 
stuff that had been on the ToDo page and expanding the links to 
discussion on this list a bit.  The number of submitted patches over the 
last couple of years that handle some subset of the desired feature set 
here is really remarkable when you see them all together.

-- 
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com  www.2ndQuadrant.com

Re: Syntax for partitioning

From

Emmanuel Cecchet

Date:

25 November 2009, 01:37:36

Greg Smith wrote:
> I just made a few updates to 
> http://wiki.postgresql.org/wiki/Table_partitioning , merging in the 
> stuff that had been on the ToDo page and expanding the links to 
> discussion on this list a bit.  The number of submitted patches over 
> the last couple of years that handle some subset of the desired 
> feature set here is really remarkable when you see them all together.
>
Should we add the 'WITH (...) TABLESPACE tbs' options to the syntax 
since they are supported?
Do we support ALTER ... SET TABLESPACE?

Emmanuel

-- 
Emmanuel Cecchet
Aster Data
Web: http://www.asterdata.com

Re: Syntax for partitioning

From

Itagaki Takahiro

Date:

25 November 2009, 02:03:27

Emmanuel Cecchet <manu@asterdata.com> wrote:

> Should we add the 'WITH (...) TABLESPACE tbs' options to the syntax 
> since they are supported?

Added the description.

> Do we support ALTER ... SET TABLESPACE?

DROP/ALTER PARTITION are synonyms for DROP/ALTER TABLE.
SET TABLESPACE is also supported. Added the description.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

Re: Syntax for partitioning

From

Itagaki Takahiro

Date:

25 November 2009, 04:17:18

Here is an updated partitioning syntax patch.
It implements syntax described here:
    http://wiki.postgresql.org/wiki/Table_partitioning#Syntax

Changes:
  * New syntax: ALTER TABLE parent ATTACH/DETACH PARTITION.
  * Partition keys accepts an opclass name instead of an operator.
  * "lo <= key AND key < hi" is used in range check constraints instead of
    "key >= lo AND key < hi".

ToDo items:
  * pg_get_partitiondef(parentOid, in_alter_format) might be ugly.
    It was designed only for psql -d and pg_dump. It might be cleaner
    if we move SQL formatter from the core function to client tools.
      In psql: pg_get_partitiondef(oid, false)
        Partitions: RANGE (sales_date)
        (
            PARTITION sales_2006 VALUES LESS THAN '...',
            ...
            PARTITION sales_max VALUES LESS THAN MAXVALUE
        )
      In pg_dump: pg_get_partitiondef(oid, true)
        ALTER TABLE parent PARTITION BY RANGE (sales_date);
        ALTER TABLE parent ATTACH PARTITION sales_2006 VALUES LESS THAN '...';
        ...
        ALTER TABLE parent ATTACH PARTITION sales_max VALUES LESS THAN MAXVALUE;

  * The patch does not contain the following documentation,
    but I'll start writing them if the syntax is ok.
      - ddl-partitioning.sgml
      - alter-partition.sgml  (new)
      - create-partition.sgml (new)
      - drop-partition.sgml   (new)

Note:
  * In fact, malloc was not a bug because it was the code in pg_dump.

Comments welcome.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

Attachment

partitioning_20091125.patch

Re: Syntax for partitioning

From

Itagaki Takahiro

Date:

03 December 2009, 22:53:42

Here is an update partitioning syntax patch.

A bug reported by Marko is fixed.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

Attachment

partitioning_20091204.patch

Re: Syntax for partitioning

From

Simon Riggs

Date:

04 December 2009, 05:01:51

On Fri, 2009-12-04 at 11:54 +0900, Itagaki Takahiro wrote:
> Here is an update partitioning syntax patch.
> 
> A bug reported by Marko is fixed.

I will review and eventually commit this, if appropriate, though it is
3rd in my queue and will probably not be done for at least 2 weeks,
possibly 4 weeks.

Some brief review comments

* SQL:2008 contains PARTITION BY syntax, though in completely different
context. A possible alternative would be to use PARTITIONED BY. Please
justify either way. Possibly add short section to docs to explain this.

* There are relatively few comments in-line. Please can you provide a
README section for the code that explains how partitioning works? A
reviewer's guide would also be helpful to explain some of the design
decisions in particular places.

* All of the tests use about 4 partitions, which is the kind of thing
that makes me think the test coverage isn't wide enough. More tests
please. This would include operations on 0?, 1 and many partitions. We
also need more test failures, covering all the dumbass things people
will attempt. Also need partitioning by strange datatypes, arrays,
timestamps with timezones and stupidly long list values. Read Rob
Treat's humorous dissection of earlier partitioning features at PGcon to
see what needs to be covered.

* Docs. This is looking fairly solid, so please begin working on docs. I
won't hold you to this in next few weeks, but we know it needs doing.

* It is essential that we have large real-world(ish) performance test
results that proves this patch will work in the areas for which it is
intended. We need a test with 500 partitions, using at least 10MB
partitions to see if there are any scale-related issues. This test case
will help set targets for later developments because it will highlight
performance issues in planning, DDL and other areas. This doesn't have
to be by the patch author, but we need to see evidence that this patch
operates in its primary use case.

-- Simon Riggs           www.2ndQuadrant.com

Re: Syntax for partitioning

From

Greg Smith

Date:

06 December 2009, 21:31:12

Simon Riggs wrote:
> I will review and eventually commit this, if appropriate, though it is
> 3rd in my queue and will probably not be done for at least 2 weeks,
> possibly 4 weeks.
>   
I've marked Simon as the next reviewer and expected committer on this 
patch and have updated it to "Returned with Feedback".  That's not 
saying work is going to stop on it.  It just looks like that is going to 
extend beyond when we want this CommitFest to finish, and I want to pull 
it off the list of things I'm monitoring as part of that.  Everyone 
should keep hammering away at nailing this fundamental bit down, so that 
the rest of the partitioning patch ideas floating around finally have a 
firm place to start attaching to.

-- 
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com  www.2ndQuadrant.com

Re: Syntax for partitioning

From

Itagaki Takahiro

Date:

06 December 2009, 21:56:19

Greg Smith <greg@2ndquadrant.com> wrote:

> I've marked Simon as the next reviewer and expected committer on this 
> patch and have updated it to "Returned with Feedback".

OK. I'll re-submit improved patches in the next commit fest.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

Re: Syntax for partitioning

From

Simon Riggs

Date:

14 December 2009, 20:59:01

On Fri, 2009-12-04 at 09:00 +0000, Simon Riggs wrote:
> On Fri, 2009-12-04 at 11:54 +0900, Itagaki Takahiro wrote:
> > Here is an update partitioning syntax patch.
> > 
> > A bug reported by Marko is fixed.
> 
> I will review and eventually commit this, if appropriate, though it is
> 3rd in my queue and will probably not be done for at least 2 weeks,
> possibly 4 weeks.

I'll have to go back on this unfortunately, sorry about that.

I have enough items emerging from HS to keep me busy much longer than I
thought. I'll run with VF if that's OK, since I have some other related
changes in that area and it makes sense to understand that code also, if
OK with you.

-- Simon Riggs           www.2ndQuadrant.com

New VACUUM FULL still needed?

From

Takahiro Itagaki

Date:

14 December 2009, 22:16:31

Simon Riggs <simon@2ndQuadrant.com> wrote:

> I have enough items emerging from HS to keep me busy much longer than I
> thought. I'll run with VF if that's OK, since I have some other related
> changes in that area and it makes sense to understand that code also, if
> OK with you.

Sure. Many users want to see HS.

BTW, New VACUUM FULL patch is waiting for being applied.   https://commitfest.postgresql.org/action/patch_view?id=202
But I heard HS is attempting to modify VFI in another way or remove it
completely. Do we still need the patch, or reject it and fix VFI in HS?

Regards,
---
Takahiro Itagaki
NTT Open Source Software Center

Re: New VACUUM FULL still needed?

From

Simon Riggs

Date:

14 December 2009, 22:26:54

On Tue, 2009-12-15 at 11:17 +0900, Takahiro Itagaki wrote:
> Simon Riggs <simon@2ndQuadrant.com> wrote:
> 
> > I have enough items emerging from HS to keep me busy much longer than I
> > thought. I'll run with VF if that's OK, since I have some other related
> > changes in that area and it makes sense to understand that code also, if
> > OK with you.
> 
> Sure. Many users want to see HS.
> 
> BTW, New VACUUM FULL patch is waiting for being applied.
>     https://commitfest.postgresql.org/action/patch_view?id=202
> But I heard HS is attempting to modify VFI in another way or remove it
> completely. Do we still need the patch, or reject it and fix VFI in HS?

Plan is to apply patch for new VF, then for me to write another patch to
allow new VF to work with system relations also. 

VACUUM FULL INPLACE would then be prohibited if recovery_connections =
on, which given that is the default will pretty much reduce VFI to not
working at all in 8.5. But it remains an option if problems occur.

My intention is to keep all of the code there for 8.5 and then begin
removing old VF code at beginning of 8.6dev. It's been there too long
and is in far too deep to rip it out quickly. There's no mileage in
spending time on removing a non-feature when there is feature work to be
done.

-- Simon Riggs           www.2ndQuadrant.com

Re: Syntax for partitioning

From

Jaime Casanova

Date:

14 December 2009, 22:31:37

On Mon, Dec 14, 2009 at 7:29 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Fri, 2009-12-04 at 09:00 +0000, Simon Riggs wrote:
>> On Fri, 2009-12-04 at 11:54 +0900, Itagaki Takahiro wrote:
>> > Here is an update partitioning syntax patch.
>> >
>> > A bug reported by Marko is fixed.
>>
>> I will review and eventually commit this, if appropriate, though it is
>> 3rd in my queue and will probably not be done for at least 2 weeks,
>> possibly 4 weeks.
>
> I'll have to go back on this unfortunately, sorry about that.
>

the next patch for this will arrive in the next commitfest so maybe
you have more time then

--
Atentamente,
Jaime Casanova
Soporte y capacitación de PostgreSQL
Asesoría y desarrollo de sistemas
Guayaquil - Ecuador
Cel. +59387171157

Re: New VACUUM FULL still needed?

From

Simon Riggs

Date:

29 December 2009, 06:42:21

On Tue, 2009-12-15 at 11:17 +0900, Takahiro Itagaki wrote:
> Simon Riggs <simon@2ndQuadrant.com> wrote:
> 
> > I have enough items emerging from HS to keep me busy much longer than I
> > thought. I'll run with VF if that's OK, since I have some other related
> > changes in that area and it makes sense to understand that code also, if
> > OK with you.
> 
> Sure. Many users want to see HS.
> 
> BTW, New VACUUM FULL patch is waiting for being applied.
>     https://commitfest.postgresql.org/action/patch_view?id=202
> But I heard HS is attempting to modify VFI in another way or remove it
> completely. Do we still need the patch, or reject it and fix VFI in HS?

HS has two options: (1) move forwards alongside this patch, or (2) make
a separate fix for VFI in HS.

We still want to apply this patch, but I feel it needs changes as
discussed downthread as part of my review. Will it be possible for you
to make those changes and resubmit for next commitfest, or earlier?

-- Simon Riggs           www.2ndQuadrant.com

Re: Syntax for partitioning

From

Thom Brown

Date:

09 November 2011, 09:27:12

On 15 December 2009 02:31, Jaime Casanova <jcasanov@systemguards.com.ec> wrote:
> On Mon, Dec 14, 2009 at 7:29 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> On Fri, 2009-12-04 at 09:00 +0000, Simon Riggs wrote:
>>> On Fri, 2009-12-04 at 11:54 +0900, Itagaki Takahiro wrote:
>>> > Here is an update partitioning syntax patch.
>>> >
>>> > A bug reported by Marko is fixed.
>>>
>>> I will review and eventually commit this, if appropriate, though it is
>>> 3rd in my queue and will probably not be done for at least 2 weeks,
>>> possibly 4 weeks.
>>
>> I'll have to go back on this unfortunately, sorry about that.
>>
>
> the next patch for this will arrive in the next commitfest so maybe
> you have more time then

So will this be revived any time soon?  Were there any subsequent
proposals which were posted?

-- 
Thom Brown
Twitter: @darkixion
IRC (freenode): dark_ixion
Registered Linux user: #516935

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Syntax for partitioning

From

Jaime Casanova

Date:

09 November 2011, 10:42:08

On Wed, Nov 9, 2011 at 8:26 AM, Thom Brown <thom@linux.com> wrote:
>
> So will this be revived any time soon?  Were there any subsequent
> proposals which were posted?
>

there was an updated patch, you can find in this thread:
http://archives.postgresql.org/message-id/20100114181323.9A33.52131E4D@oss.ntt.co.jp

not sure what happens after that.

--
Jaime Casanova         www.2ndQuadrant.com
Professional PostgreSQL: Soporte 24x7 y capacitación

Re: Syntax for partitioning

From

Robert Haas

Date:

09 November 2011, 11:15:19

On Wed, Nov 9, 2011 at 9:35 AM, Jaime Casanova <jaime@2ndquadrant.com> wrote:
> On Wed, Nov 9, 2011 at 8:26 AM, Thom Brown <thom@linux.com> wrote:
>>
>> So will this be revived any time soon?  Were there any subsequent
>> proposals which were posted?
>>
>
> there was an updated patch, you can find in this thread:
> http://archives.postgresql.org/message-id/20100114181323.9A33.52131E4D@oss.ntt.co.jp
>
> not sure what happens after that.

I reviewed a later version here:

http://archives.postgresql.org/pgsql-hackers/2010-07/msg00183.php

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Syntax for partitioning

From

Thom Brown

Date:

09 November 2011, 11:31:25

On 9 November 2011 15:15, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Nov 9, 2011 at 9:35 AM, Jaime Casanova <jaime@2ndquadrant.com> wrote:
>> On Wed, Nov 9, 2011 at 8:26 AM, Thom Brown <thom@linux.com> wrote:
>>>
>>> So will this be revived any time soon?  Were there any subsequent
>>> proposals which were posted?
>>>
>>
>> there was an updated patch, you can find in this thread:
>> http://archives.postgresql.org/message-id/20100114181323.9A33.52131E4D@oss.ntt.co.jp
>>
>> not sure what happens after that.
>
> I reviewed a later version here:
>
> http://archives.postgresql.org/pgsql-hackers/2010-07/msg00183.php

Ah yes, I've located a reference to this on the wiki now.  No wiki
updates needed.

I guess it's a matter of whether Takahiro-san has the time and desire
to pick this up again any time soon.  Whenever I cross the topic of
partitioning in PostgreSQL, it's always a tad embarrassing to explain
that it's still hacky compared to other database systems (and this
came up again last week), so this is of particular interest to me.  At
the moment there's no testing required as it's returned with feedback,
but I'm very willing to help assist in testing it should this return
to the fore again.  The idea of getting both this and materialised
views in time for 9.3 is extremely appealing; a performance release
(9.2) followed by a usability release.

--
Thom Brown
Twitter: @darkixion
IRC (freenode): dark_ixion
Registered Linux user: #516935

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Syntax for partitioning

From

Dimitri Fontaine

Date:

09 November 2011, 17:06:12

Thom Brown <thom@linux.com> writes:
> Whenever I cross the topic of
> partitioning in PostgreSQL, it's always a tad embarrassing to explain
> that it's still hacky compared to other database systems (and this
> came up again last week), so this is of particular interest to me.  At

The more I think about this problem, the more I think that the reason
why we still don't have declarative partitioning is that it basically
sucks.  Other vendors offer it because they couldn't do better, but they
are just asking the DBA to implement a service the system should be able
to care for itself.

Who knows better than PostgreSQL which part of the data are the most
often queried and how to best lay them out on disk to ensure QoS?  If
you think that's the DBA, go ask Tom to implement query hints…

More seriously, partitioning in PostgreSQL could be mostly transparent
to the users and "just working": it's all about data storage locality
and we already have a sub-relation storage model. By using segment
exclusion and some level of automatic clustering (physical organization)
of data, we could have all the benefits of partitioning without the
hassle of maintaining yet another explicit level of data definition.
 In particular, what part of the declarative partitioning system is intended to take care about creating the right
partitionsbefore new data are sent to them? 

In a first implementation, we could decide to "partition" the data over
an index that the DBA has to pick, and then maintain a "segment index"
which is a map of the data distribution in the physical segments, for
the indexed columns. The granularity could be different and maybe
dynamic so that you start at a block level map index for very little
tables and grow up to a segment map index for huge tables that require
thousands of segments, 1GB each.

Then the system have to organize data modifications so that it optimizes
the ranges to be the smallest possible on each map entry. And the
executor then has to be able to consult that map at query time and
exclude whole segments of data (or blocks for little tables) when the
segment indexing is able to exclude data. With some tricks because we
realize that depending on the size of the portions you skip you might
not benefit from moving the head on the platter in another way than what
the ongoing seqscan does, but we already have GUCs about that.

We might also need some internal facilities to lock out per segment (or
rather "map entries") rather than per table so that we have something
like a WHERE clause support for TRUNCATE.

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support

Re: Syntax for partitioning

From

Thom Brown

Date:

09 November 2011, 18:01:00

On 9 November 2011 21:05, Dimitri Fontaine <dimitri@2ndquadrant.fr> wrote:
> Thom Brown <thom@linux.com> writes:
>> Whenever I cross the topic of
>> partitioning in PostgreSQL, it's always a tad embarrassing to explain
>> that it's still hacky compared to other database systems (and this
>> came up again last week), so this is of particular interest to me.  At
>
> The more I think about this problem, the more I think that the reason
> why we still don't have declarative partitioning is that it basically
> sucks.  Other vendors offer it because they couldn't do better, but they
> are just asking the DBA to implement a service the system should be able
> to care for itself.
>
> Who knows better than PostgreSQL which part of the data are the most
> often queried and how to best lay them out on disk to ensure QoS?  If
> you think that's the DBA, go ask Tom to implement query hints…

That also sounds like an argument against indexes.  Since the system
knows which parts of data are most often queried, surely it should be
managing indexes, not the DBA?

I imagine structuring data after the fact would involve rewriting data
whereas planning for upcoming data is something DBAs are expected to
do anyway using constraints, triggers, indexes etc.

And as it stands, what the DBA has to do at the moment is to exploit
table inheritance, apply constraints to each of the child tables
(mainly for constraint exclusion), then create a trigger to support
various types of update/insert/delete.  So what we have now is very
un-user-friendly, and tends to surprise end users.  The flow of data
isn't part of the table definition, it's custom-programmed into an
event on the parent table.

And partitioning may not just be about performance, it can be about
organising data and making it more manageable.  Although I agree that
if it came to partitioning across nodes for parallel access,
PostgreSQL could be in a position to make decisions about how that is
distributed.

> More seriously, partitioning in PostgreSQL could be mostly transparent
> to the users and "just working": it's all about data storage locality
> and we already have a sub-relation storage model. By using segment
> exclusion and some level of automatic clustering (physical organization)
> of data, we could have all the benefits of partitioning without the
> hassle of maintaining yet another explicit level of data definition.

That could be unworkable in a high-load OLTP environment.

>  In particular, what part of the declarative partitioning system is
>  intended to take care about creating the right partitions before new
>  data are sent to them?
>
> In a first implementation, we could decide to "partition" the data over
> an index that the DBA has to pick, and then maintain a "segment index"
> which is a map of the data distribution in the physical segments, for
> the indexed columns. The granularity could be different and maybe
> dynamic so that you start at a block level map index for very little
> tables and grow up to a segment map index for huge tables that require
> thousands of segments, 1GB each.
>
> Then the system have to organize data modifications so that it optimizes
> the ranges to be the smallest possible on each map entry. And the
> executor then has to be able to consult that map at query time and
> exclude whole segments of data (or blocks for little tables) when the
> segment indexing is able to exclude data. With some tricks because we
> realize that depending on the size of the portions you skip you might
> not benefit from moving the head on the platter in another way than what
> the ongoing seqscan does, but we already have GUCs about that.
>
> We might also need some internal facilities to lock out per segment (or
> rather "map entries") rather than per table so that we have something
> like a WHERE clause support for TRUNCATE.

Would this solve the same set of problems that partitioning attempts
to address?  And what about the example case of quarterly data?  In
your proposed design could you drop an entire set of data without a
DELETE?

And maybe I'm not looking at it from the right angle. (probably)
Although I appreciate some thought is needed about how useful
partitioning implementations in other database systems really are.

And now to demonstrate some additional ignorance on my part... does
the standard cover this?

--
Thom Brown
Twitter: @darkixion
IRC (freenode): dark_ixion
Registered Linux user: #516935

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Syntax for partitioning

From

Greg Smith

Date:

09 November 2011, 18:36:43

On 11/09/2011 04:05 PM, Dimitri Fontaine wrote:
> The more I think about this problem, the more I think that the reason
> why we still don't have declarative partitioning is that it basically
> sucks.

I think that we don't have it because no one has ever dumped the much 
larger than might be expected amount of time into pulling all the pieces 
together and smoothing out the rough parts.  I don't think there's any 
design thinking leap needed over what's already been worked out.  Just a 
lot of work to get all the edge cases right on the simplest possible 
thing that is useful.

The path to reach something that could be considered for commit includes 
something like this set of things:

1) Add partitioning catalog support
2) Create new syntax for partitioning that writes to the catalog
3) Decide how to represent partition data in memory
4) Route new INSERTed data into the right place
5) Support moving UPDATEd data into a new partition
6) Handle COPY usefully

The last rev of this submitted was still working through (1) here, i.e. 
this review from Robert:  
http://archives.postgresql.org/message-id/AANLkTikP-1_8B04eyIK0sDf8uA5KMo64o8sorFBZE_CT@mail.gmail.com  
And there's a whole pile of issues I don't think have been fully 
explored about even the most basic case.  How to handle ALTER to these 
structures cleanly, locking, etc..  I don't think it's possible to 
design such that you skip a large portion of these details; someone 
needs to put some number of spend weeks+ getting them all right instead.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us

Re: Syntax for partitioning

From

Jeff Janes

Date:

09 November 2011, 20:07:50

On Wed, Nov 9, 2011 at 1:05 PM, Dimitri Fontaine <dimitri@2ndquadrant.fr> wrote:
> Thom Brown <thom@linux.com> writes:
>> Whenever I cross the topic of
>> partitioning in PostgreSQL, it's always a tad embarrassing to explain
>> that it's still hacky compared to other database systems (and this
>> came up again last week), so this is of particular interest to me.  At
>
> The more I think about this problem, the more I think that the reason
> why we still don't have declarative partitioning is that it basically
> sucks.  Other vendors offer it because they couldn't do better, but they
> are just asking the DBA to implement a service the system should be able
> to care for itself.

Your last sentence seems to be a better description of PostgreSQL's
current implementation of partitioning via inheritance and constraints
and triggers.  Partitioning is a service the system should be able to
care for itself, even if it does need the DBA to declare it.  And why
shouldn't it need a DBA to declare it?  How is the system supposed to
anticipate that at some point years in the future I will want to run
the command sequence "create foo_archive as select from foo where
year<2009; delete from foo where year<2009", or its partition-based
equivalent, and have it operate on several billion rows cleanly and
quickly?  I don't think we can expect the system to anticipate what it
has never before experienced.  This is the DBA's job.

> Who knows better than PostgreSQL which part of the data are the most
> often queried and how to best lay them out on disk to ensure QoS?  If
> you think that's the DBA, go ask Tom to implement query hints…
>
> More seriously, partitioning in PostgreSQL could be mostly transparent
> to the users and "just working": it's all about data storage locality
> and we already have a sub-relation storage model. By using segment
> exclusion and some level of automatic clustering (physical organization)
> of data, we could have all the benefits of partitioning without the
> hassle of maintaining yet another explicit level of data definition.

While automatic clustering would be nice, it isn't the same thing as
partitioning.

Cheers,

Jeff

Re: Syntax for partitioning

From

Dimitri Fontaine

Date:

10 November 2011, 17:19:44

Jeff Janes <jeff.janes@gmail.com> writes:
> shouldn't it need a DBA to declare it?  How is the system supposed to
> anticipate that at some point years in the future I will want to run
> the command sequence "create foo_archive as select from foo where
> year<2009; delete from foo where year<2009", or its partition-based
> equivalent, and have it operate on several billion rows cleanly and
> quickly?  I don't think we can expect the system to anticipate what it
> has never before experienced.  This is the DBA's job.

Well, the not-fully spelled out proposal would be to still work it out
from a list of columns picked by the DBA.  I though that an existing
index would be best, but maybe just columns would be good.

I guess it's already time to play loose and invent some SQL syntax to
make it easier talking about the same thing:
 ALTER TABLE foo SEGMENT ON (year, stamp);

Now the aim would be to be able to implement the operation you describe
by using the new segment map, which is an index pointing to sequential
ranges of on-disk blocks where the data is known to share a common key
range over the columns you're segmenting on.  I would imagine this SQL:
 TRUNCATE foo WHERE year < 2009;

As the on-disk location of the data that qualify this WHERE clause is
known, it could be possible to (predicate) lock it and bulk remove it,
unlinking whole segments (1GB) at a time when relevant.

> While automatic clustering would be nice, it isn't the same thing as
> partitioning.

That has been my initial reaction to that kind of ideas too.  After some
more time brewing the ideas, I'm not convinced that the use cases that
usually drives you to the latter can't be solved with the former.

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support

Re: Syntax for partitioning

From

Daniel Farina

Date:

10 November 2011, 23:24:17

On Thu, Nov 10, 2011 at 1:19 PM, Dimitri Fontaine
<dimitri@2ndquadrant.fr> wrote:
> Now the aim would be to be able to implement the operation you describe
> by using the new segment map, which is an index pointing to sequential
> ranges of on-disk blocks where the data is known to share a common key
> range over the columns you're segmenting on.  I would imagine this SQL:
>
>  TRUNCATE foo WHERE year < 2009;
>
> As the on-disk location of the data that qualify this WHERE clause is
> known, it could be possible to (predicate) lock it and bulk remove it,
> unlinking whole segments (1GB) at a time when relevant.

I am basically in agreement with you.  After wanting better
partitioning (Oracle-style) in Postgres for some time just to be free
of the mechanically painful table-inheritance version, I have come
around to thinking it's basically a bad idea, but one that with a
little bit of finessing can be made a good idea.

The reason I have started to think this is because of an old feature
that works very well: CREATE INDEX.  In spite of what people might
think, I think it's pretty clear that CREATE INDEX is not DDL: it's
actually physical advice to the system.  I have seen the
fourth-generation-language promise delivered upon quite a few times in
production, now: we witness an access pattern that becomes
problematic, we run CREATE INDEX CONCURRENTLY, the problem is solved
without any change to the application, and the index definition is
backported to our application bootstrapping process.  It would be hard
for me to understate how valuable this has been to avoid both
premature optimization and excessive panic when dealing with change.

Similar to the overall project stance on query hints, I don't think
Postgres should retreat on its ground from being a 4GL system.  I
think both indexes and a hypothetical partitioning feature should be
clearly isolated as directives to the system about how to physically
organize and access data, and any partitioning feature that creates
new relation namespace entries and expects you to manipulate them to
gain the benefits seems like extra, non-desirable surface area to me.

I think this becomes especially apparent once one considers on-line
repartitioning (I am exposing a bias here, but any feature in Postgres
that cannot be done concurrently -- like VACUUM FULL -- is very
dangerous to both me and my customers, whereas it may not be useless
or dangerous to a build-your-own data warehouse).  It feels like it
would be desirable to have the physical partitions exist in an
inconsistent-state whereby they are being brought into alignment with
the newly desired physical description.

Finally, I think a legitimate objection to this inclination is that it
can be really easy to issue a DELETE that is usually fast, but when
any mistake or change creeps in becomes very slow: I have heard from
some friends making heavy use of table partitioning via inheritance
that one of the problems is not quite exactly matching the table
constraint, and then hosing their hardware. As a result, they mangle
partitions explicitly in the application to prevent foot-gunning.
That's clearly lame (and they know it), but I think may indicate a
need to instead allow for some kind of physical-access-method
assertion checking quite apart from the logical content of the query
that can deliver a clear, crisp error to application developers if a
preferred access pattern is not usable.  My experience suggests that
while solving problems is good, turning problems into flat-out errors
is *nearly* as good, and worth some more investigation.

--
fdr

Re: Syntax for partitioning

From

Martijn van Oosterhout

Date:

12 November 2011, 09:23:22

On Thu, Nov 10, 2011 at 10:19:02PM +0100, Dimitri Fontaine wrote:
> Now the aim would be to be able to implement the operation you describe
> by using the new segment map, which is an index pointing to sequential
> ranges of on-disk blocks where the data is known to share a common key
> range over the columns you're segmenting on.  I would imagine this SQL:
>
>   TRUNCATE foo WHERE year < 2009;
>
> As the on-disk location of the data that qualify this WHERE clause is
> known, it could be possible to (predicate) lock it and bulk remove it,
> unlinking whole segments (1GB) at a time when relevant.

While I agree that explicit partitioning is somewhat of a hack, it's a
really useful hack.  But for me the most important use of partitioning
is "dropping a billion rows efficiently and getting the disk space
back".  And the biggest problem is always that dropping blocks of a
table requires fixing all the indexes.

For fixing the index of the partition key it's a simpler problem, you
could probably prune the btree relatively efficiently.  But for all
other indexes there's no better solution than walk the entire index.

However, in the very special case where the drop boundaries explicitly
match the dataset, you can simply drop all the indexes.

Now, if someone cames up with an efficient way to drop a huge number of
rows quickly, then I admit one of the major issues is fixed.  But
recovering the disk space is much harder.  Yes, recent versions of
Linux come with ways to punch holes in existing files, but that doesn't
make it quick or efficient.

> > While automatic clustering would be nice, it isn't the same thing as
> > partitioning.
>
> That has been my initial reaction to that kind of ideas too.  After some
> more time brewing the ideas, I'm not convinced that the use cases that
> usually drives you to the latter can't be solved with the former.

I hope so, but I'm not sure I'd like partitioning support to wait on
someone hitting on the right idea.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts.  -- Arthur Schopenhauer

Re: Syntax for partitioning

From

Tom Lane

Date:

12 November 2011, 12:30:19

Martijn van Oosterhout <kleptog@svana.org> writes:
> While I agree that explicit partitioning is somewhat of a hack, it's a
> really useful hack.  But for me the most important use of partitioning
> is "dropping a billion rows efficiently and getting the disk space
> back".

Right.  The only way to make that speedy is for the partition boundaries
to match the desired granularity of data removal.  I don't really see
any way that the database can be expected to know what that is, unless
it's told in advance.  So AFAICS you really have to have a declarative
way of telling it how to do the partitioning --- it's not going to be
able to infer that automatically.
        regards, tom lane

Re: Syntax for partitioning

From

Dimitri Fontaine

Date:

14 November 2011, 05:29:25

Tom Lane <tgl@sss.pgh.pa.us> writes:
> to match the desired granularity of data removal.  I don't really see
> any way that the database can be expected to know what that is, unless
> it's told in advance.  So AFAICS you really have to have a declarative
> way of telling it how to do the partitioning --- it's not going to be
> able to infer that automatically.

Yes, I'm taking that back. Declarative is not the same thing as explicit
partitioning though, that "index" like physical map is declarative too,
e.g.

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support

Re: Syntax for partitioning

From

Dimitri Fontaine

Date:

14 November 2011, 05:33:46

Martijn van Oosterhout <kleptog@svana.org> writes:
> While I agree that explicit partitioning is somewhat of a hack, it's a
> really useful hack.  But for me the most important use of partitioning
> is "dropping a billion rows efficiently and getting the disk space
> back".  And the biggest problem is always that dropping blocks of a
> table requires fixing all the indexes.

The problem with partitions that are in fact table is that the index are
separated and you can't enforce unique globally in the partition set.

Even with that physical map idea (segment based partitioning, but
allowing a finer control than segments), you could still maintain any
number of partial indexes, but still use a single primary key e.g.

> However, in the very special case where the drop boundaries explicitly
> match the dataset, you can simply drop all the indexes.

That's the idea with partial indexes too, right?

> Now, if someone cames up with an efficient way to drop a huge number of
> rows quickly, then I admit one of the major issues is fixed.  But
> recovering the disk space is much harder.  Yes, recent versions of
> Linux come with ways to punch holes in existing files, but that doesn't
> make it quick or efficient.

If you happen to drop a part of the data that fits in one or more
segments (and with a decent fillfactor you need less than 1GB to get
there), then you can unlink() whole files at a time.  That would be the
goal here.

> I hope so, but I'm not sure I'd like partitioning support to wait on
> someone hitting on the right idea.

I would think that's exactly what's been happening to us for several
years already.

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support

Re: Syntax for partitioning

From

Martijn van Oosterhout

Date:

14 November 2011, 14:31:41

On Mon, Nov 14, 2011 at 10:33:36AM +0100, Dimitri Fontaine wrote:
> Martijn van Oosterhout <kleptog@svana.org> writes:
> > While I agree that explicit partitioning is somewhat of a hack, it's a
> > really useful hack.  But for me the most important use of partitioning
> > is "dropping a billion rows efficiently and getting the disk space
> > back".  And the biggest problem is always that dropping blocks of a
> > table requires fixing all the indexes.
>
> The problem with partitions that are in fact table is that the index are
> separated and you can't enforce unique globally in the partition set.
>
> Even with that physical map idea (segment based partitioning, but
> allowing a finer control than segments), you could still maintain any
> number of partial indexes, but still use a single primary key e.g.

Ah, well, if you can come up with a way to get the advantages of
partition while still being able to enforce primary keys over
partitions, that would be A Really Cool Idea.

That said, I still don't see how you can enforce a unique index over
multiple segments over something other than the partition key while
still allowing quick dropping of segments.  If you can fix that you can
make it work for the current inheritence-style partitioning.

> If you happen to drop a part of the data that fits in one or more
> segments (and with a decent fillfactor you need less than 1GB to get
> there), then you can unlink() whole files at a time.  That would be the
> goal here.

I feel uncomfortable with the "happen to". You can add the magic too,
but for scripting purposes I'd feel better if it could be done via DDL
also. That way typos don't end up being 5 day queries all of a sudden.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts.  -- Arthur Schopenhauer

Re: Syntax for partitioning

From

Dimitri Fontaine

Date:

16 November 2011, 06:00:51

Martijn van Oosterhout <kleptog@svana.org> writes:
> That said, I still don't see how you can enforce a unique index over
> multiple segments over something other than the partition key while
> still allowing quick dropping of segments.  If you can fix that you can
> make it work for the current inheritence-style partitioning.

Well the Primary Key and the Physical Map Index do not need to be on the
same set of columns.

>> If you happen to drop a part of the data that fits in one or more
>> segments (and with a decent fillfactor you need less than 1GB to get
>> there), then you can unlink() whole files at a time.  That would be the
>> goal here.
>
> I feel uncomfortable with the "happen to". You can add the magic too,
> but for scripting purposes I'd feel better if it could be done via DDL
> also. That way typos don't end up being 5 day queries all of a sudden.

If the data fills less than a segment then you can't unlink() the file,
you have to mark the tuples / pages as free space.  If you have a
partial index matching the whole portion of data you're removing, you
can still drop it before hand — or maybe the system can be instructed to
do so?

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support