Thread: Syntax for partitioning
I'd like to improve partitioning feature in 8.5. Kedar-san's previous work is wonderful, but I cannot see any updated patch. http://archives.postgresql.org/message-id/bd8134a40906080702s96c90a9q3bbb581b9bd0d5d7@mail.gmail.com So, I'll take over the work if there are no ones to do it. I'm thinking to add syntax support first. Table partitioning was proposed many times, but it is still not applied into core. The reason is it is too difficult to make perfect partitioning feature at once. I think syntax support is a good start. First, I will add syntax for CREATE TABLE, ALTER TABLE ADD/DROP PARTITION. The syntax is borrowed from from Oracle and MySQL. Their characteristics are using "LESS THAN" in range partitioning. The keyword "PARTITION" is added to the full-reserved keyword list to support ADD/DROP PARTITION. Those syntax is merely a syntax sugar for INHERITS with CHECK. Declarations are translated into CHECK constraints. I have a plan to adjust pg_dump to dump definitions of partitioning in the correct format, but the actual implementation will be still based on constraint exclusion. In addition, hash partitioning is not implemented; syntax is parsed but "not implemented" error are raised for now. Here is syntax I propose: ---- ALTER TABLE table_name ADD PARTITION name ...; ALTER TABLE table_name DROP PARTITION [IF EXISTS] name [CASCADE | RESTRICT]; Range partitioning: CREATE TABLE table_name ( columns ) PARTITION BY RANGE ( a_expr ) ( PARTITION name VALUES LESSTHAN [(] const [)], PARTITION name VALUES LESS THAN [(] MAXVALUE [)] -- overflow partition ); List partitioning: CREATE TABLE table_name ( columns ) PARTITION BY LIST ( a_expr ) ( PARTITION name VALUES [IN]( const [, ...] ), PARTITION name VALUES [IN] [(] DEFAULT [)] -- overflow partition ); Hash partitioning: CREATE TABLE table_name ( columns ) PARTITION BY HASH ( a_expr ) PARTITIONS num_partitions; CREATE TABLE table_name ( columns ) PARTITION BY HASH ( a_expr ) ( PARTITION name, ... ); Note: * Each partition can have optional WITH (...) and TABLESPACE clauses. * '(' and ')' are optional to support both Oracleand MySQL syntax. ---- Comments welcome. Regards, --- ITAGAKI Takahiro NTT Open Source Software Center
2009/10/29 Itagaki Takahiro <itagaki.takahiro@oss.ntt.co.jp>: > I'd like to improve partitioning feature in 8.5. > Kedar-san's previous work is wonderful, but I cannot see any updated patch. > http://archives.postgresql.org/message-id/bd8134a40906080702s96c90a9q3bbb581b9bd0d5d7@mail.gmail.com > > So, I'll take over the work if there are no ones to do it. > I'm thinking to add syntax support first. Table partitioning was > proposed many times, but it is still not applied into core. > The reason is it is too difficult to make perfect partitioning > feature at once. I think syntax support is a good start. > > First, I will add syntax for CREATE TABLE, ALTER TABLE ADD/DROP PARTITION. > The syntax is borrowed from from Oracle and MySQL. Their characteristics > are using "LESS THAN" in range partitioning. The keyword "PARTITION" is > added to the full-reserved keyword list to support ADD/DROP PARTITION. > > Those syntax is merely a syntax sugar for INHERITS with CHECK. Declarations > are translated into CHECK constraints. I have a plan to adjust pg_dump to > dump definitions of partitioning in the correct format, but the actual > implementation will be still based on constraint exclusion. In addition, > hash partitioning is not implemented; syntax is parsed but "not implemented" > error are raised for now. > > Here is syntax I propose: > ---- > ALTER TABLE table_name ADD PARTITION name ...; > ALTER TABLE table_name DROP PARTITION [IF EXISTS] name [CASCADE | RESTRICT]; > > Range partitioning: > CREATE TABLE table_name ( columns ) > PARTITION BY RANGE ( a_expr ) > ( > PARTITION name VALUES LESS THAN [(] const [)], > PARTITION name VALUES LESS THAN [(] MAXVALUE [)] -- overflow partition > ); > > List partitioning: > CREATE TABLE table_name ( columns ) > PARTITION BY LIST ( a_expr ) > ( > PARTITION name VALUES [IN] ( const [, ...] ), > PARTITION name VALUES [IN] [(] DEFAULT [)] -- overflow partition > ); > > Hash partitioning: > CREATE TABLE table_name ( columns ) > PARTITION BY HASH ( a_expr ) > PARTITIONS num_partitions; > > CREATE TABLE table_name ( columns ) > PARTITION BY HASH ( a_expr ) > ( > PARTITION name, > ... > ); > > Note: > * Each partition can have optional WITH (...) and TABLESPACE clauses. > * '(' and ')' are optional to support both Oracle and MySQL syntax. > ---- > > Comments welcome. +1 Pavel > > Regards, > --- > ITAGAKI Takahiro > NTT Open Source Software Center > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers >
Hi, > So, I'll take over the work if there are no ones to do it. > I'm thinking to add syntax support first. Table partitioning was > proposed many times, but it is still not applied into core. > The reason is it is too difficult to make perfect partitioning > feature at once. I think syntax support is a good start. Guess we are back to square one again on Partitioning :), but as long as someone is willing to walk the whole nine yards with it, that would be just great! I had proposed Oracle style syntax a while back and had also submitted a WIP patch then. Again then my motive was to move forward in a piece-meal fashion on this feature. First solidify the syntax, keep using the existing inheritance mechanism and go one step at a time. I think a feature like Partitioning needs this kind of an approach, because it might turn out to be a lot of work with a lot of very many sub items. So +1 on solidifying the syntax first and then sorting out the other minute, intricate details later.. Regards, Nikhils > > First, I will add syntax for CREATE TABLE, ALTER TABLE ADD/DROP PARTITION. > The syntax is borrowed from from Oracle and MySQL. Their characteristics > are using "LESS THAN" in range partitioning. The keyword "PARTITION" is > added to the full-reserved keyword list to support ADD/DROP PARTITION. > > Those syntax is merely a syntax sugar for INHERITS with CHECK. Declarations > are translated into CHECK constraints. I have a plan to adjust pg_dump to > dump definitions of partitioning in the correct format, but the actual > implementation will be still based on constraint exclusion. In addition, > hash partitioning is not implemented; syntax is parsed but "not implemented" > error are raised for now. > > Here is syntax I propose: > ---- > ALTER TABLE table_name ADD PARTITION name ...; > ALTER TABLE table_name DROP PARTITION [IF EXISTS] name [CASCADE | RESTRICT]; > > Range partitioning: > CREATE TABLE table_name ( columns ) > PARTITION BY RANGE ( a_expr ) > ( > PARTITION name VALUES LESS THAN [(] const [)], > PARTITION name VALUES LESS THAN [(] MAXVALUE [)] -- overflow partition > ); > > List partitioning: > CREATE TABLE table_name ( columns ) > PARTITION BY LIST ( a_expr ) > ( > PARTITION name VALUES [IN] ( const [, ...] ), > PARTITION name VALUES [IN] [(] DEFAULT [)] -- overflow partition > ); > > Hash partitioning: > CREATE TABLE table_name ( columns ) > PARTITION BY HASH ( a_expr ) > PARTITIONS num_partitions; > > CREATE TABLE table_name ( columns ) > PARTITION BY HASH ( a_expr ) > ( > PARTITION name, > ... > ); > > Note: > * Each partition can have optional WITH (...) and TABLESPACE clauses. > * '(' and ')' are optional to support both Oracle and MySQL syntax. > ---- > > Comments welcome. > > Regards, > --- > ITAGAKI Takahiro > NTT Open Source Software Center > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers > -- http://www.enterprisedb.com
On 29 Oct 2009, at 02:15, Itagaki Takahiro wrote: > I'd like to improve partitioning feature in 8.5. > Kedar-san's previous work is wonderful, but I cannot see any updated > patch. > http://archives.postgresql.org/message-id/bd8134a40906080702s96c90a9q3bbb581b9bd0d5d7@mail.gmail.com > > So, I'll take over the work if there are no ones to do it. > I'm thinking to add syntax support first. Table partitioning was > proposed many times, but it is still not applied into core. > The reason is it is too difficult to make perfect partitioning > feature at once. I think syntax support is a good start. > > First, I will add syntax for CREATE TABLE, ALTER TABLE ADD/DROP > PARTITION. > The syntax is borrowed from from Oracle and MySQL. Their > characteristics > are using "LESS THAN" in range partitioning. The keyword "PARTITION" > is > added to the full-reserved keyword list to support ADD/DROP PARTITION. > > Those syntax is merely a syntax sugar for INHERITS with CHECK. > Declarations > are translated into CHECK constraints. I have a plan to adjust > pg_dump to > dump definitions of partitioning in the correct format, but the actual > implementation will be still based on constraint exclusion. In > addition, > hash partitioning is not implemented; syntax is parsed but "not > implemented" > error are raised for now. > > Here is syntax I propose: > ---- > ALTER TABLE table_name ADD PARTITION name ...; > ALTER TABLE table_name DROP PARTITION [IF EXISTS] name [CASCADE | > RESTRICT]; > > Range partitioning: > CREATE TABLE table_name ( columns ) > PARTITION BY RANGE ( a_expr ) > ( > PARTITION name VALUES LESS THAN [(] const [)], > PARTITION name VALUES LESS THAN [(] MAXVALUE [)] -- overflow > partition > ); > > List partitioning: > CREATE TABLE table_name ( columns ) > PARTITION BY LIST ( a_expr ) > ( > PARTITION name VALUES [IN] ( const [, ...] ), > PARTITION name VALUES [IN] [(] DEFAULT [)] -- overflow > partition > ); > > Hash partitioning: > CREATE TABLE table_name ( columns ) > PARTITION BY HASH ( a_expr ) > PARTITIONS num_partitions; > > CREATE TABLE table_name ( columns ) > PARTITION BY HASH ( a_expr ) > ( > PARTITION name, > ... > ); > > Note: > * Each partition can have optional WITH (...) and TABLESPACE clauses. > * '(' and ')' are optional to support both Oracle and MySQL syntax. > ---- > > Comments welcome. +1000 Thanks ! (most anticipated feature for 8.5, here, next to replication [well, I am interested in multi master, but that's not going to happen :P ] )
Itagaki Takahiro wrote: > The keyword "PARTITION" is > added to the full-reserved keyword list to support ADD/DROP PARTITION. Any chance to avoid that? PARTITION seems like something people might well use as a column or variable name. OTOH, it is reserved in SQL2008 and SQL2003. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Thu, Oct 29, 2009 at 3:35 AM, Nikhil Sontakke <nikhil.sontakke@enterprisedb.com> wrote: > So +1 on solidifying the syntax first and then sorting out the other > minute, intricate details later.. I like that idea as well but I have a concern. What will we do with pg_dump. If the PARTITION commands are just syntactic sugar for creating constraints and inherited tables then pg_dump will have to generate the more generic commands for those objects. When we eventually have real partitioning then restoring such a dump will not create real partitions, just inherited tables. Perhaps we need some kind of option to reverse-engineer partitioning commands from the inheritance structure, but I fear having pg_dump reverse engineer inherited tables to produce partitioning commands will be too hard and error-prone. Hopefully that's too pessimistic though, if they were produced by PARTITION commands they should be pretty regular. -- greg
On Thursday 29 October 2009 18:33:22 Greg Stark wrote: > On Thu, Oct 29, 2009 at 3:35 AM, Nikhil Sontakke > > <nikhil.sontakke@enterprisedb.com> wrote: > > So +1 on solidifying the syntax first and then sorting out the other > > minute, intricate details later.. > > I like that idea as well but I have a concern. What will we do with > pg_dump. If the PARTITION commands are just syntactic sugar for > creating constraints and inherited tables then pg_dump will have to > generate the more generic commands for those objects. When we > eventually have real partitioning then restoring such a dump will not > create real partitions, just inherited tables. Perhaps we need some > kind of option to reverse-engineer partitioning commands from the > inheritance structure, but I fear having pg_dump reverse engineer > inherited tables to produce partitioning commands will be too hard and > error-prone. Hopefully that's too pessimistic though, if they were > produced by PARTITION commands they should be pretty regular. One could have a system catalog containing the partitioning information and generate the constraints et al. from that and mark them in pg_depend... Andres
On tor, 2009-10-29 at 11:15 +0900, Itagaki Takahiro wrote: > Range partitioning: > CREATE TABLE table_name ( columns ) > PARTITION BY RANGE ( a_expr ) > ( > PARTITION name VALUES LESS THAN [(] const [)], > PARTITION name VALUES LESS THAN [(] MAXVALUE [)] -- overflow partition > ); Maybe this needs to mention the actual operator name instead of LESS THAN, in case the operator is not named < or the user wants to use a different one. > Hash partitioning: > CREATE TABLE table_name ( columns ) > PARTITION BY HASH ( a_expr ) > PARTITIONS num_partitions; > > CREATE TABLE table_name ( columns ) > PARTITION BY HASH ( a_expr ) > ( > PARTITION name, > ... > ); Unless someone comes up with a maintenance plan for stable hash functions, we should probably not dare look into this yet.
On Fri, 2009-10-30 at 00:10 +0200, Peter Eisentraut wrote: > On tor, 2009-10-29 at 11:15 +0900, Itagaki Takahiro wrote: > > Range partitioning: > > CREATE TABLE table_name ( columns ) > > PARTITION BY RANGE ( a_expr ) > > ( > > PARTITION name VALUES LESS THAN [(] const [)], > > PARTITION name VALUES LESS THAN [(] MAXVALUE [)] -- overflow partition > > ); > > Maybe this needs to mention the actual operator name instead of LESS > THAN, in case the operator is not named < or the user wants to use a > different one. I can't help but wonder if the PERIOD type might be better for representing a partition range. It would make it easier to express and enforce the constraint that no two partition ranges overlap ;) Regards,Jeff Davis
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > > The keyword "PARTITION" is > > added to the full-reserved keyword list to support ADD/DROP PARTITION. > > Any chance to avoid that? PARTITION seems like something people might > well use as a column or variable name. OTOH, it is reserved in SQL2008 > and SQL2003. CREATE TABLE does not require PARTITION to be a reserved keyword, but there are conflicts in ALTER TABLE ADD/DROP PARTITION: * ALTER TABLE ... DROP [COLUMN] name [CASCADE | RESTRICT] * ALTER TABLE ... DROP PARTITION name [CASCADE | RESTRICT] There are some solutions: 1. Change COLUMN not to an optional word (unlikely)2. Change syntax of DROP PARTITION to DROP TABLE PARITION or so3. ChangeALTER TABLE ADD/DROP PARTITION to top level => CREATE/DROP PARTITION name ON table_name Any better ideas? Regards, --- ITAGAKI Takahiro NTT Open Source Software Center
On Thu, Oct 29, 2009 at 9:51 PM, Itagaki Takahiro <itagaki.takahiro@oss.ntt.co.jp> wrote: > > Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > >> > The keyword "PARTITION" is >> > added to the full-reserved keyword list to support ADD/DROP PARTITION. >> >> Any chance to avoid that? PARTITION seems like something people might >> well use as a column or variable name. OTOH, it is reserved in SQL2008 >> and SQL2003. > > CREATE TABLE does not require PARTITION to be a reserved keyword, > but there are conflicts in ALTER TABLE ADD/DROP PARTITION: > > * ALTER TABLE ... DROP [COLUMN] name [CASCADE | RESTRICT] > * ALTER TABLE ... DROP PARTITION name [CASCADE | RESTRICT] > > There are some solutions: > > 1. Change COLUMN not to an optional word (unlikely) > 2. Change syntax of DROP PARTITION to DROP TABLE PARITION or so > 3. Change ALTER TABLE ADD/DROP PARTITION to top level > => CREATE/DROP PARTITION name ON table_name > > Any better ideas? I'm not sure if this is better, but what about: CREATE PARTITION name ON TABLE name DROP PARTITION name Since partitions will live in pg_class and are in some sense "top level" objects, it seems like it would make sense to use a syntax that is similar to the one we use for indices... we can't say "DROP COLUMN name", because the table must be specified. But a partition name must be unambiguous, so making the user write it out explicitly doesn't seem friendly. ...Robert
Peter Eisentraut <peter_e@gmx.net> wrote: > On tor, 2009-10-29 at 11:15 +0900, Itagaki Takahiro wrote: > > Range partitioning: > > CREATE TABLE table_name ( columns ) > > PARTITION BY RANGE ( a_expr ) > > ( > > PARTITION name VALUES LESS THAN [(] const [)], > > PARTITION name VALUES LESS THAN [(] MAXVALUE [)] -- overflow partition > > ); > > Maybe this needs to mention the actual operator name instead of LESS > THAN, in case the operator is not named < or the user wants to use a > different one. How about to use "sortby" or "index_elem" here? PARTITION BY RANGE '(' sortby-or-index_elem ')' '(' RangePartitions ')' sortby: a_expr USING qual_all_Op opt_nulls_order | a_expr opt_asc_desc opt_nulls_order index_elem: ColId opt_class opt_asc_desc opt_nulls_order | func_expr opt_class opt_asc_desc opt_nulls_order | '(' a_expr ')' opt_class opt_asc_desc opt_nulls_order We should allow only btree operator class here because we need to extract GREATER-THAN-OR-EQUAL operator from LESS THAN. In addition, we will be able to optimize parition search in the future if we restrict a range partition key should be comparable scalar value. Multidimensional partitioning will be implemented with another approach, something like "PARTITION BY GIST", because it would require different oprimization from range partitioning. BTW, "PARTITION BY <pg_am.amname>" crossed my mind here, but it is not well-investigated yet. Regards, --- ITAGAKI Takahiro NTT Open Source Software Center
On Thu, Oct 29, 2009 at 7:14 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> CREATE TABLE does not require PARTITION to be a reserved keyword, >> but there are conflicts in ALTER TABLE ADD/DROP PARTITION: >> >> * ALTER TABLE ... DROP [COLUMN] name [CASCADE | RESTRICT] >> * ALTER TABLE ... DROP PARTITION name [CASCADE | RESTRICT] >> >> There are some solutions: Do we need a DROP PARTITION command at all? What would it even do? Drop the partition from the parent table and throw it away in one step? I think in actual practice people usually remove the partition from the parent table first, then do things like archive it before actually throwing it away. -- greg
Greg Stark <gsstark@mit.edu> wrote: > >> * ALTER TABLE ... DROP [COLUMN] name [CASCADE | RESTRICT] > >> * ALTER TABLE ... DROP PARTITION name [CASCADE | RESTRICT] > > Do we need a DROP PARTITION command at all? What would it even do? Currently no. So, it would be good to treat PARTITION as just a synonym of TABLE. Not only DROP PARTITION but also ALTER PARTITION will work. > CREATE PARTITION name ON table_name > DROP PARTITION name ALTER PARTITION name ... We might need to specify partition keys with another syntax. ALTER TABLE will have only one new command "PARTITION BY". and we reuse TABLE command for PARTITION in other operations. ALTER TABLE table_name PARTITION BY RANGE (expr) (...) Regards, --- ITAGAKI Takahiro NTT Open Source Software Center
On Thu, 2009-10-29 at 15:19 -0700, Jeff Davis wrote: > I can't help but wonder if the PERIOD type might be better for > representing a partition range. It would make it easier to express and > enforce the constraint that no two partition ranges overlap ;) I can't help but wonder if the period type might better be a generic container for pairs of scalar, totally-ordered types.
On Thu, 2009-10-29 at 11:15 +0900, Itagaki Takahiro wrote: > I'd like to improve partitioning feature in 8.5. Nice. > Here is syntax I propose: <snip> Is this the same as / similar to Oracle's syntax? IIRC Nikhil's patch was Oracle's syntax, and I prefer having that one instead of inventing our own wheel. Regards, -- Devrim GÜNDÜZ, RHCE Command Prompt - http://www.CommandPrompt.com devrim~gunduz.org, devrim~PostgreSQL.org, devrim.gunduz~linux.org.tr http://www.gunduz.org Twitter: http://twitter.com/devrimgunduz
On Fri, 2009-10-30 at 10:03 +0200, Peter Eisentraut wrote: > I can't help but wonder if the period type might better be a generic > container for pairs of scalar, totally-ordered types. That would be ideal. However, it doesn't really look like our type system was built to handle that kind of thing. We could use typmod, I suppose, but even that won't hold a full Oid. Any ideas/suggestions? Regards,Jeff Davis
>> PARTITION BY RANGE ( a_expr )
>> ...
>> PARTITION BY HASH ( a_expr )
>> PARTITIONS num_partitions;
> Unless someone comes up with a maintenance plan for stable hash functions, we should probably not dare look into this yet.
What would cover the common use case of per-day quals and drops over an extended history period, say six or nine months? You don't get quite the same locality of reference, generally, with an unpartitioned table, due to slop in the arrival of rows. Ideally, you don't want to depend on an administrator, or even an administrative script, to continually intervene in the structure of a table, as would be the case with partitioning by range, and you don't want to coalesce multiple dates, as an arbitrary hash might do. What the administrator would want would be to decide what rows were too old to keep, then process (e.g. archive, summarize, filter) and delete them.
Suppose that the number of partitions were taken as a hint rather than as a naming modulus, and that any quasi-hash function had to be specified explicitly (although storage assignment could be based on a hash of the quasi-hash output). If a_expr were allowed to include a to-date conversion of a timestamp, day-by-day partitioning would fall out naturally. If, in addition, single-parameter (?) functions were characterized as range-preserving and order-preserving, plan generation could be improved for time ranges on quasi-hash-partitioned tables, without a formal indexing requirement.
There are cases where additional partition dimensions would be useful, for eventual parallelized operation on large databases, and randomizing quasi-hash functions would help. IMHO stability is not needed, except to the extent that hash functions have properties that lend themselves to plan generation and/or table maintenance.
It is not clear to me what purpose there would be in dropping a partition. This would be tantamount to deleting all of the rows in a partition, if it were analogous to dropping a table, and would require some sort of compensatory aggregation of existing partitions (in effect, a second partitioning dimension), if it were merely structural.
Perhaps I'm missing something here.
David Hudson
>> ...
>> PARTITION BY HASH ( a_expr )
>> PARTITIONS num_partitions;
> Unless someone comes up with a maintenance plan for stable hash functions, we should probably not dare look into this yet.
What would cover the common use case of per-day quals and drops over an extended history period, say six or nine months? You don't get quite the same locality of reference, generally, with an unpartitioned table, due to slop in the arrival of rows. Ideally, you don't want to depend on an administrator, or even an administrative script, to continually intervene in the structure of a table, as would be the case with partitioning by range, and you don't want to coalesce multiple dates, as an arbitrary hash might do. What the administrator would want would be to decide what rows were too old to keep, then process (e.g. archive, summarize, filter) and delete them.
Suppose that the number of partitions were taken as a hint rather than as a naming modulus, and that any quasi-hash function had to be specified explicitly (although storage assignment could be based on a hash of the quasi-hash output). If a_expr were allowed to include a to-date conversion of a timestamp, day-by-day partitioning would fall out naturally. If, in addition, single-parameter (?) functions were characterized as range-preserving and order-preserving, plan generation could be improved for time ranges on quasi-hash-partitioned tables, without a formal indexing requirement.
There are cases where additional partition dimensions would be useful, for eventual parallelized operation on large databases, and randomizing quasi-hash functions would help. IMHO stability is not needed, except to the extent that hash functions have properties that lend themselves to plan generation and/or table maintenance.
It is not clear to me what purpose there would be in dropping a partition. This would be tantamount to deleting all of the rows in a partition, if it were analogous to dropping a table, and would require some sort of compensatory aggregation of existing partitions (in effect, a second partitioning dimension), if it were merely structural.
Perhaps I'm missing something here.
David Hudson
Jeff Davis wrote: > On Fri, 2009-10-30 at 10:03 +0200, Peter Eisentraut wrote: >> I can't help but wonder if the period type might better be a generic >> container for pairs of scalar, totally-ordered types. > > That would be ideal. However, it doesn't really look like our type > system was built to handle that kind of thing. > > We could use typmod, I suppose, but even that won't hold a full Oid. Any > ideas/suggestions? Wait, it doesn't? A typmod is a 32-bit integer, like Oids. Am I missing something? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Fri, 2009-10-30 at 19:12 +0200, Heikki Linnakangas wrote: > Wait, it doesn't? A typmod is a 32-bit integer, like Oids. Am I missing > something? Oid is unsigned, typmod is signed. We might be able to get away with it, but -1 is treated specially in some places outside of the type-specific functions, e.g. exprTypmod(). I haven't looked at all of these places yet, so maybe a few simple changes would allow us to treat typmod as a full 32 bits. Or perhaps it could just be expanded to a signed 64-bit int. What do you think? Regards,Jeff Davis
* Jeff Davis (pgsql@j-davis.com) wrote: > Oid is unsigned, typmod is signed. We might be able to get away with it, > but -1 is treated specially in some places outside of the type-specific > functions, e.g. exprTypmod(). Any negative result returned from the input handling function is considered an error, as I recall. It's more than just '-1'.. > I haven't looked at all of these places yet, so maybe a few simple > changes would allow us to treat typmod as a full 32 bits. Or perhaps it > could just be expanded to a signed 64-bit int. What do you think? That was shot down previously due to the way typmods are passed around currently.. Not that it wouldn't be really nice.. Thanks, Stephen
Jeff Davis <pgsql@j-davis.com> writes: > I haven't looked at all of these places yet, so maybe a few simple > changes would allow us to treat typmod as a full 32 bits. Or perhaps it > could just be expanded to a signed 64-bit int. What do you think? Neither is likely to happen, and even disregarding that, I doubt people would be real happy with a design like this. Where are you going to put the typmod for the contained type? regards, tom lane
On Fri, Oct 30, 2009 at 5:16 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Jeff Davis <pgsql@j-davis.com> writes: >> I haven't looked at all of these places yet, so maybe a few simple >> changes would allow us to treat typmod as a full 32 bits. Or perhaps it >> could just be expanded to a signed 64-bit int. What do you think? > > Neither is likely to happen, and even disregarding that, I doubt people > would be real happy with a design like this. Where are you going to > put the typmod for the contained type? IMO, the real problem is that the type interface is poorly encapsulated. There's way too much code that knows about the internal details of a type - namely, that it's a 32-bit integer modified by a second 32-bit integer. I think there are still places where the code doesn't even know about typmod. If we're going to go to the trouble of changing anything, I think it should probably involve inserting an abstraction layer that will make future extensions easier. But I have a feeling that's going to be a tough sell. ...Robert
On Fri, 2009-10-30 at 17:39 -0400, Robert Haas wrote: > IMO, the real problem is that the type interface is poorly > encapsulated. There's way too much code that knows about the internal > details of a type - namely, that it's a 32-bit integer modified by a > second 32-bit integer. I think there are still places where the code > doesn't even know about typmod. If we're going to go to the trouble > of changing anything, I think it should probably involve inserting an > abstraction layer that will make future extensions easier. But I have > a feeling that's going to be a tough sell. Yeah. We're way off topic for partitioning, so I think it's best to just table this discussion until someone comes up with a good idea. It's not the end of the world to write some generic C code, and have multiple types make use of it, e.g. PERIOD, PERIODTZ, INT4RANGE, FLOAT8RANGE, etc. It's a little redundant and creates some catalog bloat, but I'm not too concerned about it right now. Certainly not enough to rewrite the type system. Regards,Jeff Davis
Robert Haas <robertmhaas@gmail.com> writes: > IMO, the real problem is that the type interface is poorly > encapsulated. There's way too much code that knows about the internal > details of a type - namely, that it's a 32-bit integer modified by a > second 32-bit integer. I think there are still places where the code > doesn't even know about typmod. If we're going to go to the trouble > of changing anything, I think it should probably involve inserting an > abstraction layer that will make future extensions easier. But I have > a feeling that's going to be a tough sell. Yup, you're right. It would be an enormous amount of work and break a lot of third-party code, for largely hypothetical future benefits. We've got better places to invest our limited manpower. regards, tom lane
Devrim GNDZ <devrim@gunduz.org> wrote: > Is this the same as / similar to Oracle's syntax? Yes. > IIRC Nikhil's patch was Oracle's syntax No. See: http://download.oracle.com/docs/cd/B28359_01/server.111/b28286/statements_7002.htm#i2125922 Regards, --- ITAGAKI Takahiro NTT Open Source Software Center
Hi, >> Is this the same as / similar to Oracle's syntax? > > Yes. > >> IIRC Nikhil's patch was Oracle's syntax > > No. See: > http://download.oracle.com/docs/cd/B28359_01/server.111/b28286/statements_7002.htm#i2125922 > Any decent prevalent partitioning syntax should be ok IMHO. IIRC, MySQL paritioning syntax is also pretty similar to Oracle's. Regards, Nikhils -- http://www.enterprisedb.com
Here are details of partitioning syntax. ----------------- Syntax overview ----------------- Partitions are defined with 3 steps: 1. Create a plain table as parent. 2. Set a partition key to a table. 3. Add a partition to a table which has a partition key. i.e., CREATE TABLE table (...) PARTITION BY { RANGE | LIST } ( key ) (...) is just an abbreviated form of: 1. CREATE TABLE table (...); 2. ALTER TABLE table PARTITION BY { RANGE | LIST } ( key ); 3. CREATE PARTITION name ON table VALUES ...; Currently RANGE and LIST partitions are supported. No reserved keywords are required by the syntax, and that's why ALTER TABLE ADD PARTITION cannot be used here instead of CREATE PARTITION. HASH partitions are not supported, but we can use LIST paritions with an expression key as incomplete HASH partitions: CREATE TABLE table (...) PARTITION BY LIST ( hashtext(attr) ) (...); SELECT * FROM table WHERE hashtext(attr) = hashtext('search_key'); ------------------------------------------ Features *NOT* included in this proposal ------------------------------------------ To simplify patch, the following features are not included: - Partition triggers to dispatch rows inserted into parent table - Expanding some commands for a parent to partitions (ex. VACUUM) - ALTER commands except RENAME (ex. MERGE, SPLIT, UPDATE) - Ability to add an existing table to a parent as a partition (ex. ALTER TABLE table INHERIT parent AS PARTITION) - Planner and Executor improvements I have plans to implement some of them in separated patches, but not now. ----------------- Catalog changes ----------------- A new system catalog "pg_partition" represents partition keys for each table. A parent table of partitions has only one pg_partition row. I think separated pg_partition table is better than adding these columns to pg_class, but it might be debatable. CREATE TABLE pg_partition ( partrelid oid REFERENCES oid ON pg_class, -- partitioned table oid partopr oid REFERENCES oid ON pg_operator, -- operator to comapre keys partkind "char", -- kind of partition: 'R' (range) or 'L' (list) partkey text, -- expression tree of partition key PRIMARY KEY (partrelid) ) WITHOUT OIDS; In addition, we would need to store threshold values of child tables somewhere, but under consideration. I'm thinking to extract upper and lower bounds from CHECK constraint, but it might be unreliable. Comments and ideas welcome. ---------------- Syntax details ---------------- CREATE TABLE table (...) PARTITION BY RANGE ( expr [USING operator] ) [ ( PARTITION name VALUES LESS THAN [(] upper [)], PARTITION name VALUES LESS THAN [(] MAXVALUE [)] -- overflow partition ) ] ; -- default operator is '<' for range partitions CREATE TABLE table (...) PARTITION BY LIST ( expr [USING operator] ) [ ( PARTITION name VALUES [IN] ( values ), PARTITION name VALUES [IN] [(] DEFAULT [)] -- overflow partition ) ] ; -- default operator is '=' for list partitions ALTER TABLE table PARTITION BY { RANGE | LIST } ... ; ALTER TABLE table NO PARTITION; -- drop partition key CREATE PARTITION partition ON table VALUES LESS THAN ...; -- range CREATE PARTITION partition ON table VALUES IN (...); -- list DROP PARTITION partition; -- synonym for DROP TABLE ALTER PARTITION partition RENAME TO name; -- synonym for ALTER TABLE RENAME Note: * Each partition can have optional WITH (...) and TABLESPACE clauses. * '(' and ')' are optional to support both Oracle and MySQL syntax. ----------- WIP patch ----------- The attached partitioning_20091102.patch is a WIP patch. There are still not implemented features marked with TODO tags, but I'll use this design -- especially Node manipulations. Please notice me if I'm missing something. Regards, --- ITAGAKI Takahiro NTT Open Source Software Center
Attachment
Here is a WIP partitioning patch. The new syntax are: 1. CREATE TABLE parent (...); 2. ALTER TABLE parent PARTITION BY { RANGE | LIST } ( key ); 3. CREATE TABLE child (...); 4. ALTER TABLE child INHERIT parent AS PARTITION VALUES ...; We can also use "CREATE TABLE PARTITION BY" as 1+2+3+4 and "CREATE PARTITION" as 3+4. I think "INHERIT AS PARTITION" is rarely used typically, but such orthogonality seems to be cleaner. The most complex logic of the patch is in ATExecAddInherit(). It scans existing partitions and generate CHECK constraint for the new partition. Any comments to the design? If no objections, I'd like to stop adding features in this CommitFest and go for remaining auxiliary works -- pg_dump, object dependency checking, documentation, etc. > ----------------- > Catalog changes > ----------------- In addition to pg_partition, I added pg_inherits.inhvalues field. The type of field is "anyarray" and store partition values. For range partition, an upper bound value is stored in the array. For list partition, list values are stored in it. These separated value fields will be useful to implement partition triggers in the future. In contrast, reverse engineering of check constraints is messy. CATALOG(pg_inherits,2611) BKI_WITHOUT_OIDS { Oid inhrelid; Oid inhparent; int4 inhseqno; anyarray inhvalues; /* values for partition */ } FormData_pg_inherits; > CREATE TABLE pg_partition ( > partrelid oid REFERENCES oid ON pg_class, -- partitioned table oid > partopr oid REFERENCES oid ON pg_operator, -- operator to compare keys > partkind "char", -- kind of partition: 'R' (range) or 'L' (list) > partkey text, -- expression tree of partition key > PRIMARY KEY (partrelid) > ) WITHOUT OIDS; ------------------------------ Limitations and Restrictions ------------------------------ * We can create a new partition as long as partitioning keys are not conflicted with existing partitions. Especially, we cannot add any partitions if we have overflow partitions because a new partition always split the overflow partition. * We cannot reuse an existing check constraint as a partition constraint. ALTER TABLE INHERIT AS PARTITION brings on a table scan to add a new CHECK constraint. * No partition triggers nor planner and executor improvements. It would come in the future development. Regards, --- ITAGAKI Takahiro NTT Open Source Software Center
Attachment
I added psql and pg_dump support to Partitioning Syntax patch. Paritioning information can be retrieved with a new system function pg_get_partitiondef(parentRelid). Both psql and pg_dump use it. There are some changes from the last patch. Some of them seem to be a bit ugly. Ideas welcome. * If a table with the same name already exists when a partition is created, the table is re-used as partition. This behavior is required for pg_dump to be simple. * Don't create a new check constraint when a table is attached as partition to a parent table if the child table has constraints with the same definition. This behavior is required for pg_dump not to add duplicated check constraints in repeated dump and restore. * Inheritance is used for partitions, but pg_dump doesn't dump them as inheritance; It dump a child table without inheritance first, and re-add inheritance with ALTER TABLE PARTITION BY. PartitionInfo is added as a DumpableObject in pg_dump. * Dependencies of objects are managed with existing depencency manager except a check constraint to partition values. Partition constraints can be dropped even if the table is still in the partitioning set. A patch attached, and I'll summarize it: ==== Syntax ==== CREATE TABLE parent (...) PARTITION BY { RANGE | LIST } ( key [ USING oprator ] ) ( <partitions> ); ALTER TABLE parent PARTITION BY { RANGE | LIST } ...; CREATE PARTITION partition ON parent VALUES ...; <partitions> := PARTITION name VALUES LESS THAN { range_upper | MAXVALUE } | PARTITION name VALUES IN ( list_value [,...] | DEFAULT ) ==== System Catalog ==== CREATE TABLE pg_partition ( partrelid oid UNIQUE REFARENCES pg_class(oid), partopr oid REFARENCES pg_operatoroid), partkind "char", -- 'R':RANGE or 'L':LIST partkey text -- node dump of the partition key ) WITHOUT OIDS; CREATE TABLE pg_inherits ( inhrelid oid, inhparent oid, inhseqno integer, + inhvalues anyarray -- Non-null if the inheritance is for partitioning. ) WITHOUT OIDS; ==== Sample output from psql ==== =# \d sales_range Table "public.sales_range" Column | Type | Modifiers ---------------+-----------------------------+----------- salesman_id | numeric(5,0) | salesman_name | character varying(30) | sales_state | character varying(20) | sales_date | timestamp without time zone | Partitions: PARTITION BY RANGE ( sales_date USING < ) ( PARTITION sales_2006 VALUES LESS THAN '2007-01-01 00:00:00', ... ) ==== Sample output from pg_dump ==== CREATE TABLE sales_range (...); CREATE TABLE sales_2006 (...); -- without inheritance ALTER TABLE public.sales_range PARTITION BY RANGE ( sales_date USING < ) ( PARTITION sales_2006 VALUES LESS THAN '2007-01-01 00:00:00', ... ); Regards, --- ITAGAKI Takahiro NTT Open Source Software Center
Attachment
Hi, I'm reviewing your patch. The patch applies without problems and the feature works as advertised. I have yet to look at the code in detail, but it looks sane and seems to work. However, this looks like a mistake: partinfo = (PartitionInfo *) malloc(ntups * sizeof(PartitionInfo)); or am I missing something? The syntax itself seems a bit weird in some cases. Say you have: PARTITION BY RANGE ( foo USING > ) (PARTITION bar VALUES LESS THAN 0 ); which translates to CHECK (bar > 0). That doesn't sound at all like LESS THAN to me. This syntax seems to be the same Oracle uses, and I think it's nice for the general case, but I think the reversed operator weirdness is a bit too much. Maybe we should use something like PARTITION bar VALUES OPERATOR 0 when the user specifies the operator? Regards, Marko Tiikkaja
On Thu, Nov 12, 2009 at 5:54 AM, Itagaki Takahiro <itagaki.takahiro@oss.ntt.co.jp> wrote: > I added psql and pg_dump support to Partitioning Syntax patch. > Paritioning information can be retrieved with a new system function > pg_get_partitiondef(parentRelid). Both psql and pg_dump use it. > i haven't seen the patch but: > > * If a table with the same name already exists when a partition > is created, the table is re-used as partition. This behavior > is required for pg_dump to be simple. > i guess the table must be empty, if not we should be throw an error... and i actually prefer some more explicit syntax for this not just reusing a table > PARTITION name VALUES LESS THAN { range_upper | MAXVALUE } > | PARTITION name VALUES IN ( list_value [,...] | DEFAULT ) > i remember someone making a comment about actually using operators instead of LESS THEN and family -- Atentamente, Jaime Casanova Soporte y capacitación de PostgreSQL Asesoría y desarrollo de sistemas Guayaquil - Ecuador Cel. +59387171157
Jaime Casanova wrote: >> * If a table with the same name already exists when a partition >> is created, the table is re-used as partition. This behavior >> is required for pg_dump to be simple. >> > > i guess the table must be empty, if not we should be throw an error... > and i actually prefer some more explicit syntax for this not just > reusing a table I'd be OK with only a notification - even if the table wasn't empty -, similar to how inheritance combines rows currently. The patch currently silently reuses the table unless it has rows which don't satisfy the CHECK constraint, in which case it gives you the default CHECK constraint error. >> PARTITION name VALUES LESS THAN { range_upper | MAXVALUE } >> | PARTITION name VALUES IN ( list_value [,...] | DEFAULT ) >> > > i remember someone making a comment about actually using operators > instead of LESS THEN and family That doesn't sound like a bad idea.. Regards, Marko Tiikkaja
On Thu, 2009-10-29 at 11:15 +0900, Itagaki Takahiro wrote: > I think syntax support is a good start. I don't see a syntax-only patch as being any use at all to this community. We go to enormous lengths in other areas to never allow patches with restrictions. Why would we allow a patch that is essentially 100% restriction? i.e. It does nothing at all. Worse than that, it will encourage people to believe it exists in full, when that isn't the case. The syntax has never really been in question, so it doesn't really move us forwards in any direction. This is exactly the kind of shallow feature we have always shied away from and that other databases have encouraged. The only reason I can see is that it allows people to develop non-open source code that matches how Postgres will work when we get our act together. That seems likely to discourage, rather than encourage the funding of this work for open source. It may even muddy the water for people that don't understand that the real magic happens in the internals, not in the syntax. Why not just wait until we have a whole patch and then apply? -- Simon Riggs www.2ndQuadrant.com
On Tue, Nov 17, 2009 at 4:31 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > On Thu, 2009-10-29 at 11:15 +0900, Itagaki Takahiro wrote: > >> I think syntax support is a good start. > > I don't see a syntax-only patch as being any use at all to this > community. > > We go to enormous lengths in other areas to never allow patches with > restrictions. Why would we allow a patch that is essentially 100% > restriction? i.e. It does nothing at all. Worse than that, it will > encourage people to believe it exists in full, when that isn't the case. > > The syntax has never really been in question, so it doesn't really move > us forwards in any direction. This is exactly the kind of shallow > feature we have always shied away from and that other databases have > encouraged. > > The only reason I can see is that it allows people to develop non-open > source code that matches how Postgres will work when we get our act > together. That seems likely to discourage, rather than encourage the > funding of this work for open source. It may even muddy the water for > people that don't understand that the real magic happens in the > internals, not in the syntax. > > Why not just wait until we have a whole patch and then apply? Because big patches are really hard to get applied. Personally, I think a syntax-only patch makes a lot of sense, as long as the design is carefully thought about so that it can serve as a foundation for future work in this area. I don't think "the whole patch" is even necessarily a well-defined concept in this instance: different people could have very different ideas about what would constitute a complete solution, or which aspects of a complete solution are most important or should be pursued first. Settling on a syntax, and an internal representation for that syntax, seems like it will make subsequent discussions about those projects considerably more straightforward, and it has some value in and of itself since similar notation is used by other databases. At least, that's MHO. ...Robert
Marko Tiikkaja <marko.tiikkaja@cs.helsinki.fi> wrote: > Jaime Casanova wrote: > >> PARTITION name VALUES LESS THAN { range_upper | MAXVALUE } > >> | PARTITION name VALUES IN ( list_value [,...] | DEFAULT ) > > > > i remember someone making a comment about actually using operators > > instead of LESS THEN and family > > That doesn't sound like a bad idea.. I prefer to use widely-used syntax instead of postgres original one. Oracle and MySQL already use "LESS THAN" and "IN" for partitioning. I assume almost all user only use the default operators. I don't want to break de facto standard for small utilization area. I think truly what we want is a new partition "kind" in addition to RANGE and LIST. If we want to split geometric data into paritions, we need to treat the the partition key with gist-list operation. I agree with a plan to add some additional parition kinds, but want to keep RANGE and LIST partitions in the current syntax. Regards, --- ITAGAKI Takahiro NTT Open Source Software Center
Simon Riggs <simon@2ndQuadrant.com> wrote: > Why not just wait until we have a whole patch and then apply? "A whole patch" can be written by many contributers instead of only one person, no? I think we need to split works for partitioning into serveral parts to encourage developing it. I just did one of the parts, "syntax". Anothe patch "Partitioning option for COPY" will do a good job in the field of "INSERT". Regards, --- ITAGAKI Takahiro NTT Open Source Software Center
Jaime Casanova <jcasanov@systemguards.com.ec> wrote: > > * If a table with the same name already exists when a partition > > is created, the table is re-used as partition. This behavior > > is required for pg_dump to be simple. > > i guess the table must be empty, if not we should be throw an error... > and i actually prefer some more explicit syntax for this not just > reusing a table Yeah, an explicit syntax is better. I've researched other syntax, but I cannot find any good ones. * ALTER TABLE child INHERIT parent AS PARTITION => implemenation "PARTITION is an INHERIT" is revealed to user.* ALTERPARTITION child ATTACH TO parent => child is not a partition yet at that point.* ALTER TABLE parent ADD PARTITIONchild => "partition" need to be a full-reserved word. Are there better idea? Regards, --- ITAGAKI Takahiro NTT Open Source Software Center
Marko Tiikkaja <marko.tiikkaja@cs.helsinki.fi> wrote: > this looks like a mistake: > partinfo = (PartitionInfo *) malloc(ntups * sizeof(PartitionInfo)); Oops, it should be "p"alloc. Thanks. > Maybe we should use something like > PARTITION bar VALUES OPERATOR 0 > when the user specifies the operator? I think we could have reasonable restrictions to the operator for future optimization. Is the VALUES OPERATOR syntax too freedom? For the same reason, USING operator also might be too freedom. RANGE (and maybe also LIST) partition keys should be sortable, operator class name might be better to the option instead of any operators. i.e., PARTITION BY RANGE ( foo [ USING operator ] ) should be: PARTITION BY RANGE ( foo [ btree_ops_name ] ) If we do so, there will be no inconsistency in LESS THAN syntax because btree_ops always have < operator. Regards, --- ITAGAKI Takahiro NTT Open Source Software Center
On Wed, 2009-11-18 at 13:24 +0900, Itagaki Takahiro wrote: > Simon Riggs <simon@2ndQuadrant.com> wrote: > > > Why not just wait until we have a whole patch and then apply? > > "A whole patch" can be written by many contributers instead of only > one person, no? I think we need to split works for partitioning > into serveral parts to encourage developing it. I just did one of > the parts, "syntax". Anothe patch "Partitioning option for COPY" > will do a good job in the field of "INSERT". If we can agree the parts that are required, I would at least be confident that we have understood this enough to allow one part to proceed ahead of the others. For partitioning the parts are these 1. Syntax for explicit partitioning 2. Internal data representations 3. Optimizations many and various 4. Data Routing a) Data routing on INSERT/COPY b) UPDATE handling when the UPDATE causes partition migration If this patch puts forward a solution for (2) also, then it is potentially worthwhile. That is the real blocking point here. Once we have that other people will quickly fill in the later parts. I foresee a data structure that is a sorted list of boundary-values, cached on the parent-relation. This should be accessible to allow bsearch of particular values during both planning and execution. Same rules apply as btree operator classes. For multi-level hierarchies the parent level should have the union of all sub-hierarchies. I think we need an index on pg_inherits also. So please do (1) and (2), not just (1) in isolation. -- Simon Riggs www.2ndQuadrant.com
Hi, Robert Haas wrote: > Settling on a syntax, and an internal representation for that syntax, I've been under the impression that this was only about syntax. What are the internal additions? Generally speaking, I'd agree with Simon or even vote for doing the internals first and add the syntactic sugar only later on. > seems like it will make subsequent > discussions about those projects considerably more straightforward, ..or subsequent implementations more complicated, because you have to support an awkward syntax. > and it has some value in and of itself since similar notation is used > by other databases. That point is well taken, but it would be more compelling if it were the same or at least a compatible syntax. Regards Markus Wanner
On Thu, Nov 19, 2009 at 9:58 AM, Markus Wanner <markus@bluegap.ch> wrote: > Hi, > > Robert Haas wrote: >> >> Settling on a syntax, and an internal representation for that syntax, > > I've been under the impression that this was only about syntax. What are the > internal additions? I haven't looked at it in detail, but it adds a new pg_partition table. Whether that table is suitably structured for use by the optimizer is not clear to me. > Generally speaking, I'd agree with Simon or even vote for doing the > internals first and add the syntactic sugar only later on. That's not really possible in this case. The internals consist of taking advantage of the fact that we have explicit knowledge of how the partitions are defined vs. just relying on the (slow) constraint exclusion logic. We can't do that unless, in fact, we have that explicit knowledge, and that requires inventing syntax. > That point is well taken, but it would be more compelling if it were the > same or at least a compatible syntax. There's been an effort to make it close, but I haven't followed it in enough detail to know how close. ...Robert
On ons, 2009-11-18 at 13:52 +0900, Itagaki Takahiro wrote: > > partinfo = (PartitionInfo *) malloc(ntups * sizeof(PartitionInfo)); > > Oops, it should be "p"alloc. Thanks. A very low-level comment: 1) Please stop casting the results of palloc and malloc. We are not writing C++ here. 2) I would prefer that you apply sizeof on the variable, not on the type. That way, the expression is independent of any type changes of the variable, and can be reviewed without having to scroll around for the variable definition. So how about, partinfo = palloc(ntups * sizeof(*partinfo));
Hi, >> > partinfo = (PartitionInfo *) malloc(ntups * sizeof(PartitionInfo)); > > 1) Please stop casting the results of palloc and malloc. We are not > writing C++ here. > I thought it was/is a good C programming practice to typecast (void *) always to the returning structure type!! Regards, Nikhils > 2) I would prefer that you apply sizeof on the variable, not on the > type. That way, the expression is independent of any type changes of > the variable, and can be reviewed without having to scroll around for > the variable definition. > > So how about, > > partinfo = palloc(ntups * sizeof(*partinfo)); > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers > -- http://www.enterprisedb.com
On fre, 2009-11-20 at 11:14 +0530, Nikhil Sontakke wrote: > Hi, > > >> > partinfo = (PartitionInfo *) malloc(ntups * sizeof(PartitionInfo)); > > > > 1) Please stop casting the results of palloc and malloc. We are not > > writing C++ here. > > > > I thought it was/is a good C programming practice to typecast (void *) > always to the returning structure type!! This could be preferable if you use sizeof on the type, so that you have an additional check that the receiving variable actually has that type. But if you use sizeof on the variable itself, it's unnecessary: You just declare the variable to be of some type earlier, and then the expression allocates ntups of it, without having to repeat the type information. > > Regards, > Nikhils > > > 2) I would prefer that you apply sizeof on the variable, not on the > > type. That way, the expression is independent of any type changes of > > the variable, and can be reviewed without having to scroll around for > > the variable definition. > > > > So how about, > > > > partinfo = palloc(ntups * sizeof(*partinfo)); > > > > > > -- > > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > > To make changes to your subscription: > > http://www.postgresql.org/mailpref/pgsql-hackers > > > > > > -- > http://www.enterprisedb.com >
Nikhil Sontakke <nikhil.sontakke@enterprisedb.com> writes: >>> partinfo = (PartitionInfo *) malloc(ntups * sizeof(PartitionInfo)); >> >> 1) Please stop casting the results of palloc and malloc. �We are not >> writing C++ here. > I thought it was/is a good C programming practice to typecast (void *) > always to the returning structure type!! Yes. The above is good style because it ensures that the variable you're assigning the pointer to is the right type to match the sizeof computation. In C++ you'd use operator new instead and still have that type-check without the cast, but indeed we are not writing C++ here. The *real* bug in the quoted code is that it's using malloc. There are a few places in PG where it's appropriate to use malloc not palloc, but pretty darn few. regards, tom lane
Peter Eisentraut <peter_e@gmx.net> writes: > 2) I would prefer that you apply sizeof on the variable, not on the > type. That way, the expression is independent of any type changes of > the variable, and can be reviewed without having to scroll around for > the variable definition. FWIW, I think the general project style has been the other way. Yes, it means you write the type name three times not once, but the other side of that coin is that it makes it more obvious what is happening (and gives you an extra chance to realize that the type you wrote is wrong ...) regards, tom lane
On Thu, 2009-11-19 at 10:53 -0500, Robert Haas wrote: > On Thu, Nov 19, 2009 at 9:58 AM, Markus Wanner <markus@bluegap.ch> wrote: > > Hi, > > > > Robert Haas wrote: > >> > >> Settling on a syntax, and an internal representation for that syntax, > > > > I've been under the impression that this was only about syntax. What are the > > internal additions? > > I haven't looked at it in detail, but it adds a new pg_partition > table. Whether that table is suitably structured for use by the > optimizer is not clear to me. If it does, then my review comments to Kedar still apply: * why do we want another catalog table? what's wrong with pg_inherits? It might need additional columns, and it certainly needs another index. * We need an internal data structure (discussed on this thread also). Leaving stuff in various catalog tables would not be the same thing at all. -- Simon Riggs www.2ndQuadrant.com
On Fri, Nov 20, 2009 at 2:08 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > On Thu, 2009-11-19 at 10:53 -0500, Robert Haas wrote: >> On Thu, Nov 19, 2009 at 9:58 AM, Markus Wanner <markus@bluegap.ch> wrote: >> > Hi, >> > >> > Robert Haas wrote: >> >> >> >> Settling on a syntax, and an internal representation for that syntax, >> > >> > I've been under the impression that this was only about syntax. What are the >> > internal additions? >> >> I haven't looked at it in detail, but it adds a new pg_partition >> table. Whether that table is suitably structured for use by the >> optimizer is not clear to me. > > If it does, then my review comments to Kedar still apply: > > * why do we want another catalog table? what's wrong with pg_inherits? > It might need additional columns, and it certainly needs another index. That might work, I haven't looked at it enough to be sure one way or the other. > * We need an internal data structure (discussed on this thread also). > Leaving stuff in various catalog tables would not be the same thing at > all. Ultimately I'm guessing that for query optimization we'll need to include the relevant info in the relcache. But I think that can wait until we're ready to actually make the optimizer changes - not much point in caching data that is never used. Right now I think it's enough to verify (which I haven't) that the schema of the catalog table is suitable for straightforward construction of the data that will eventually need to be cached. ...Robert
Hi, Sorry for commenting only now but I think that we need to be able to store the partitions in different tablespaces. Even if originally the create table creates all partitions in the same tablespace, individual partitions should be allowed to be moved in different tablespaces using alter table or alter partition. I think that other databases allows the user to define a tablespace for each partition in the create table statement. In a warehouse, you might want to split your partitions on different volumes and over time, move older partitions to storage with higher compression if that data is not to be accessed frequently anymore. Altering tablespaces for partitions is important in that context. Are you also planning to provide partitioning extensions to 'create table as'? Thanks Emmanuel > Here is a WIP partitioning patch. The new syntax are: > 1. CREATE TABLE parent (...); > 2. ALTER TABLE parent PARTITION BY { RANGE | LIST } ( key ); > 3. CREATE TABLE child (...); > 4. ALTER TABLE child INHERIT parent AS PARTITION VALUES ...; > > We can also use "CREATE TABLE PARTITION BY" as 1+2+3+4 and > "CREATE PARTITION" as 3+4. I think "INHERIT AS PARTITION" is rarely > used typically, but such orthogonality seems to be cleaner. > > The most complex logic of the patch is in ATExecAddInherit(). It scans > existing partitions and generate CHECK constraint for the new partition. > > Any comments to the design? If no objections, I'd like to stop adding > features in this CommitFest and go for remaining auxiliary works > -- pg_dump, object dependency checking, documentation, etc. > > >> ----------------- >> Catalog changes >> ----------------- >> > In addition to pg_partition, I added pg_inherits.inhvalues field. > The type of field is "anyarray" and store partition values. > For range partition, an upper bound value is stored in the array. > For list partition, list values are stored in it. These separated > value fields will be useful to implement partition triggers in the > future. In contrast, reverse engineering of check constraints is messy. > > CATALOG(pg_inherits,2611) BKI_WITHOUT_OIDS > { > Oid inhrelid; > Oid inhparent; > int4 inhseqno; > anyarray inhvalues; /* values for partition */ > } FormData_pg_inherits; > > >> CREATE TABLE pg_partition ( >> partrelid oid REFERENCES oid ON pg_class, -- partitioned table oid >> partopr oid REFERENCES oid ON pg_operator, -- operator to compare keys >> partkind "char", -- kind of partition: 'R' (range) or 'L' (list) >> partkey text, -- expression tree of partition key >> PRIMARY KEY (partrelid) >> ) WITHOUT OIDS; >> > > ------------------------------ > Limitations and Restrictions > ------------------------------ > * We can create a new partition as long as partitioning keys > are not conflicted with existing partitions. Especially, > we cannot add any partitions if we have overflow partitions > because a new partition always split the overflow partition. > > * We cannot reuse an existing check constraint as a partition > constraint. ALTER TABLE INHERIT AS PARTITION brings on > a table scan to add a new CHECK constraint. > > * No partition triggers nor planner and executor improvements. > It would come in the future development. > > Regards, > --- > ITAGAKI Takahiro > NTT Open Source Software Center > > > ------------------------------------------------------------------------ > > > -- Emmanuel Cecchet Aster Data Web: http://www.asterdata.com
Emmanuel Cecchet <manu@asterdata.com> wrote: > I think that other databases allows the > user to define a tablespace for each partition in the create table > statement. WITH and TABLESPACE clause are supported for each partition. =# CREATE TABLE parent (...) PARTITION BY (key) ( PARTITION child_1 VALUES LESS THAN 10 WITH (...) TABLESPACE tbs_1 );=# CREATE PARTITION child_2 ON parent VALUES LESS THAN 20 WITH (...) TABLESPACE tbl_2; > Are you also planning to provide partitioning extensions to 'create > table as'? Ah, I forgot that. It would be possible to have the feature. There are no syntax issues. But it would be done after we support automatic INSERT routing. We can create the table will partitions, but tuples are not divided into child partitions because we have no insert-triggers at the time of CREATE TABLE AS. Regards, --- ITAGAKI Takahiro NTT Open Source Software Center
I just made a few updates to http://wiki.postgresql.org/wiki/Table_partitioning , merging in the stuff that had been on the ToDo page and expanding the links to discussion on this list a bit. The number of submitted patches over the last couple of years that handle some subset of the desired feature set here is really remarkable when you see them all together. -- Greg Smith 2ndQuadrant Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.com
Greg Smith wrote: > I just made a few updates to > http://wiki.postgresql.org/wiki/Table_partitioning , merging in the > stuff that had been on the ToDo page and expanding the links to > discussion on this list a bit. The number of submitted patches over > the last couple of years that handle some subset of the desired > feature set here is really remarkable when you see them all together. > Should we add the 'WITH (...) TABLESPACE tbs' options to the syntax since they are supported? Do we support ALTER ... SET TABLESPACE? Emmanuel -- Emmanuel Cecchet Aster Data Web: http://www.asterdata.com
Emmanuel Cecchet <manu@asterdata.com> wrote: > Should we add the 'WITH (...) TABLESPACE tbs' options to the syntax > since they are supported? Added the description. > Do we support ALTER ... SET TABLESPACE? DROP/ALTER PARTITION are synonyms for DROP/ALTER TABLE. SET TABLESPACE is also supported. Added the description. Regards, --- ITAGAKI Takahiro NTT Open Source Software Center
Here is an updated partitioning syntax patch. It implements syntax described here: http://wiki.postgresql.org/wiki/Table_partitioning#Syntax Changes: * New syntax: ALTER TABLE parent ATTACH/DETACH PARTITION. * Partition keys accepts an opclass name instead of an operator. * "lo <= key AND key < hi" is used in range check constraints instead of "key >= lo AND key < hi". ToDo items: * pg_get_partitiondef(parentOid, in_alter_format) might be ugly. It was designed only for psql -d and pg_dump. It might be cleaner if we move SQL formatter from the core function to client tools. In psql: pg_get_partitiondef(oid, false) Partitions: RANGE (sales_date) ( PARTITION sales_2006 VALUES LESS THAN '...', ... PARTITION sales_max VALUES LESS THAN MAXVALUE ) In pg_dump: pg_get_partitiondef(oid, true) ALTER TABLE parent PARTITION BY RANGE (sales_date); ALTER TABLE parent ATTACH PARTITION sales_2006 VALUES LESS THAN '...'; ... ALTER TABLE parent ATTACH PARTITION sales_max VALUES LESS THAN MAXVALUE; * The patch does not contain the following documentation, but I'll start writing them if the syntax is ok. - ddl-partitioning.sgml - alter-partition.sgml (new) - create-partition.sgml (new) - drop-partition.sgml (new) Note: * In fact, malloc was not a bug because it was the code in pg_dump. Comments welcome. Regards, --- ITAGAKI Takahiro NTT Open Source Software Center
Attachment
Here is an update partitioning syntax patch. A bug reported by Marko is fixed. Regards, --- ITAGAKI Takahiro NTT Open Source Software Center
Attachment
On Fri, 2009-12-04 at 11:54 +0900, Itagaki Takahiro wrote: > Here is an update partitioning syntax patch. > > A bug reported by Marko is fixed. I will review and eventually commit this, if appropriate, though it is 3rd in my queue and will probably not be done for at least 2 weeks, possibly 4 weeks. Some brief review comments * SQL:2008 contains PARTITION BY syntax, though in completely different context. A possible alternative would be to use PARTITIONED BY. Please justify either way. Possibly add short section to docs to explain this. * There are relatively few comments in-line. Please can you provide a README section for the code that explains how partitioning works? A reviewer's guide would also be helpful to explain some of the design decisions in particular places. * All of the tests use about 4 partitions, which is the kind of thing that makes me think the test coverage isn't wide enough. More tests please. This would include operations on 0?, 1 and many partitions. We also need more test failures, covering all the dumbass things people will attempt. Also need partitioning by strange datatypes, arrays, timestamps with timezones and stupidly long list values. Read Rob Treat's humorous dissection of earlier partitioning features at PGcon to see what needs to be covered. * Docs. This is looking fairly solid, so please begin working on docs. I won't hold you to this in next few weeks, but we know it needs doing. * It is essential that we have large real-world(ish) performance test results that proves this patch will work in the areas for which it is intended. We need a test with 500 partitions, using at least 10MB partitions to see if there are any scale-related issues. This test case will help set targets for later developments because it will highlight performance issues in planning, DDL and other areas. This doesn't have to be by the patch author, but we need to see evidence that this patch operates in its primary use case. -- Simon Riggs www.2ndQuadrant.com
Simon Riggs wrote: > I will review and eventually commit this, if appropriate, though it is > 3rd in my queue and will probably not be done for at least 2 weeks, > possibly 4 weeks. > I've marked Simon as the next reviewer and expected committer on this patch and have updated it to "Returned with Feedback". That's not saying work is going to stop on it. It just looks like that is going to extend beyond when we want this CommitFest to finish, and I want to pull it off the list of things I'm monitoring as part of that. Everyone should keep hammering away at nailing this fundamental bit down, so that the rest of the partitioning patch ideas floating around finally have a firm place to start attaching to. -- Greg Smith 2ndQuadrant Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.com
Greg Smith <greg@2ndquadrant.com> wrote: > I've marked Simon as the next reviewer and expected committer on this > patch and have updated it to "Returned with Feedback". OK. I'll re-submit improved patches in the next commit fest. Regards, --- ITAGAKI Takahiro NTT Open Source Software Center
On Fri, 2009-12-04 at 09:00 +0000, Simon Riggs wrote: > On Fri, 2009-12-04 at 11:54 +0900, Itagaki Takahiro wrote: > > Here is an update partitioning syntax patch. > > > > A bug reported by Marko is fixed. > > I will review and eventually commit this, if appropriate, though it is > 3rd in my queue and will probably not be done for at least 2 weeks, > possibly 4 weeks. I'll have to go back on this unfortunately, sorry about that. I have enough items emerging from HS to keep me busy much longer than I thought. I'll run with VF if that's OK, since I have some other related changes in that area and it makes sense to understand that code also, if OK with you. -- Simon Riggs www.2ndQuadrant.com
Simon Riggs <simon@2ndQuadrant.com> wrote: > I have enough items emerging from HS to keep me busy much longer than I > thought. I'll run with VF if that's OK, since I have some other related > changes in that area and it makes sense to understand that code also, if > OK with you. Sure. Many users want to see HS. BTW, New VACUUM FULL patch is waiting for being applied. https://commitfest.postgresql.org/action/patch_view?id=202 But I heard HS is attempting to modify VFI in another way or remove it completely. Do we still need the patch, or reject it and fix VFI in HS? Regards, --- Takahiro Itagaki NTT Open Source Software Center
On Tue, 2009-12-15 at 11:17 +0900, Takahiro Itagaki wrote: > Simon Riggs <simon@2ndQuadrant.com> wrote: > > > I have enough items emerging from HS to keep me busy much longer than I > > thought. I'll run with VF if that's OK, since I have some other related > > changes in that area and it makes sense to understand that code also, if > > OK with you. > > Sure. Many users want to see HS. > > BTW, New VACUUM FULL patch is waiting for being applied. > https://commitfest.postgresql.org/action/patch_view?id=202 > But I heard HS is attempting to modify VFI in another way or remove it > completely. Do we still need the patch, or reject it and fix VFI in HS? Plan is to apply patch for new VF, then for me to write another patch to allow new VF to work with system relations also. VACUUM FULL INPLACE would then be prohibited if recovery_connections = on, which given that is the default will pretty much reduce VFI to not working at all in 8.5. But it remains an option if problems occur. My intention is to keep all of the code there for 8.5 and then begin removing old VF code at beginning of 8.6dev. It's been there too long and is in far too deep to rip it out quickly. There's no mileage in spending time on removing a non-feature when there is feature work to be done. -- Simon Riggs www.2ndQuadrant.com
On Mon, Dec 14, 2009 at 7:29 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > On Fri, 2009-12-04 at 09:00 +0000, Simon Riggs wrote: >> On Fri, 2009-12-04 at 11:54 +0900, Itagaki Takahiro wrote: >> > Here is an update partitioning syntax patch. >> > >> > A bug reported by Marko is fixed. >> >> I will review and eventually commit this, if appropriate, though it is >> 3rd in my queue and will probably not be done for at least 2 weeks, >> possibly 4 weeks. > > I'll have to go back on this unfortunately, sorry about that. > the next patch for this will arrive in the next commitfest so maybe you have more time then -- Atentamente, Jaime Casanova Soporte y capacitación de PostgreSQL Asesoría y desarrollo de sistemas Guayaquil - Ecuador Cel. +59387171157
On Tue, 2009-12-15 at 11:17 +0900, Takahiro Itagaki wrote: > Simon Riggs <simon@2ndQuadrant.com> wrote: > > > I have enough items emerging from HS to keep me busy much longer than I > > thought. I'll run with VF if that's OK, since I have some other related > > changes in that area and it makes sense to understand that code also, if > > OK with you. > > Sure. Many users want to see HS. > > BTW, New VACUUM FULL patch is waiting for being applied. > https://commitfest.postgresql.org/action/patch_view?id=202 > But I heard HS is attempting to modify VFI in another way or remove it > completely. Do we still need the patch, or reject it and fix VFI in HS? HS has two options: (1) move forwards alongside this patch, or (2) make a separate fix for VFI in HS. We still want to apply this patch, but I feel it needs changes as discussed downthread as part of my review. Will it be possible for you to make those changes and resubmit for next commitfest, or earlier? -- Simon Riggs www.2ndQuadrant.com
On 15 December 2009 02:31, Jaime Casanova <jcasanov@systemguards.com.ec> wrote: > On Mon, Dec 14, 2009 at 7:29 PM, Simon Riggs <simon@2ndquadrant.com> wrote: >> On Fri, 2009-12-04 at 09:00 +0000, Simon Riggs wrote: >>> On Fri, 2009-12-04 at 11:54 +0900, Itagaki Takahiro wrote: >>> > Here is an update partitioning syntax patch. >>> > >>> > A bug reported by Marko is fixed. >>> >>> I will review and eventually commit this, if appropriate, though it is >>> 3rd in my queue and will probably not be done for at least 2 weeks, >>> possibly 4 weeks. >> >> I'll have to go back on this unfortunately, sorry about that. >> > > the next patch for this will arrive in the next commitfest so maybe > you have more time then So will this be revived any time soon? Were there any subsequent proposals which were posted? -- Thom Brown Twitter: @darkixion IRC (freenode): dark_ixion Registered Linux user: #516935 EnterpriseDB UK: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Nov 9, 2011 at 8:26 AM, Thom Brown <thom@linux.com> wrote: > > So will this be revived any time soon? Were there any subsequent > proposals which were posted? > there was an updated patch, you can find in this thread: http://archives.postgresql.org/message-id/20100114181323.9A33.52131E4D@oss.ntt.co.jp not sure what happens after that. -- Jaime Casanova www.2ndQuadrant.com Professional PostgreSQL: Soporte 24x7 y capacitación
On Wed, Nov 9, 2011 at 9:35 AM, Jaime Casanova <jaime@2ndquadrant.com> wrote: > On Wed, Nov 9, 2011 at 8:26 AM, Thom Brown <thom@linux.com> wrote: >> >> So will this be revived any time soon? Were there any subsequent >> proposals which were posted? >> > > there was an updated patch, you can find in this thread: > http://archives.postgresql.org/message-id/20100114181323.9A33.52131E4D@oss.ntt.co.jp > > not sure what happens after that. I reviewed a later version here: http://archives.postgresql.org/pgsql-hackers/2010-07/msg00183.php -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 9 November 2011 15:15, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Nov 9, 2011 at 9:35 AM, Jaime Casanova <jaime@2ndquadrant.com> wrote: >> On Wed, Nov 9, 2011 at 8:26 AM, Thom Brown <thom@linux.com> wrote: >>> >>> So will this be revived any time soon? Were there any subsequent >>> proposals which were posted? >>> >> >> there was an updated patch, you can find in this thread: >> http://archives.postgresql.org/message-id/20100114181323.9A33.52131E4D@oss.ntt.co.jp >> >> not sure what happens after that. > > I reviewed a later version here: > > http://archives.postgresql.org/pgsql-hackers/2010-07/msg00183.php Ah yes, I've located a reference to this on the wiki now. No wiki updates needed. I guess it's a matter of whether Takahiro-san has the time and desire to pick this up again any time soon. Whenever I cross the topic of partitioning in PostgreSQL, it's always a tad embarrassing to explain that it's still hacky compared to other database systems (and this came up again last week), so this is of particular interest to me. At the moment there's no testing required as it's returned with feedback, but I'm very willing to help assist in testing it should this return to the fore again. The idea of getting both this and materialised views in time for 9.3 is extremely appealing; a performance release (9.2) followed by a usability release. -- Thom Brown Twitter: @darkixion IRC (freenode): dark_ixion Registered Linux user: #516935 EnterpriseDB UK: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Thom Brown <thom@linux.com> writes: > Whenever I cross the topic of > partitioning in PostgreSQL, it's always a tad embarrassing to explain > that it's still hacky compared to other database systems (and this > came up again last week), so this is of particular interest to me. At The more I think about this problem, the more I think that the reason why we still don't have declarative partitioning is that it basically sucks. Other vendors offer it because they couldn't do better, but they are just asking the DBA to implement a service the system should be able to care for itself. Who knows better than PostgreSQL which part of the data are the most often queried and how to best lay them out on disk to ensure QoS? If you think that's the DBA, go ask Tom to implement query hints… More seriously, partitioning in PostgreSQL could be mostly transparent to the users and "just working": it's all about data storage locality and we already have a sub-relation storage model. By using segment exclusion and some level of automatic clustering (physical organization) of data, we could have all the benefits of partitioning without the hassle of maintaining yet another explicit level of data definition. In particular, what part of the declarative partitioning system is intended to take care about creating the right partitionsbefore new data are sent to them? In a first implementation, we could decide to "partition" the data over an index that the DBA has to pick, and then maintain a "segment index" which is a map of the data distribution in the physical segments, for the indexed columns. The granularity could be different and maybe dynamic so that you start at a block level map index for very little tables and grow up to a segment map index for huge tables that require thousands of segments, 1GB each. Then the system have to organize data modifications so that it optimizes the ranges to be the smallest possible on each map entry. And the executor then has to be able to consult that map at query time and exclude whole segments of data (or blocks for little tables) when the segment indexing is able to exclude data. With some tricks because we realize that depending on the size of the portions you skip you might not benefit from moving the head on the platter in another way than what the ongoing seqscan does, but we already have GUCs about that. We might also need some internal facilities to lock out per segment (or rather "map entries") rather than per table so that we have something like a WHERE clause support for TRUNCATE. Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
On 9 November 2011 21:05, Dimitri Fontaine <dimitri@2ndquadrant.fr> wrote: > Thom Brown <thom@linux.com> writes: >> Whenever I cross the topic of >> partitioning in PostgreSQL, it's always a tad embarrassing to explain >> that it's still hacky compared to other database systems (and this >> came up again last week), so this is of particular interest to me. At > > The more I think about this problem, the more I think that the reason > why we still don't have declarative partitioning is that it basically > sucks. Other vendors offer it because they couldn't do better, but they > are just asking the DBA to implement a service the system should be able > to care for itself. > > Who knows better than PostgreSQL which part of the data are the most > often queried and how to best lay them out on disk to ensure QoS? If > you think that's the DBA, go ask Tom to implement query hints… That also sounds like an argument against indexes. Since the system knows which parts of data are most often queried, surely it should be managing indexes, not the DBA? I imagine structuring data after the fact would involve rewriting data whereas planning for upcoming data is something DBAs are expected to do anyway using constraints, triggers, indexes etc. And as it stands, what the DBA has to do at the moment is to exploit table inheritance, apply constraints to each of the child tables (mainly for constraint exclusion), then create a trigger to support various types of update/insert/delete. So what we have now is very un-user-friendly, and tends to surprise end users. The flow of data isn't part of the table definition, it's custom-programmed into an event on the parent table. And partitioning may not just be about performance, it can be about organising data and making it more manageable. Although I agree that if it came to partitioning across nodes for parallel access, PostgreSQL could be in a position to make decisions about how that is distributed. > More seriously, partitioning in PostgreSQL could be mostly transparent > to the users and "just working": it's all about data storage locality > and we already have a sub-relation storage model. By using segment > exclusion and some level of automatic clustering (physical organization) > of data, we could have all the benefits of partitioning without the > hassle of maintaining yet another explicit level of data definition. That could be unworkable in a high-load OLTP environment. > In particular, what part of the declarative partitioning system is > intended to take care about creating the right partitions before new > data are sent to them? > > In a first implementation, we could decide to "partition" the data over > an index that the DBA has to pick, and then maintain a "segment index" > which is a map of the data distribution in the physical segments, for > the indexed columns. The granularity could be different and maybe > dynamic so that you start at a block level map index for very little > tables and grow up to a segment map index for huge tables that require > thousands of segments, 1GB each. > > Then the system have to organize data modifications so that it optimizes > the ranges to be the smallest possible on each map entry. And the > executor then has to be able to consult that map at query time and > exclude whole segments of data (or blocks for little tables) when the > segment indexing is able to exclude data. With some tricks because we > realize that depending on the size of the portions you skip you might > not benefit from moving the head on the platter in another way than what > the ongoing seqscan does, but we already have GUCs about that. > > We might also need some internal facilities to lock out per segment (or > rather "map entries") rather than per table so that we have something > like a WHERE clause support for TRUNCATE. Would this solve the same set of problems that partitioning attempts to address? And what about the example case of quarterly data? In your proposed design could you drop an entire set of data without a DELETE? And maybe I'm not looking at it from the right angle. (probably) Although I appreciate some thought is needed about how useful partitioning implementations in other database systems really are. And now to demonstrate some additional ignorance on my part... does the standard cover this? -- Thom Brown Twitter: @darkixion IRC (freenode): dark_ixion Registered Linux user: #516935 EnterpriseDB UK: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 11/09/2011 04:05 PM, Dimitri Fontaine wrote: > The more I think about this problem, the more I think that the reason > why we still don't have declarative partitioning is that it basically > sucks. I think that we don't have it because no one has ever dumped the much larger than might be expected amount of time into pulling all the pieces together and smoothing out the rough parts. I don't think there's any design thinking leap needed over what's already been worked out. Just a lot of work to get all the edge cases right on the simplest possible thing that is useful. The path to reach something that could be considered for commit includes something like this set of things: 1) Add partitioning catalog support 2) Create new syntax for partitioning that writes to the catalog 3) Decide how to represent partition data in memory 4) Route new INSERTed data into the right place 5) Support moving UPDATEd data into a new partition 6) Handle COPY usefully The last rev of this submitted was still working through (1) here, i.e. this review from Robert: http://archives.postgresql.org/message-id/AANLkTikP-1_8B04eyIK0sDf8uA5KMo64o8sorFBZE_CT@mail.gmail.com And there's a whole pile of issues I don't think have been fully explored about even the most basic case. How to handle ALTER to these structures cleanly, locking, etc.. I don't think it's possible to design such that you skip a large portion of these details; someone needs to put some number of spend weeks+ getting them all right instead. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us
On Wed, Nov 9, 2011 at 1:05 PM, Dimitri Fontaine <dimitri@2ndquadrant.fr> wrote: > Thom Brown <thom@linux.com> writes: >> Whenever I cross the topic of >> partitioning in PostgreSQL, it's always a tad embarrassing to explain >> that it's still hacky compared to other database systems (and this >> came up again last week), so this is of particular interest to me. At > > The more I think about this problem, the more I think that the reason > why we still don't have declarative partitioning is that it basically > sucks. Other vendors offer it because they couldn't do better, but they > are just asking the DBA to implement a service the system should be able > to care for itself. Your last sentence seems to be a better description of PostgreSQL's current implementation of partitioning via inheritance and constraints and triggers. Partitioning is a service the system should be able to care for itself, even if it does need the DBA to declare it. And why shouldn't it need a DBA to declare it? How is the system supposed to anticipate that at some point years in the future I will want to run the command sequence "create foo_archive as select from foo where year<2009; delete from foo where year<2009", or its partition-based equivalent, and have it operate on several billion rows cleanly and quickly? I don't think we can expect the system to anticipate what it has never before experienced. This is the DBA's job. > Who knows better than PostgreSQL which part of the data are the most > often queried and how to best lay them out on disk to ensure QoS? If > you think that's the DBA, go ask Tom to implement query hints… > > More seriously, partitioning in PostgreSQL could be mostly transparent > to the users and "just working": it's all about data storage locality > and we already have a sub-relation storage model. By using segment > exclusion and some level of automatic clustering (physical organization) > of data, we could have all the benefits of partitioning without the > hassle of maintaining yet another explicit level of data definition. While automatic clustering would be nice, it isn't the same thing as partitioning. Cheers, Jeff
Jeff Janes <jeff.janes@gmail.com> writes: > shouldn't it need a DBA to declare it? How is the system supposed to > anticipate that at some point years in the future I will want to run > the command sequence "create foo_archive as select from foo where > year<2009; delete from foo where year<2009", or its partition-based > equivalent, and have it operate on several billion rows cleanly and > quickly? I don't think we can expect the system to anticipate what it > has never before experienced. This is the DBA's job. Well, the not-fully spelled out proposal would be to still work it out from a list of columns picked by the DBA. I though that an existing index would be best, but maybe just columns would be good. I guess it's already time to play loose and invent some SQL syntax to make it easier talking about the same thing: ALTER TABLE foo SEGMENT ON (year, stamp); Now the aim would be to be able to implement the operation you describe by using the new segment map, which is an index pointing to sequential ranges of on-disk blocks where the data is known to share a common key range over the columns you're segmenting on. I would imagine this SQL: TRUNCATE foo WHERE year < 2009; As the on-disk location of the data that qualify this WHERE clause is known, it could be possible to (predicate) lock it and bulk remove it, unlinking whole segments (1GB) at a time when relevant. > While automatic clustering would be nice, it isn't the same thing as > partitioning. That has been my initial reaction to that kind of ideas too. After some more time brewing the ideas, I'm not convinced that the use cases that usually drives you to the latter can't be solved with the former. Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
On Thu, Nov 10, 2011 at 1:19 PM, Dimitri Fontaine <dimitri@2ndquadrant.fr> wrote: > Now the aim would be to be able to implement the operation you describe > by using the new segment map, which is an index pointing to sequential > ranges of on-disk blocks where the data is known to share a common key > range over the columns you're segmenting on. I would imagine this SQL: > > TRUNCATE foo WHERE year < 2009; > > As the on-disk location of the data that qualify this WHERE clause is > known, it could be possible to (predicate) lock it and bulk remove it, > unlinking whole segments (1GB) at a time when relevant. I am basically in agreement with you. After wanting better partitioning (Oracle-style) in Postgres for some time just to be free of the mechanically painful table-inheritance version, I have come around to thinking it's basically a bad idea, but one that with a little bit of finessing can be made a good idea. The reason I have started to think this is because of an old feature that works very well: CREATE INDEX. In spite of what people might think, I think it's pretty clear that CREATE INDEX is not DDL: it's actually physical advice to the system. I have seen the fourth-generation-language promise delivered upon quite a few times in production, now: we witness an access pattern that becomes problematic, we run CREATE INDEX CONCURRENTLY, the problem is solved without any change to the application, and the index definition is backported to our application bootstrapping process. It would be hard for me to understate how valuable this has been to avoid both premature optimization and excessive panic when dealing with change. Similar to the overall project stance on query hints, I don't think Postgres should retreat on its ground from being a 4GL system. I think both indexes and a hypothetical partitioning feature should be clearly isolated as directives to the system about how to physically organize and access data, and any partitioning feature that creates new relation namespace entries and expects you to manipulate them to gain the benefits seems like extra, non-desirable surface area to me. I think this becomes especially apparent once one considers on-line repartitioning (I am exposing a bias here, but any feature in Postgres that cannot be done concurrently -- like VACUUM FULL -- is very dangerous to both me and my customers, whereas it may not be useless or dangerous to a build-your-own data warehouse). It feels like it would be desirable to have the physical partitions exist in an inconsistent-state whereby they are being brought into alignment with the newly desired physical description. Finally, I think a legitimate objection to this inclination is that it can be really easy to issue a DELETE that is usually fast, but when any mistake or change creeps in becomes very slow: I have heard from some friends making heavy use of table partitioning via inheritance that one of the problems is not quite exactly matching the table constraint, and then hosing their hardware. As a result, they mangle partitions explicitly in the application to prevent foot-gunning. That's clearly lame (and they know it), but I think may indicate a need to instead allow for some kind of physical-access-method assertion checking quite apart from the logical content of the query that can deliver a clear, crisp error to application developers if a preferred access pattern is not usable. My experience suggests that while solving problems is good, turning problems into flat-out errors is *nearly* as good, and worth some more investigation. -- fdr
On Thu, Nov 10, 2011 at 10:19:02PM +0100, Dimitri Fontaine wrote: > Now the aim would be to be able to implement the operation you describe > by using the new segment map, which is an index pointing to sequential > ranges of on-disk blocks where the data is known to share a common key > range over the columns you're segmenting on. I would imagine this SQL: > > TRUNCATE foo WHERE year < 2009; > > As the on-disk location of the data that qualify this WHERE clause is > known, it could be possible to (predicate) lock it and bulk remove it, > unlinking whole segments (1GB) at a time when relevant. While I agree that explicit partitioning is somewhat of a hack, it's a really useful hack. But for me the most important use of partitioning is "dropping a billion rows efficiently and getting the disk space back". And the biggest problem is always that dropping blocks of a table requires fixing all the indexes. For fixing the index of the partition key it's a simpler problem, you could probably prune the btree relatively efficiently. But for all other indexes there's no better solution than walk the entire index. However, in the very special case where the drop boundaries explicitly match the dataset, you can simply drop all the indexes. Now, if someone cames up with an efficient way to drop a huge number of rows quickly, then I admit one of the major issues is fixed. But recovering the disk space is much harder. Yes, recent versions of Linux come with ways to punch holes in existing files, but that doesn't make it quick or efficient. > > While automatic clustering would be nice, it isn't the same thing as > > partitioning. > > That has been my initial reaction to that kind of ideas too. After some > more time brewing the ideas, I'm not convinced that the use cases that > usually drives you to the latter can't be solved with the former. I hope so, but I'm not sure I'd like partitioning support to wait on someone hitting on the right idea. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > He who writes carelessly confesses thereby at the very outset that he does > not attach much importance to his own thoughts. -- Arthur Schopenhauer
Martijn van Oosterhout <kleptog@svana.org> writes: > While I agree that explicit partitioning is somewhat of a hack, it's a > really useful hack. But for me the most important use of partitioning > is "dropping a billion rows efficiently and getting the disk space > back". Right. The only way to make that speedy is for the partition boundaries to match the desired granularity of data removal. I don't really see any way that the database can be expected to know what that is, unless it's told in advance. So AFAICS you really have to have a declarative way of telling it how to do the partitioning --- it's not going to be able to infer that automatically. regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes: > to match the desired granularity of data removal. I don't really see > any way that the database can be expected to know what that is, unless > it's told in advance. So AFAICS you really have to have a declarative > way of telling it how to do the partitioning --- it's not going to be > able to infer that automatically. Yes, I'm taking that back. Declarative is not the same thing as explicit partitioning though, that "index" like physical map is declarative too, e.g. Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
Martijn van Oosterhout <kleptog@svana.org> writes: > While I agree that explicit partitioning is somewhat of a hack, it's a > really useful hack. But for me the most important use of partitioning > is "dropping a billion rows efficiently and getting the disk space > back". And the biggest problem is always that dropping blocks of a > table requires fixing all the indexes. The problem with partitions that are in fact table is that the index are separated and you can't enforce unique globally in the partition set. Even with that physical map idea (segment based partitioning, but allowing a finer control than segments), you could still maintain any number of partial indexes, but still use a single primary key e.g. > However, in the very special case where the drop boundaries explicitly > match the dataset, you can simply drop all the indexes. That's the idea with partial indexes too, right? > Now, if someone cames up with an efficient way to drop a huge number of > rows quickly, then I admit one of the major issues is fixed. But > recovering the disk space is much harder. Yes, recent versions of > Linux come with ways to punch holes in existing files, but that doesn't > make it quick or efficient. If you happen to drop a part of the data that fits in one or more segments (and with a decent fillfactor you need less than 1GB to get there), then you can unlink() whole files at a time. That would be the goal here. > I hope so, but I'm not sure I'd like partitioning support to wait on > someone hitting on the right idea. I would think that's exactly what's been happening to us for several years already. Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
On Mon, Nov 14, 2011 at 10:33:36AM +0100, Dimitri Fontaine wrote: > Martijn van Oosterhout <kleptog@svana.org> writes: > > While I agree that explicit partitioning is somewhat of a hack, it's a > > really useful hack. But for me the most important use of partitioning > > is "dropping a billion rows efficiently and getting the disk space > > back". And the biggest problem is always that dropping blocks of a > > table requires fixing all the indexes. > > The problem with partitions that are in fact table is that the index are > separated and you can't enforce unique globally in the partition set. > > Even with that physical map idea (segment based partitioning, but > allowing a finer control than segments), you could still maintain any > number of partial indexes, but still use a single primary key e.g. Ah, well, if you can come up with a way to get the advantages of partition while still being able to enforce primary keys over partitions, that would be A Really Cool Idea. That said, I still don't see how you can enforce a unique index over multiple segments over something other than the partition key while still allowing quick dropping of segments. If you can fix that you can make it work for the current inheritence-style partitioning. > If you happen to drop a part of the data that fits in one or more > segments (and with a decent fillfactor you need less than 1GB to get > there), then you can unlink() whole files at a time. That would be the > goal here. I feel uncomfortable with the "happen to". You can add the magic too, but for scripting purposes I'd feel better if it could be done via DDL also. That way typos don't end up being 5 day queries all of a sudden. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > He who writes carelessly confesses thereby at the very outset that he does > not attach much importance to his own thoughts. -- Arthur Schopenhauer
Martijn van Oosterhout <kleptog@svana.org> writes: > That said, I still don't see how you can enforce a unique index over > multiple segments over something other than the partition key while > still allowing quick dropping of segments. If you can fix that you can > make it work for the current inheritence-style partitioning. Well the Primary Key and the Physical Map Index do not need to be on the same set of columns. >> If you happen to drop a part of the data that fits in one or more >> segments (and with a decent fillfactor you need less than 1GB to get >> there), then you can unlink() whole files at a time. That would be the >> goal here. > > I feel uncomfortable with the "happen to". You can add the magic too, > but for scripting purposes I'd feel better if it could be done via DDL > also. That way typos don't end up being 5 day queries all of a sudden. If the data fills less than a segment then you can't unlink() the file, you have to mark the tuples / pages as free space. If you have a partial index matching the whole portion of data you're removing, you can still drop it before hand — or maybe the system can be instructed to do so? Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support