Thread: Partitioning feature ...
Hi,
We are implementing table partitioning feature to support Range and Hash partitions. Please find attached, the WIP patch and test-cases.
The syntax used conforms to most of the suggestions mentioned in http://archives.postgresql.org/pgsql-hackers/2008-01/msg00413.php, barring the following:
-- Specification of partition names is optional. System will be able to generate partition names in such cases.
-- Sub partitioning
We are maintaining a system catalog(pg_partition) for partition meta-data. System will look-up this table to find appropriate partition to operate on.
System internally uses low-level 'C' triggers to row-movement.
Regards,
--
Kedar.
We are implementing table partitioning feature to support Range and Hash partitions. Please find attached, the WIP patch and test-cases.
The syntax used conforms to most of the suggestions mentioned in http://archives.postgresql.org/pgsql-hackers/2008-01/msg00413.php, barring the following:
-- Specification of partition names is optional. System will be able to generate partition names in such cases.
-- Sub partitioning
We are maintaining a system catalog(pg_partition) for partition meta-data. System will look-up this table to find appropriate partition to operate on.
System internally uses low-level 'C' triggers to row-movement.
Regards,
--
Kedar.
Attachment
Hi Kedar,
http://www.enterprisedb.com
The syntax used conforms to most of the suggestions mentioned in http://archives.postgresql.org/pgsql-hackers/2008-01/msg00413.php, barring the following:
-- Specification of partition names is optional. System will be able to generate partition names in such cases.
-- Sub partitioning
I was wondering if there is a need to mention the type of partition while dropping it.
E.g
ALTER table x DROP RANGE PARTITION x_part;
The type of partition (RANGE, HASH) could be dropped according to me.
We are maintaining a system catalog(pg_partition) for partition meta-data. System will look-up this table to find appropriate partition to operate on.
System internally uses low-level 'C' triggers to row-movement.
Can you elaborate more on how do you handle updates with these triggers?
Regards,
Nikhils
--
Hi Nikhil,
Update operation is performed as a combination of 'delete' and 'insert'.
In Update trigger, the row is deleted from relation according to it's 'ctid'. A look-up on system catalog for partitions is performed to identify the target table by evaluating values of partition-key attributes, of the given row. The constraints of this target table are evaluated for this new row and if found valid, the row is inserted.
Regards,
--
Kedar.
Update operation is performed as a combination of 'delete' and 'insert'.
In Update trigger, the row is deleted from relation according to it's 'ctid'. A look-up on system catalog for partitions is performed to identify the target table by evaluating values of partition-key attributes, of the given row. The constraints of this target table are evaluated for this new row and if found valid, the row is inserted.
Regards,
--
Kedar.
On Mon, Mar 23, 2009 at 5:09 PM, Nikhil Sontakke <nikhil.sontakke@enterprisedb.com> wrote:
Hi Kedar,http://www.enterprisedb.com
The syntax used conforms to most of the suggestions mentioned in http://archives.postgresql.org/pgsql-hackers/2008-01/msg00413.php, barring the following:
-- Specification of partition names is optional. System will be able to generate partition names in such cases.
-- Sub partitioningI was wondering if there is a need to mention the type of partition while dropping it.E.gALTER table x DROP RANGE PARTITION x_part;The type of partition (RANGE, HASH) could be dropped according to me.
We are maintaining a system catalog(pg_partition) for partition meta-data. System will look-up this table to find appropriate partition to operate on.
System internally uses low-level 'C' triggers to row-movement.Can you elaborate more on how do you handle updates with these triggers?Regards,Nikhils--
Hi Kedar, First of all, congratulations for the excellent work. I have some comments and questions. In get_relevent_partition (btw, relevant is spelled with an a) you are maintaining 2 lists. I guess this is only useful for multi-column partitions, right? If you have a single column partition (without subpartitions), I think you could directly return on the first match (without maintaining any list) since you guarantee that there is no overlap between partitions. A simple but effective optimization for inserts consists of caching the last partition used (consecutive inserts often go to the same partition) and try it first before going through the whole loop. The update trigger should first check if the tuple needs to be moved. If the updated tuple still matches the constraints of the partitions it will not have to be moved and will save a lot of overhead. The COPY operation should probably be optimized to use the same code as the one in the insert trigger for partitioned tables. I guess some code could be factorized in COPY to make the inserts more efficient. The current trigger approach should prevent other triggers to be added to the table, or you should make sure that the partition trigger is always the one to execute last. As we don't have automatic partition creation, it would be interesting to have an optional mechanism to deal with tuples that don't match any partition (very useful when you do bulk insert and some new data require a new partition). Having a simple overflow partition or an error logging mechanism would definitely help to identify these tuples and prevent things like large COPY operations to fail. Looking forward to your responses, Emmanuel > > We are implementing table partitioning feature to support Range and > Hash partitions. Please find attached, the WIP patch and test-cases. > > The syntax used conforms to most of the suggestions mentioned in > http://archives.postgresql.org/pgsql-hackers/2008-01/msg00413.php, > barring the following: > -- Specification of partition names is optional. System will be able > to generate partition names in such cases. > -- Sub partitioning > > We are maintaining a system catalog(pg_partition) for partition > meta-data. System will look-up this table to find appropriate > partition to operate on. > System internally uses low-level 'C' triggers to row-movement. > > Regards, > -- > Kedar. > > > > ------------------------------------------------------------------------ > > > -- Emmanuel Cecchet Aster Data Systems Web: http://www.asterdata.com
Hi Emmanuel,
Thanks for your time. This is a WIP patch and we will integrate your suggestions/comments as appropriate.
Regards,
--
Kedar.
Thanks for your time. This is a WIP patch and we will integrate your suggestions/comments as appropriate.
Regards,
--
Kedar.
On Fri, Mar 27, 2009 at 3:38 AM, Emmanuel Cecchet <manu@asterdata.com> wrote:
Hi Kedar,
First of all, congratulations for the excellent work.
I have some comments and questions.
In get_relevent_partition (btw, relevant is spelled with an a) you are maintaining 2 lists.
> Oops! 'a' typographical error.
I guess this is only useful for multi-column partitions, right?
If you have a single column partition (without subpartitions), I think you could directly return on the first match (without maintaining any list) since you guarantee that there is no overlap between partitions.
A simple but effective optimization for inserts consists of caching the last partition used (consecutive inserts often go to the same partition) and try it first before going through the whole loop.
> Yep.
The update trigger should first check if the tuple needs to be moved. If the updated tuple still matches the constraints of the partitions it will not have to be moved and will save a lot of overhead.
> Yes. We agree on that.
The COPY operation should probably be optimized to use the same code as the one in the insert trigger for partitioned tables. I guess some code could be factorized in COPY to make the inserts more efficient.
The current trigger approach should prevent other triggers to be added to the table, or you should make sure that the partition trigger is always the one to execute last.
> As triggers are executed in order of their names, we've prefixed the trigger names with "zz". This should work fine as long as no-one uses trigger-name which starts with "zz".
As we don't have automatic partition creation, it would be interesting to have an optional mechanism to deal with tuples that don't match any partition (very useful when you do bulk insert and some new data require a new partition). Having a simple overflow partition or an error logging mechanism would definitely help to identify these tuples and prevent things like large COPY operations to fail.
> Will get back on this.
Looking forward to your responses,
Emmanuel------------------------------------------------------------------------
We are implementing table partitioning feature to support Range and Hash partitions. Please find attached, the WIP patch and test-cases.
The syntax used conforms to most of the suggestions mentioned in http://archives.postgresql.org/pgsql-hackers/2008-01/msg00413.php, barring the following:
-- Specification of partition names is optional. System will be able to generate partition names in such cases.
-- Sub partitioning
We are maintaining a system catalog(pg_partition) for partition meta-data. System will look-up this table to find appropriate partition to operate on.
System internally uses low-level 'C' triggers to row-movement.
Regards,
--
Kedar.
--
Emmanuel Cecchet
Aster Data Systems
Web: http://www.asterdata.com
On Mon, Mar 30, 2009 at 6:51 AM, Kedar Potdar <kedar.potdar@gmail.com> wrote: > >> As triggers are executed in order of their names, we've prefixed the >> trigger names with "zz". This should work fine as long as no-one uses >> trigger-name which starts with "zz". >> this seems a lot fragile... why system generated triggers has to be executed following the same rules (talking about order of execution) as user triggers? can't we simply execute them first or last or maybe be clever and mark one to be executed first and others last? -- Atentamente, Jaime Casanova Soporte y capacitación de PostgreSQL Asesoría y desarrollo de sistemas Guayaquil - Ecuador Cel. +59387171157
I agree with Jaime that system triggers should execute independently of user triggers. In the particular case of partitioning, the system trigger should execute after the user triggers. However, as the partitioning trigger is a row level trigger, it is not clear what is going to happen with user triggers that work at the statement level. Emmanuel Jaime Casanova wrote: > On Mon, Mar 30, 2009 at 6:51 AM, Kedar Potdar <kedar.potdar@gmail.com> wrote: > >>> As triggers are executed in order of their names, we've prefixed the >>> trigger names with "zz". This should work fine as long as no-one uses >>> trigger-name which starts with "zz". >>> >>> > > this seems a lot fragile... why system generated triggers has to be > executed following the same rules (talking about order of execution) > as user triggers? can't we simply execute them first or last or maybe > be clever and mark one to be executed first and others last? > > -- Emmanuel Cecchet FTO @ Frog Thinker Open Source Development & Consulting -- Web: http://www.frogthinker.org email: manu@frogthinker.org Skype: emmanuel_cecchet
Hi,
AFAICS, we do not have any category like system triggers. So yeah, it would have been nice to generate triggers with names (starting with __ for example) for such special triggers. But I don't think we disallow user-triggers starting with underscores etc.
So some of the options could be:
- to add a new column in pg_trigger to indicate special or system triggers which can be executed last (sorted order if multiple entries)
or
- invent a prefix "__partition__" or something and disallow user triggers to use such a prefix for their names, plus introduce logic to execute them (again sorted order if multiple entries) last.
Regards,
Nikhils
--
http://www.enterprisedb.com
this seems a lot fragile... why system generated triggers has to be
>> As triggers are executed in order of their names, we've prefixed the
>> trigger names with "zz". This should work fine as long as no-one uses
>> trigger-name which starts with "zz".
>>
executed following the same rules (talking about order of execution)
as user triggers? can't we simply execute them first or last or maybe
be clever and mark one to be executed first and others last?
AFAICS, we do not have any category like system triggers. So yeah, it would have been nice to generate triggers with names (starting with __ for example) for such special triggers. But I don't think we disallow user-triggers starting with underscores etc.
So some of the options could be:
- to add a new column in pg_trigger to indicate special or system triggers which can be executed last (sorted order if multiple entries)
or
- invent a prefix "__partition__" or something and disallow user triggers to use such a prefix for their names, plus introduce logic to execute them (again sorted order if multiple entries) last.
Regards,
Nikhils
http://www.enterprisedb.com
Nikhil Sontakke escribió: > > >> As triggers are executed in order of their names, we've prefixed the > > >> trigger names with "zz". This should work fine as long as no-one uses > > >> trigger-name which starts with "zz". > > > > this seems a lot fragile... why system generated triggers has to be > > executed following the same rules (talking about order of execution) > > as user triggers? can't we simply execute them first or last or maybe > > be clever and mark one to be executed first and others last? > > AFAICS, we do not have any category like system triggers. So yeah, it would > have been nice to generate triggers with names (starting with __ for > example) for such special triggers. But I don't think we disallow > user-triggers starting with underscores etc. We already have system triggers -- the FK triggers. I don't think we've had all that much trouble with them. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
On Tue, Mar 31, 2009 at 9:46 AM, Alvaro Herrera <alvherre@commandprompt.com> wrote: >> >> AFAICS, we do not have any category like system triggers. So yeah, it would >> have been nice to generate triggers with names (starting with __ for >> example) for such special triggers. But I don't think we disallow >> user-triggers starting with underscores etc. > > We already have system triggers -- the FK triggers. I don't think we've > had all that much trouble with them. > yeah! but we mark them with pg_trigger.tgisconstraint, but i'm not completely convinced that we should use that same field -- Atentamente, Jaime Casanova Soporte y capacitación de PostgreSQL Asesoría y desarrollo de sistemas Guayaquil - Ecuador Cel. +59387171157
Alvaro Herrera <alvherre@commandprompt.com> writes: > Nikhil Sontakke escribi�: >>> As triggers are executed in order of their names, we've prefixed the >>> trigger names with "zz". This should work fine as long as no-one uses >>> trigger-name which starts with "zz". >> this seems a lot fragile... > We already have system triggers -- the FK triggers. I don't think we've > had all that much trouble with them. In the case of the FK triggers, it's intentional (and maybe even documented) that users should be able to place their own triggers before or after the FK triggers. Is there a good reason why partitioning triggers should be different? If there is, maybe the feature shouldn't be implemented via triggers in the first place. regards, tom lane
Tom Lane wrote: >> We already have system triggers -- the FK triggers. I don't think we've >> had all that much trouble with them. >> > > In the case of the FK triggers, it's intentional (and maybe even > documented) that users should be able to place their own triggers before > or after the FK triggers. If it's documented I think it's well hidden ;-) ISTM that the fact that we implement FK constraints via triggers is really an implementation detail, not something the user should be encouraged to mess with. > Is there a good reason why partitioning > triggers should be different? > Probably not. ISTM that the scheme should turn tgisconstraint into a multi-valued item (tgkind: 'u' = userland, 'c'= constraint, 'p' = partition or some such). cheers andrew
Yes, there is a good reason. As a trigger can update the tuple value, this can change the routing decision. If you have a user trigger that tries to change the key value after the partition choice has been made, this will lead to an integrity constraint violation which is probably not what the user expects. Note that user triggers with partitions will be tricky anyway (regardless of how partitioning is implemented, that is with triggers or not). If 2 partitions have user triggers that update the key value to bounce the tuple to the other partition you may end up with an infinite loop. I am not sure what the semantic of statement triggers (still user triggers) should be on partitioned tables. We will probably have to come up with restrictions on triggers so that they can only be applied to the parent table and not on child tables to prevent nasty issues. Emmanuel Tom Lane wrote: > In the case of the FK triggers, it's intentional (and maybe even > documented) that users should be able to place their own triggers before > or after the FK triggers. Is there a good reason why partitioning > triggers should be different? If there is, maybe the feature shouldn't > be implemented via triggers in the first place. > > regards, tom lane > -- Emmanuel Cecchet Aster Data Systems Web: http://www.asterdata.com
Emmanuel Cecchet <manu@asterdata.com> writes: > Yes, there is a good reason. As a trigger can update the tuple value, > this can change the routing decision. If you have a user trigger that > tries to change the key value after the partition choice has been made, > this will lead to an integrity constraint violation which is probably > not what the user expects. [ shrug... ] Badly written user triggers can break FK constraints, too. We've tolerated that in the past because preventing it disables useful capabilities. I remain of the opinion that if you think you *have to* execute last, you should not be writing this as a trigger; you'd be better off embedding it lower in the system. regards, tom lane
On 3/31/09 9:45 AM, Emmanuel Cecchet wrote: > Yes, there is a good reason. As a trigger can update the tuple value, > this can change the routing decision. If you have a user trigger that > tries to change the key value after the partition choice has been made, > this will lead to an integrity constraint violation which is probably > not what the user expects. Actually, it's worse. Depending on the timing of the triggers, it's possible to bypass the FK check entirely, and you can end up with inconsistent data. --Josh
Hi,
+1.
This seems to be the best way forward if we stick to triggers for partitioning. I think they appear to serve the purpose well for this use-case and maybe with this scheme they will be low-level enough too.
Regards,
Nikhils
--
http://www.enterprisedb.com
If it's documented I think it's well hidden ;-) ISTM that the fact that we implement FK constraints via triggers is really an implementation detail, not something the user should be encouraged to mess with.We already have system triggers -- the FK triggers. I don't think we've
had all that much trouble with them.
In the case of the FK triggers, it's intentional (and maybe even
documented) that users should be able to place their own triggers before
or after the FK triggers.Probably not. ISTM that the scheme should turn tgisconstraint into a multi-valued item (tgkind: 'u' = userland, 'c'= constraint, 'p' = partition or some such).Is there a good reason why partitioning
triggers should be different?
+1.
Regards,
Nikhils
--
http://www.enterprisedb.com