Home > mailing lists

Thread: Partitioning feature ...

Partitioning feature ...

From

Kedar Potdar

Date:

23 March 2009, 06:45:32

Hi,

We are implementing table partitioning feature to support Range and Hash partitions. Please find attached, the WIP patch and test-cases.

The syntax used conforms to most of the suggestions mentioned in http://archives.postgresql.org/pgsql-hackers/2008-01/msg00413.php, barring the following:
-- Specification of partition names is optional. System will be able to generate partition names in such cases.
-- Sub partitioning

We are maintaining a system catalog(pg_partition) for partition meta-data. System will look-up this table to find appropriate partition to operate on.
System internally uses low-level 'C' triggers to row-movement.

Regards,
--
Kedar.

Attachment

Re: Partitioning feature ...

From

Nikhil Sontakke

Date:

23 March 2009, 08:46:18

Hi Kedar,

The syntax used conforms to most of the suggestions mentioned in http://archives.postgresql.org/pgsql-hackers/2008-01/msg00413.php, barring the following:
-- Specification of partition names is optional. System will be able to generate partition names in such cases.
-- Sub partitioning

I was wondering if there is a need to mention the type of partition while dropping it.

E.g

ALTER table x DROP RANGE PARTITION x_part;

The type of partition (RANGE, HASH) could be dropped according to me.

We are maintaining a system catalog(pg_partition) for partition meta-data. System will look-up this table to find appropriate partition to operate on.
System internally uses low-level 'C' triggers to row-movement.

Can you elaborate more on how do you handle updates with these triggers?

Regards,

Nikhils

--

http://www.enterprisedb.com

Re: Partitioning feature ...

From

Kedar Potdar

Date:

24 March 2009, 09:16:25

Hi Nikhil,

Update operation is performed as a combination of 'delete' and 'insert'.

In Update trigger, the row is deleted from relation according to it's 'ctid'. A look-up on system catalog for partitions is performed to identify the target table by evaluating values of partition-key attributes, of the given row. The constraints of this target table are evaluated for this new row and if found valid, the row is inserted.

Regards,
--
Kedar.

On Mon, Mar 23, 2009 at 5:09 PM, Nikhil Sontakke <nikhil.sontakke@enterprisedb.com> wrote:

Hi Kedar,

The syntax used conforms to most of the suggestions mentioned in http://archives.postgresql.org/pgsql-hackers/2008-01/msg00413.php, barring the following:
-- Specification of partition names is optional. System will be able to generate partition names in such cases.
-- Sub partitioning

I was wondering if there is a need to mention the type of partition while dropping it.

E.g
ALTER table x DROP RANGE PARTITION x_part;

The type of partition (RANGE, HASH) could be dropped according to me.

We are maintaining a system catalog(pg_partition) for partition meta-data. System will look-up this table to find appropriate partition to operate on.
System internally uses low-level 'C' triggers to row-movement.

Can you elaborate more on how do you handle updates with these triggers?

Regards,
Nikhils
--
http://www.enterprisedb.com

Re: Partitioning feature ...

From

Emmanuel Cecchet

Date:

26 March 2009, 19:08:10

Hi Kedar,

First of all, congratulations for the excellent work.
I have some comments and questions.

In get_relevent_partition (btw, relevant is spelled with an a) you are 
maintaining 2 lists. I guess this is only useful for multi-column 
partitions, right?
If you have a single column partition (without subpartitions), I think 
you could directly return on the first match (without maintaining any 
list) since you guarantee that there is no overlap between partitions.
A simple but effective optimization for inserts consists of caching the 
last partition used (consecutive inserts often go to the same partition) 
and try it first before going through the whole loop.

The update trigger should first check if the tuple needs to be moved. If 
the updated tuple still matches the constraints of the partitions it 
will not have to be moved and will save a lot of overhead.

The COPY operation should probably be optimized to use the same code as 
the one in the insert trigger for partitioned tables. I guess some code 
could be factorized in COPY to make the inserts more efficient.

The current trigger approach should prevent other triggers to be added 
to the table, or you should make sure that the partition trigger is 
always the one to execute last.

As we don't have automatic partition creation, it would be interesting 
to have an optional mechanism to deal with tuples that don't match any 
partition (very useful when you do bulk insert and some new data require 
a new partition). Having a simple overflow partition or an error logging 
mechanism would definitely help to identify these tuples and prevent 
things like large COPY operations to fail.


Looking forward to your responses,
Emmanuel

>
> We are implementing table partitioning feature to support Range and 
> Hash partitions. Please find attached, the WIP patch and test-cases.
>
> The syntax used conforms to most of the suggestions mentioned in 
> http://archives.postgresql.org/pgsql-hackers/2008-01/msg00413.php, 
> barring the following:
> -- Specification of partition names is optional. System will be able 
> to generate partition names in such cases.
> -- Sub partitioning
>
> We are maintaining a system catalog(pg_partition) for partition 
> meta-data. System will look-up this table to find appropriate 
> partition to operate on.
> System internally uses low-level 'C' triggers to row-movement.
>
> Regards,
> --
> Kedar.
>
>
>
> ------------------------------------------------------------------------
>
>
>   


-- 
Emmanuel Cecchet
Aster Data Systems
Web: http://www.asterdata.com

Re: Partitioning feature ...

From

Kedar Potdar

Date:

30 March 2009, 09:16:42

Hi Emmanuel,

Thanks for your time. This is a WIP patch and we will integrate your suggestions/comments as appropriate.

Regards,
--
Kedar.

On Fri, Mar 27, 2009 at 3:38 AM, Emmanuel Cecchet <manu@asterdata.com> wrote:

Hi Kedar,

First of all, congratulations for the excellent work.
I have some comments and questions.

In get_relevent_partition (btw, relevant is spelled with an a) you are maintaining 2 lists.

> Oops! 'a' typographical error.

I guess this is only useful for multi-column partitions, right?
If you have a single column partition (without subpartitions), I think you could directly return on the first match (without maintaining any list) since you guarantee that there is no overlap between partitions.
A simple but effective optimization for inserts consists of caching the last partition used (consecutive inserts often go to the same partition) and try it first before going through the whole loop.

> Yep.

The update trigger should first check if the tuple needs to be moved. If the updated tuple still matches the constraints of the partitions it will not have to be moved and will save a lot of overhead.

> Yes. We agree on that.

The COPY operation should probably be optimized to use the same code as the one in the insert trigger for partitioned tables. I guess some code could be factorized in COPY to make the inserts more efficient.

The current trigger approach should prevent other triggers to be added to the table, or you should make sure that the partition trigger is always the one to execute last.

> As triggers are executed in order of their names, we've prefixed the trigger names with "zz". This should work fine as long as no-one uses trigger-name which starts with "zz".

As we don't have automatic partition creation, it would be interesting to have an optional mechanism to deal with tuples that don't match any partition (very useful when you do bulk insert and some new data require a new partition). Having a simple overflow partition or an error logging mechanism would definitely help to identify these tuples and prevent things like large COPY operations to fail.

> Will get back on this.

Looking forward to your responses,
Emmanuel

We are implementing table partitioning feature to support Range and Hash partitions. Please find attached, the WIP patch and test-cases.

The syntax used conforms to most of the suggestions mentioned in http://archives.postgresql.org/pgsql-hackers/2008-01/msg00413.php, barring the following:
-- Specification of partition names is optional. System will be able to generate partition names in such cases.
-- Sub partitioning

We are maintaining a system catalog(pg_partition) for partition meta-data. System will look-up this table to find appropriate partition to operate on.
System internally uses low-level 'C' triggers to row-movement.

Regards,
--
Kedar.

------------------------------------------------------------------------

--
Emmanuel Cecchet
Aster Data Systems
Web: http://www.asterdata.com

Re: Partitioning feature ...

From

Jaime Casanova

Date:

30 March 2009, 11:59:14

On Mon, Mar 30, 2009 at 6:51 AM, Kedar Potdar <kedar.potdar@gmail.com> wrote:
>
>>  As triggers are executed in order of their names, we've prefixed the
>> trigger names with "zz". This should work fine as long as no-one uses
>> trigger-name which starts with "zz".
>>

this seems a lot fragile... why system generated triggers has to be
executed following the same rules (talking about order of execution)
as user triggers? can't we simply execute them first or last or maybe
be clever and mark one to be executed first and others last?

--
Atentamente,
Jaime Casanova
Soporte y capacitación de PostgreSQL
Asesoría y desarrollo de sistemas
Guayaquil - Ecuador
Cel. +59387171157

Re: Partitioning feature ...

From

Emmanuel Cecchet

Date:

30 March 2009, 12:49:38

I agree with Jaime that system triggers should execute independently of 
user triggers.
In the particular case of partitioning, the system trigger should 
execute after the user triggers. However, as the partitioning trigger is 
a row level trigger, it is not clear what is going to happen with user 
triggers that work at the statement level.

Emmanuel

Jaime Casanova wrote:
> On Mon, Mar 30, 2009 at 6:51 AM, Kedar Potdar <kedar.potdar@gmail.com> wrote:
>   
>>>   As triggers are executed in order of their names, we've prefixed the
>>> trigger names with "zz". This should work fine as long as no-one uses
>>> trigger-name which starts with "zz".
>>>
>>>       
>
> this seems a lot fragile... why system generated triggers has to be
> executed following the same rules (talking about order of execution)
> as user triggers? can't we simply execute them first or last or maybe
> be clever and mark one to be executed first and others last?
>
>   

-- 
Emmanuel Cecchet
FTO @ Frog Thinker 
Open Source Development & Consulting
--
Web: http://www.frogthinker.org
email: manu@frogthinker.org
Skype: emmanuel_cecchet

Re: Partitioning feature ...

From

Nikhil Sontakke

Date:

31 March 2009, 11:10:42

Hi,

>> As triggers are executed in order of their names, we've prefixed the
>> trigger names with "zz". This should work fine as long as no-one uses
>> trigger-name which starts with "zz".
>>

this seems a lot fragile... why system generated triggers has to be
executed following the same rules (talking about order of execution)
as user triggers? can't we simply execute them first or last or maybe
be clever and mark one to be executed first and others last?

AFAICS, we do not have any category like system triggers. So yeah, it would have been nice to generate triggers with names (starting with __ for example) for such special triggers. But I don't think we disallow user-triggers starting with underscores etc.

So some of the options could be:

- to add a new column in pg_trigger to indicate special or system triggers which can be executed last (sorted order if multiple entries)

or

- invent a prefix "__partition__" or something and disallow user triggers to use such a prefix for their names, plus introduce logic to execute them (again sorted order if multiple entries) last.

Regards,
Nikhils

--
http://www.enterprisedb.com

Re: Partitioning feature ...

From

Alvaro Herrera

Date:

31 March 2009, 11:47:15

Nikhil Sontakke escribió:

> > >>  As triggers are executed in order of their names, we've prefixed the
> > >> trigger names with "zz". This should work fine as long as no-one uses
> > >> trigger-name which starts with "zz".
> >
> > this seems a lot fragile... why system generated triggers has to be
> > executed following the same rules (talking about order of execution)
> > as user triggers? can't we simply execute them first or last or maybe
> > be clever and mark one to be executed first and others last?
> 
> AFAICS, we do not have any category like system triggers. So yeah, it would
> have been nice to generate triggers with names (starting with __ for
> example) for such special triggers. But I don't think we disallow
> user-triggers starting with underscores etc.

We already have system triggers -- the FK triggers.  I don't think we've
had all that much trouble with them.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: Partitioning feature ...

From

Jaime Casanova

Date:

31 March 2009, 12:21:06

On Tue, Mar 31, 2009 at 9:46 AM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
>>
>> AFAICS, we do not have any category like system triggers. So yeah, it would
>> have been nice to generate triggers with names (starting with __ for
>> example) for such special triggers. But I don't think we disallow
>> user-triggers starting with underscores etc.
>
> We already have system triggers -- the FK triggers.  I don't think we've
> had all that much trouble with them.
>

yeah! but we mark them with pg_trigger.tgisconstraint, but i'm not
completely convinced that we should use that same field


--
Atentamente,
Jaime Casanova
Soporte y capacitación de PostgreSQL
Asesoría y desarrollo de sistemas
Guayaquil - Ecuador
Cel. +59387171157

Re: Partitioning feature ...

From

Tom Lane

Date:

31 March 2009, 12:40:41

Alvaro Herrera <alvherre@commandprompt.com> writes:
> Nikhil Sontakke escribi�:
>>> As triggers are executed in order of their names, we've prefixed the
>>> trigger names with "zz". This should work fine as long as no-one uses
>>> trigger-name which starts with "zz".

>> this seems a lot fragile...

> We already have system triggers -- the FK triggers.  I don't think we've
> had all that much trouble with them.

In the case of the FK triggers, it's intentional (and maybe even
documented) that users should be able to place their own triggers before
or after the FK triggers.  Is there a good reason why partitioning
triggers should be different?  If there is, maybe the feature shouldn't
be implemented via triggers in the first place.
        regards, tom lane

Re: Partitioning feature ...

From

Andrew Dunstan

Date:

31 March 2009, 13:09:07

Tom Lane wrote:
>> We already have system triggers -- the FK triggers.  I don't think we've
>> had all that much trouble with them.
>>     
>
> In the case of the FK triggers, it's intentional (and maybe even
> documented) that users should be able to place their own triggers before
> or after the FK triggers. 

If it's documented I think it's well hidden ;-) ISTM that the fact that 
we implement FK constraints via triggers is really an implementation 
detail, not something the user should be encouraged to mess with.

>  Is there a good reason why partitioning
> triggers should be different?  
>   

Probably not. ISTM that the scheme should turn tgisconstraint into a 
multi-valued item (tgkind: 'u' = userland, 'c'= constraint, 'p' = 
partition or some such).

cheers

andrew

Re: Partitioning feature ...

From

Emmanuel Cecchet

Date:

31 March 2009, 13:50:56

Yes, there is a good reason. As a trigger can update the tuple value, 
this can change the routing decision. If you have a user trigger that 
tries to change the key value after the partition choice has been made, 
this will lead to an integrity constraint violation which is probably 
not what the user expects.
Note that user triggers with partitions will be tricky anyway 
(regardless of how partitioning is implemented, that is with triggers or 
not). If 2 partitions have user triggers that update the key value to 
bounce the tuple to the other partition you may end up with an infinite 
loop.
I am not sure what the semantic of statement triggers (still user 
triggers) should be on partitioned tables.
We will probably have to come up with restrictions on triggers so that 
they can only be applied to the parent table and not on child tables to 
prevent nasty issues.

Emmanuel

Tom Lane wrote:
> In the case of the FK triggers, it's intentional (and maybe even
> documented) that users should be able to place their own triggers before
> or after the FK triggers.  Is there a good reason why partitioning
> triggers should be different?  If there is, maybe the feature shouldn't
> be implemented via triggers in the first place.
>
>                         regards, tom lane
>   

-- 
Emmanuel Cecchet
Aster Data Systems
Web: http://www.asterdata.com

Re: Partitioning feature ...

From

Tom Lane

Date:

31 March 2009, 13:55:57

Emmanuel Cecchet <manu@asterdata.com> writes:
> Yes, there is a good reason. As a trigger can update the tuple value, 
> this can change the routing decision. If you have a user trigger that 
> tries to change the key value after the partition choice has been made, 
> this will lead to an integrity constraint violation which is probably 
> not what the user expects.

[ shrug... ]  Badly written user triggers can break FK constraints,
too.  We've tolerated that in the past because preventing it disables
useful capabilities.

I remain of the opinion that if you think you *have to* execute last,
you should not be writing this as a trigger; you'd be better off
embedding it lower in the system.
        regards, tom lane

Re: Partitioning feature ...

From

Josh Berkus

Date:

31 March 2009, 14:17:10

On 3/31/09 9:45 AM, Emmanuel Cecchet wrote:
> Yes, there is a good reason. As a trigger can update the tuple value,
> this can change the routing decision. If you have a user trigger that
> tries to change the key value after the partition choice has been made,
> this will lead to an integrity constraint violation which is probably
> not what the user expects.

Actually, it's worse.  Depending on the timing of the triggers, it's 
possible to bypass the FK check entirely, and you can end up with 
inconsistent data.

--Josh

Re: Partitioning feature ...

From

Nikhil Sontakke

Date:

01 April 2009, 02:08:36

Hi,

We already have system triggers -- the FK triggers. I don't think we've
had all that much trouble with them.

In the case of the FK triggers, it's intentional (and maybe even
documented) that users should be able to place their own triggers before
or after the FK triggers.

If it's documented I think it's well hidden ;-) ISTM that the fact that we implement FK constraints via triggers is really an implementation detail, not something the user should be encouraged to mess with.

Is there a good reason why partitioning
triggers should be different?

Probably not. ISTM that the scheme should turn tgisconstraint into a multi-valued item (tgkind: 'u' = userland, 'c'= constraint, 'p' = partition or some such).

+1.

This seems to be the best way forward if we stick to triggers for partitioning. I think they appear to serve the purpose well for this use-case and maybe with this scheme they will be low-level enough too.

Regards,
Nikhils
--
http://www.enterprisedb.com