Thread: suggestion about time based partitioning and hibernate

suggestion about time based partitioning and hibernate

From
Luca Ferrari
Date:
Dear all,
I'm looking for ideas here, and it could be someone already stepped
into declarative partitioning of an existing database where Hibernate
(a Java ORM) handles the tables.
The situation is as follows:

create table foo( id primary key, a_date date, ... );

Now, the trivial way to partition this would be on a range based on
a_date, so that the primary key of the tables shifts from id to (id,
a_date). One thing that frightens me is that Hibernate does a lot of
per-row lookups by means of the id, so while the partitioning is
probably going to make things more manageable and even faster in some
scenarios, could lead to drawbacks when Hibernate queries by id.
Moreover, hibernate will think id is unique while it is not anymore.
Last but not least, referencing foreign keys are made by Hibernate
thru the id column, and it means that incoming foreign keys to foo
will not be in place anymore.

Now, I know that I can define a composite key in hibernate, in order
to match the effective new data structure, but this requires a huge
rewrite of the application code.
And after all, we are talking about a non-PostgreSQL related piece, so
the problem is not on the PostgreSQL side.

Anyone has already done a partitioning in such scenario?

I am thinking that partitioning on an hash of id could be the only way
to go without having to touch the hibernate side, even if this would
bring up a less balanced partitioned structure. In such case, I mean
partitioning by hash, having a table with 60 millions rows per 50 GB
in size, what would be the rule of thumb to select the number of
partitions (i.e., a suggested modulus)?

I will appreciate any suggestion.

Thanks,
Luca



Re: suggestion about time based partitioning and hibernate

From
Ron
Date:
On 7/18/23 01:18, Luca Ferrari wrote:
Dear all,
I'm looking for ideas here, and it could be someone already stepped
into declarative partitioning of an existing database where Hibernate
(a Java ORM) handles the tables.
The situation is as follows:

create table foo( id primary key, a_date date, ... );

Now, the trivial way to partition this would be on a range based on
a_date, so that the primary key of the tables shifts from id to (id,
a_date). One thing that frightens me is that Hibernate does a lot of
per-row lookups by means of the id, so while the partitioning is
probably going to make things more manageable and even faster in some
scenarios, could lead to drawbacks when Hibernate queries by id.
Moreover, hibernate will think id is unique while it is not anymore.
Last but not least, referencing foreign keys are made by Hibernate
thru the id column, and it means that incoming foreign keys to foo
will not be in place anymore.

Now, I know that I can define a composite key in hibernate, in order
to match the effective new data structure, but this requires a huge
rewrite of the application code.
And after all, we are talking about a non-PostgreSQL related piece, so
the problem is not on the PostgreSQL side.

Anyone has already done a partitioning in such scenario?

I am thinking that partitioning on an hash of id could be the only way
to go without having to touch the hibernate side, even if this would
bring up a less balanced partitioned structure. In such case, I mean
partitioning by hash, having a table with 60 millions rows per 50 GB
in size, what would be the rule of thumb to select the number of
partitions (i.e., a suggested modulus)?

We're in the exact same situation, using Hibernate and having many tables like foo, partitioned by a_date with PK of (id, a_date) and FK definitions (id, a_date).

It was a massive mistake, since many queries span partitions.  Within a year, I "departitioned" all tables except two giant tables that have large bytea columns.  (All the formerly partitioned tables still have (id, a_date) as PK and FK.  I'd like to change that, but the amount of code change is untennable given them amount of new features that need to be added.)

Thus, my recommendations are:
1. only partition the tables you must, and
2. partition by id.

--
Born in Arizona, moved to Babylonia.

 

 

От: Ron <ronljohnsonjr@gmail.com>
Отправлено: 18 июля 2023 г. 9:48
Кому: pgsql-general@lists.postgresql.org
Тема: Re: suggestion about time based partitioning and hibernate

 

On 7/18/23 01:18, Luca Ferrari wrote:

Dear all,
I'm looking for ideas here, and it could be someone already stepped
into declarative partitioning of an existing database where Hibernate
(a Java ORM) handles the tables.
The situation is as follows:
 
create table foo( id primary key, a_date date, ... );
 
Now, the trivial way to partition this would be on a range based on
a_date, so that the primary key of the tables shifts from id to (id,
a_date). One thing that frightens me is that Hibernate does a lot of
per-row lookups by means of the id, so while the partitioning is
probably going to make things more manageable and even faster in some
scenarios, could lead to drawbacks when Hibernate queries by id.
Moreover, hibernate will think id is unique while it is not anymore.
Last but not least, referencing foreign keys are made by Hibernate
thru the id column, and it means that incoming foreign keys to foo
will not be in place anymore.
 
Now, I know that I can define a composite key in hibernate, in order
to match the effective new data structure, but this requires a huge
rewrite of the application code.
And after all, we are talking about a non-PostgreSQL related piece, so
the problem is not on the PostgreSQL side.
 
Anyone has already done a partitioning in such scenario?
 
I am thinking that partitioning on an hash of id could be the only way
to go without having to touch the hibernate side, even if this would
bring up a less balanced partitioned structure. In such case, I mean
partitioning by hash, having a table with 60 millions rows per 50 GB
in size, what would be the rule of thumb to select the number of
partitions (i.e., a suggested modulus)?


We're in the exact same situation, using Hibernate and having many tables like foo, partitioned by a_date with PK of (id, a_date) and FK definitions (id, a_date).

It was a massive mistake, since many queries span partitions.  Within a year, I "departitioned" all tables except two giant tables that have large bytea columns.  (All the formerly partitioned tables still have (id, a_date) as PK and FK.  I'd like to change that, but the amount of code change is untennable given them amount of new features that need to be added.)

Thus, my recommendations are:
1. only partition the tables you must, and
2. partition by id.

--
----------------------------------------

Hi!

 

It is always depends on database architecture. Partitioning should be planned in advance and your queries must be aware of it.

There are a number of caveats here

- if your database grows indefinitely you must have a dynamic partitioning. As for me I have created a service procedure to create new partitions to handle new data, and a backup procedure to check existence of partitions on every insert.

- if database is more or less finite and predictable I have created all the partitions created in advance at once

 

Next I give up range partitioning, I have created a surrogate key function and partition by it. Function is available on *client* and server, so every time I want to query data I can calculate this function and address data directly in known partition. In Hibernate too, just pass both elements of *composite* key.

Partition key MUST be in primary key, so query becomes cheap.

In case on subquery it is a matter of query optimizer if it is able to understand composite keys and do partition pruning. PG 12 can not, you have to tweak queries.

 

To summarize, if you query by ID – create a surrogate key from ID and pass it to query with original ID. All the FK data must follow the same schema, partition key must remain the same across the tables. If on the righth, other end of database schema you need other way of partitioning – I have created extra column to store “left side” partitioning key, so joins may respect all partitioning schemas.

 

What you should consider more – query optimizer. Once it may decide to create a general plan instead of custom. This will lead to locking all the (thousands of) partitions, than applying parameters, than pruning. In my case it was nearly half a million of locks for single query.

There was a discussion here, and there was a proposed patch to exchange pruning-analyzing, I do not know it`s faith. But I given up setting force_custom_plan and this speed up my queries by more than 10x and no locks, ~ 5 per query.

Re: suggestion about time based partitioning and hibernate

From
Alvaro Herrera
Date:
On 2023-Jul-18, Luca Ferrari wrote:

> Dear all,
> I'm looking for ideas here, and it could be someone already stepped
> into declarative partitioning of an existing database where Hibernate
> (a Java ORM) handles the tables.
> The situation is as follows:
> 
> create table foo( id primary key, a_date date, ... );
> 
> Now, the trivial way to partition this would be on a range based on
> a_date, so that the primary key of the tables shifts from id to (id,
> a_date).

Hmm.  If you can make partitioning give you some benefit without having
to change the model, then by all means explore it.  But if you're forced
to change the model (in this case, by adding a column to your primary
key), that is going to cause you lots of pain, and some queries might
become slower rather than faster.  Therefore I suggest to avoid doing
that.  Either look at some other partitioning scheme that doesn't
involve adding columns to the primary key, or disregard partitioning for
this table entirely.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/



Re: suggestion about time based partitioning and hibernate

From
Luca Ferrari
Date:
On Wed, Jul 19, 2023 at 6:45 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> Therefore I suggest to avoid doing
> that.  Either look at some other partitioning scheme that doesn't
> involve adding columns to the primary key, or disregard partitioning for
> this table entirely.

What do you mean by "other partitioning scheme"? There is nothing that
comes into my mind at the moment.
The problem is that, unluckily, the table is already greater than 50GB
in seize and is keep growing, so I guess something has to be done, at
least for a manainance point of view.

Would a partition by hash on the single column primary key be such a bad idea?

Thanks,
Luca



Re: suggestion about time based partitioning and hibernate

From
Ron
Date:
On 7/20/23 10:31, Luca Ferrari wrote:
> On Wed, Jul 19, 2023 at 6:45 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>> Therefore I suggest to avoid doing
>> that.  Either look at some other partitioning scheme that doesn't
>> involve adding columns to the primary key, or disregard partitioning for
>> this table entirely.
> What do you mean by "other partitioning scheme"? There is nothing that
> comes into my mind at the moment.
> The problem is that, unluckily, the table is already greater than 50GB
> in seize and is keep growing, so I guess something has to be done, at
> least for a manainance point of view.
>
> Would a partition by hash on the single column primary key be such a bad idea?

Just do a range partition by the existing PK.

-- 
Born in Arizona, moved to Babylonia.



Re: suggestion about time based partitioning and hibernate

From
Marc Millas
Date:

On Tue, Jul 18, 2023 at 8:18 AM Luca Ferrari <fluca1978@gmail.com> wrote:
Dear all,
I'm looking for ideas here, and it could be someone already stepped
into declarative partitioning of an existing database where Hibernate
(a Java ORM) handles the tables.
The situation is as follows:

create table foo( id primary key, a_date date, ... );

Now, the trivial way to partition this would be on a range based on
a_date, so that the primary key of the tables shifts from id to (id,
a_date). One thing that frightens me is that Hibernate does a lot of
per-row lookups by means of the id, so while the partitioning is
probably going to make things more manageable and even faster in some
scenarios, could lead to drawbacks when Hibernate queries by id.
Moreover, hibernate will think id is unique while it is not anymore.
Last but not least, referencing foreign keys are made by Hibernate
thru the id column, and it means that incoming foreign keys to foo
will not be in place anymore.

Now, I know that I can define a composite key in hibernate, in order
to match the effective new data structure, but this requires a huge
rewrite of the application code.
And after all, we are talking about a non-PostgreSQL related piece, so
the problem is not on the PostgreSQL side.

Anyone has already done a partitioning in such scenario?

I am thinking that partitioning on an hash of id could be the only way
to go without having to touch the hibernate side, even if this would
bring up a less balanced partitioned structure. In such case, I mean
partitioning by hash, having a table with 60 millions rows per 50 GB
in size, what would be the rule of thumb to select the number of
partitions (i.e., a suggested modulus)?

I will appreciate any suggestion.

Thanks,
Luca

Hello,

quite often, in such cases, the id is a sequence or something like that.
so, partitioning on id range or a_date range is similar, on a maintenance point of vue. 
and, obviously if you partition by id range, no pk pb.
the difference is the SQL plan to access the data, as to prune partitions, Postgres need to have the partition key in the where clause.
obviously, if your code do access the data thru id... then partition by id.

  Marc MILLAS