Thread: Partitioning options
Hi,
We have a system that stores customers' transactions. There are ~100K customers currently, and the list will grow in the future, though not drastically (maybe ~50K more or so). We see ~400 million transactions per day, and we want to persist them in our Postgres database for ~5 months.
The key transaction table is going to have ~450 million transactions per day, and data querying/filtering will always happen based on the "transaction_date" column. Joins will mostly happen on the "CUSTOMER_ID" column, along with filters on customer_id in some scenarios. Each day's worth of transactions consumes ~130GB of storage space, as we verified using the "pg_relation_size" function on a sample data set.
As mentioned, there will be ~100K distinct "customer_id" values, but the data won't be equally distributed; it will be skewed. For example, a few big customers will hold the majority of the transactions (say 20-30% of the total), and the rest is spread unevenly across the other customers.
So my question is: in the above scenario, should we go for a composite partitioning strategy, i.e. range/hash (daily range partition by transaction_date and hash subpartition by customer_id)?
OR
Should we go for simple daily range partitioning on the transaction_date column?
OR
Range/list composite partitioning (range partition by transaction_date and list subpartition by customer_id)?
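For concreteness, here is a rough sketch of the simple daily range option. The table and column names are placeholders, not our actual schema:

```sql
-- Parent table, range-partitioned by day on transaction_date.
CREATE TABLE transactions (
    transaction_id   bigint        NOT NULL,
    customer_id      bigint        NOT NULL,
    transaction_date date          NOT NULL,
    amount           numeric(12,2)
) PARTITION BY RANGE (transaction_date);

-- One partition per day, e.g. for 2024-02-08.
CREATE TABLE transactions_2024_02_08
    PARTITION OF transactions
    FOR VALUES FROM ('2024-02-08') TO ('2024-02-09');

-- Index on customer_id to support the joins/filters; creating it on the
-- parent automatically cascades to every partition.
CREATE INDEX ON transactions (customer_id);
```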
Thanks and Regards
Sud
On Thu, Feb 8, 2024 at 12:42 AM sud <suds1434@gmail.com> wrote:
...
The key transaction table is going to have ~450 Million transactions per day and the data querying/filtering will always happen based on the "transaction date" column.
...
Should we go for simple daily range partitioning on the transaction_date column?
This one gets my vote. That and some good indexes.
Cheers,
Greg
Hello Greg,
Out of curiosity: as the OP mentioned there will be joins and also filters on the customer_id column, why don't you think subpartitioning by customer_id would be a good option? I understand list subpartitioning may not be an option, considering that new customer_ids get added slowly over time (and a default list partition may not be allowed), and the OP also mentioned the data distribution on customer_id is skewed. But what is the problem if the OP opts for HASH subpartitioning on customer_id in this situation?
Is it because the number of partitions would be higher? For example:
With simple range partitioning, for 5 months you will have ~150 daily range partitions, and each index adds another set of partition-level relations. E.g. with 10 indexes, the total is ~150 table partitions + (10*150) index partitions = 1,650 relations.
If the OP goes for range-hash, and the hash count usually has to be 2^N (say 8 hash sub-partitions), then the total is (8*150) table partitions + (8*150*10) index partitions = ~13,200 relations.
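The arithmetic above can be double-checked with a quick sketch (the counts are illustrative, matching the assumptions in this thread):

```python
# Rough partition-count arithmetic for the two layouts being compared.
DAYS = 150          # ~5 months of daily partitions
INDEXES = 10        # indexes defined on the partitioned table
HASH_SUBPARTS = 8   # hash sub-partitions per day (2^3)

# Simple daily range partitioning: one table partition per day,
# plus one partition-level index relation per index per day.
flat = DAYS + DAYS * INDEXES

# Range-hash composite: each day is split into 8 hash sub-partitions,
# each of which again carries its own index relations.
composite = DAYS * HASH_SUBPARTS + DAYS * HASH_SUBPARTS * INDEXES

print(flat, composite)  # 1650 13200
```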
Though there is no hard limit on the number of partitions in Postgres, serious issues have been reported in the past with higher partition counts. One such report is below. Is this the reason?
https://www.kylehailey.com/post/postgres-partition-pains-lockmanager-waits
Regards
Veem
On 2/8/24 1:43 PM, veem v wrote:
<snip>
The issue with partitioning by customer_id is that it won't do much (if anything) to improve data locality.
When partitioning by date, you can at least benefit from partition elimination *IF* your most frequent queries limit the number of days that the query will look at. Per the OP, all queries will include transaction date. Note that does NOT actually mean the number of days/partitions will be limited (i.e., WHERE date > today - 150 will hit all the partitions), but if we assume that the majority of queries will limit themselves to the past few days, then partitioning by date should greatly increase data locality.

Also, when it comes to customer partitioning... what you probably want there isn't partitioning but sharding.

-- Jim Nasby, Data Architect, Austin TX
Out of curiosity, As OP mentioned that there will be Joins and also filters on column Customer_id column , so why don't you think that subpartition by customer_id will be a good option? I understand List subpartition may not be an option considering the new customer_ids gets added slowly in the future(and default list may not be allowed) and also OP mentioned, there is skewed distribution of data for customer_id column. However what is the problem if OP will opt for HASH subpartition on customer_id in this situation?
It doesn't really gain you much, given you would be hashing it, the customers are unevenly distributed, and OP talked about filtering on the customer_id column. A hash partition would just be a lot more work and complexity for us humans and for Postgres. Partitioning for the sake of partitioning is not a good thing. Yes, smaller tables are better, but they have to be smaller targeted tables.
sud wrote:
130GB of storage space as we verified using the "pg_relation_size" function, for a sample data set.
You might also want to closely examine your schema. At that scale, every byte saved per row can add up.
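As one hypothetical illustration of byte-level savings (these table and column names are made up for the example): ordering fixed-width columns from widest to narrowest avoids alignment padding:

```sql
-- On most platforms an int4 followed by an int8 forces 4 bytes of
-- alignment padding per row, because int8 is 8-byte aligned.
-- Padded layout (~4 wasted bytes per row):
CREATE TABLE t_padded (flag int4, customer_id int8, amount int8);
-- Tighter layout, same columns:
CREATE TABLE t_tight  (customer_id int8, amount int8, flag int4);
-- At ~450M rows/day, 4 bytes/row is roughly 1.8 GB/day of storage.
```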
Cheers,
Greg
Hi Sud,
I would not look at HASH partitioning, as it is very expensive to add or remove partitions.
I would probably look at nested partitioning: by customer ID using a range or list of IDs, then by transaction date. It's easy to add partitions and rebalance the partition segments.
Keep in mind that SELECT queries against the partitioned table must use the partitioning KEY in the WHERE clause, or performance will suffer.
I suggest doing a query analysis before deploying partitioning, to confirm that the queries' WHERE clauses match the planned partition rule. If at least ~80% of the executed queries don't match the partition rule, either don't deploy partitioning or change the queries in the application to match the rule.
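One quick way to check whether a given query benefits from the partition key is EXPLAIN. Assuming a transactions table range-partitioned daily on transaction_date (names are placeholders):

```sql
-- Partition pruning only happens when the partition key appears in the
-- WHERE clause; EXPLAIN shows which partitions the planner will scan.
EXPLAIN (COSTS OFF)
SELECT count(*)
FROM transactions
WHERE transaction_date = DATE '2024-02-08'
  AND customer_id = 12345;
-- With daily range partitions, the plan should touch only the single
-- 2024-02-08 partition instead of all ~150.
```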
On Thu, Feb 8, 2024 at 10:25 PM Justin <zzzzz.graf@gmail.com> wrote:
Hi Sud,
Would not look at HASH partitioning as it is very expensive to add or subtract the number of partitions.
Would probably look at a nested partitioning using customer ID using range or list of IDs then by transaction date, Its easy to add partitions and balance the partitions segments.
I would not do that, because then, when getting rid of obsolete data, you must delete a huge number of records and vacuum each partition.
If you partition by date, you greatly ease the cleanup by just getting rid of the obsolete partitions, which is quite speedy (no delete, no vacuum, no index updates, ...).
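For example, assuming daily partitions named transactions_YYYY_MM_DD (placeholder names), retention becomes a metadata-only operation:

```sql
-- Retention with daily range partitions: detach and drop the oldest day.
-- Near-instant compared to DELETE + VACUUM on one big shared table.
ALTER TABLE transactions DETACH PARTITION transactions_2023_09_10;
DROP TABLE transactions_2023_09_10;
```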
Marc
Hi Marc,
Nested partitioning still allows for simple data deletion by dropping the table that falls in that date range.
You are probably thinking of partitioning by multicolumn rules, which is very complex to set up.
"Would probably look at a nested partitioning"

I'm not the original poster, but I have a schema with nested (composite) partitions and I do run into some significant inefficiencies compared to flat partitions in various schema metadata operations (queries to get the list of tables, creating foreign keys, etc.) in tables with 1,000+ total partitions.

One example: https://www.postgresql.org/message-id/CAE%2BE%3DSQacy6t_3XzCWnY1eiRcNWfz4pp02FER0N7mU_F%2Bo8G_Q%40mail.gmail.com

Alec
On Sun, Feb 18, 2024 at 5:20 PM Alec Lazarescu <alecl@alecl.com> wrote:
"Would probably look at a nested partitioning"
I'm not the original poster, but I have a schema with nested
(composite) partitions and I do run into some significant
inefficiencies compared to flat partitions in various schema metadata
operations (queries to get the list of tables, creating foreign keys,
etc.) in tables with 1,000+ total partitions.
One example: https://www.postgresql.org/message-id/CAE%2BE%3DSQacy6t_3XzCWnY1eiRcNWfz4pp02FER0N7mU_F%2Bo8G_Q%40mail.gmail.com
Alec
Hi Alec,
would need to see the DDL of the partitions and the queries accessing these partitions to have an opinion
Thank you
Justin
Hi, Justin. The example link has self-contained DDL to create the partitions (in flat vs composite mode, for comparison) and then make the FKs on each, showing the marked speed difference for the same net number of partitions (1200 flat vs 80x15 = 1200 composite): https://www.postgresql.org/message-id/CAE%2BE%3DSQacy6t_3XzCWnY1eiRcNWfz4pp02FER0N7mU_F%2Bo8G_Q%40mail.gmail.com

Alec