Thread: Partitioning and unique key
Hello,
We have our transaction tables daily range partitioned based on transaction_timestamp column which is timestamptz data type and these are having composite primary key on (transaction_id, transaction_timestamp). And we were using an "insert on conflict" for loading data to our system , which means if another record comes to the system with the same transaction_id and transaction_timestamp, it will get updated. This way we already have 60 days worth of data stored in our system with approx. 70 million transactions per day.
But we just got to know from business that the data should be unique by only transaction_id but not transaction_timestamp. Any incoming data with the same transaction_id(even different transaction_timestamp) should get updated but not inserted.
Also these daily partitions are going to hold 400million rows in future and will be queried on the transaction_timestamp filter so we can't really avoid the partitioning option here considering future growth.
But due to postgres limitations we are unable to have this unique constraint or primary key only on the transaction_id column, we have to include transaction_timestamp with it as a composite key. So I want to understand from experts if there is any possible way to satisfy both partitioning on transaction_timestamp column and unique key or pk just on transaction_id only?
Note-its 15.4 postgres database.
Regards
Veem
On 8/31/24 13:02, veem v wrote: > Hello, > > We have our transaction tables daily range partitioned based on > transaction_timestamp column which is timestamptz data type and these > are having composite primary key on (transaction_id, > transaction_timestamp). And we were using an "insert on conflict" for > loading data to our system , which means if another record comes to the > system with the same transaction_id and transaction_timestamp, it will > get updated. This way we already have 60 days worth of data stored in > our system with approx. 70 million transactions per day. > > But we just got to know from business thatthe data should be unique by > only transaction_id but not transaction_timestamp. Any incoming data > with the same transaction_id(even different transaction_timestamp) > should get updated but not inserted. > > Also these daily partitions are going to hold 400million rows in future > and will be queried on the transaction_timestamp filter so we can't > really avoid the partitioning option here considering future growth. > > But due to postgres limitations we are unable to have this unique > constraint or primary key only on the transaction_id column, we have to > include transaction_timestamp with it as a composite key. So I want to > understand from experts if there is any possible way to satisfy both > partitioning on transaction_timestamp column and unique key or pk just > on _trans_action_id only? The model is at odds with itself and untenable. If the tables hold multiple rows for a given transaction_id then you cannot have a PK/Unique constraint on that column. Seems there is a decided lack of any planning. The only way I can see this happening is consolidating all the duplicate transaction_id rows into a single row for each transaction_id. That then leads to the question of how to do that and retain the 'correct' information from the selection of rows for each transaction_id. > > Note-its 15.4 postgres database. > > > Regards > > Veem > -- Adrian Klaver adrian.klaver@aklaver.com
The model is at odds with itself and untenable. If the tables hold
multiple rows for a given transaction_id then you cannot have a
PK/Unique constraint on that column. Seems there is a decided lack of
any planning. The only way I can see this happening is consolidating all
the duplicate transaction_id rows into a single row for each
transaction_id. That then leads to the question of how to do that and
retain the 'correct' information from the selection of rows for each
transaction_id.
SELECT column1_id, column2_timestamptz,
ROW_NUMBER() OVER (PARTITION BY column1_id, date_trunc('day', column2_timestamptz)
ORDER BY column2_timestamptz DESC) AS rn
FROM partition_name
)
DELETE FROM partition_name T1
WHERE EXISTS (
SELECT 1
FROM ranked_records T2
WHERE T1.column1_id = T2.column1_id
AND T1.column2_timestamptz = T2.column2_timestamptz
AND T2.rn > 1
)
1) if it's technically possible to have a unique key on only the transaction_id column having the partition key on the transaction_timestamp, because the table is going to be queried/purged based on the transaction_timestamp?
iii)And then alter the datatype of the partition key transaction_date to DATE in one shot at the table level(which should be fast as its having more granularity as compare to existing timestamptype, so should be catalog or dictionary change only), and that will remain the part of composite PK (transaction_id,transaction_date).
iv) Repeat this step for all child partition tables and then for the parent partition tables.
On Saturday, August 31, 2024, veem v <veema0000@gmail.com> wrote:iii)And then alter the datatype of the partition key transaction_date to DATE in one shot at the table level(which should be fast as its having more granularity as compare to existing timestamptype, so should be catalog or dictionary change only), and that will remain the part of composite PK (transaction_id,transaction_date).While this might seem logical, in reality date and timestamptz are different fixed-width data types and thus any attempt to change from one to the other will involve a table rewrite. Best you could do is leave the timestamptz in place and just truncate to day so the time is always midnight UTC.
On Sun, 1 Sept 2024 at 09:13, David G. Johnston <david.g.johnston@gmail.com> wrote:On Saturday, August 31, 2024, veem v <veema0000@gmail.com> wrote:iii)And then alter the datatype of the partition key transaction_date to DATE in one shot at the table level(which should be fast as its having more granularity as compare to existing timestamptype, so should be catalog or dictionary change only), and that will remain the part of composite PK (transaction_id,transaction_date).While this might seem logical, in reality date and timestamptz are different fixed-width data types and thus any attempt to change from one to the other will involve a table rewrite. Best you could do is leave the timestamptz in place and just truncate to day so the time is always midnight UTC.Here , if we keep the PK column as is i.e. the transaction_timestamp as timestamptz but truncate the time component , in that case again in future if someone tries to insert(using insert on conflict) data into the table with time component , it will get consumed and will not be restricted by the PK constraint. So I was trying to make the data type also as DATE for the transaction_timestap column.As in this case anyway we have to create another column to populate the date+timestamp values as we cant throw those values away per business need, so we will be kind of rewriting the table.So is it okay if if we will1) Detach all the partitions.2)Do the alter using "only" key word in table level. (For adding new column transaction_timestamp_new to hold date+timestamp value and also altering the existing transaction_timestamp column to DATE from type timestamptz).3)Then do the data fix(delete the duplicates) and alter the column, one partition at a time for all of the partitions and once done , attach those partitions one by one.5)Rename the columns at table level.Hope this won't need any table rewrite.Is there any downside if we go by the above approach?
On Sun, 1 Sept 2024 at 10:03, veem v <veema0000@gmail.com> wrote:On Sun, 1 Sept 2024 at 09:13, David G. Johnston <david.g.johnston@gmail.com> wrote:On Saturday, August 31, 2024, veem v <veema0000@gmail.com> wrote:iii)And then alter the datatype of the partition key transaction_date to DATE in one shot at the table level(which should be fast as its having more granularity as compare to existing timestamptype, so should be catalog or dictionary change only), and that will remain the part of composite PK (transaction_id,transaction_date).While this might seem logical, in reality date and timestamptz are different fixed-width data types and thus any attempt to change from one to the other will involve a table rewrite. Best you could do is leave the timestamptz in place and just truncate to day so the time is always midnight UTC.Here , if we keep the PK column as is i.e. the transaction_timestamp as timestamptz but truncate the time component , in that case again in future if someone tries to insert(using insert on conflict) data into the table with time component , it will get consumed and will not be restricted by the PK constraint. So I was trying to make the data type also as DATE for the transaction_timestap column.As in this case anyway we have to create another column to populate the date+timestamp values as we cant throw those values away per business need, so we will be kind of rewriting the table.So is it okay if if we will1) Detach all the partitions.2)Do the alter using "only" key word in table level. (For adding new column transaction_timestamp_new to hold date+timestamp value and also altering the existing transaction_timestamp column to DATE from type timestamptz).3)Then do the data fix(delete the duplicates) and alter the column, one partition at a time for all of the partitions and once done , attach those partitions one by one.5)Rename the columns at table level.Hope this won't need any table rewrite.Is there any downside if we go by the above approach?Or do you mean to say there is no way we can modify the data type of a partition key even by detaching the partitions one by one? And thus we may have only way left is to create the table from scratch with partitions and populate the data to it? I was avoiding this because we have many indexes also in it , so creating from scratch means creating those indexes again. So I wanted to achieve it by detaching partitions, doing the required change and attaching it again.
On Sun, 2024-09-01 at 01:32 +0530, veem v wrote: > due to postgres limitations we are unable to have this unique constraint or primary key > only on the transaction_id column, we have to include transaction_timestamp with it as > a composite key. So I want to understand from experts if there is any possible way to > satisfy both partitioning on transaction_timestamp column and unique key or pk just on > transaction_id only? No, you cannot have both. Usually the solution is to *not* create a primary key on the partitioned table and instead create a primary key on each partition. That won't guarantee global uniqueness (and there is no way to do that), but it goes a long way by ensuring that the column is unique within each partition. Yours, Laurenz Albe
On Sun, 2024-09-01 at 01:32 +0530, veem v wrote:
> due to postgres limitations we are unable to have this unique constraint or primary key
> only on the transaction_id column, we have to include transaction_timestamp with it as
> a composite key. So I want to understand from experts if there is any possible way to
> satisfy both partitioning on transaction_timestamp column and unique key or pk just on
> transaction_id only?
No, you cannot have both.
Usually the solution is to *not* create a primary key on the partitioned table
and instead create a primary key on each partition.
That won't guarantee global uniqueness (and there is no way to do that), but it
goes a long way by ensuring that the column is unique within each partition.
Yours,
Laurenz Albe
On Mon, 2024-09-02 at 21:39 +0530, veem v wrote: > On Mon, 2 Sept 2024 at 19:13, Laurenz Albe <laurenz.albe@cybertec.at> wrote: > > On Sun, 2024-09-01 at 01:32 +0530, veem v wrote: > > > due to postgres limitations we are unable to have this unique constraint or primary key > > > only on the transaction_id column, we have to include transaction_timestamp with it as > > > a composite key. So I want to understand from experts if there is any possible way to > > > satisfy both partitioning on transaction_timestamp column and unique key or pk just on > > > transaction_id only? > > > > No, you cannot have both. > > > > Usually the solution is to *not* create a primary key on the partitioned table > > and instead create a primary key on each partition. > > > > That won't guarantee global uniqueness (and there is no way to do that), but it > > goes a long way by ensuring that the column is unique within each partition. > > So it means in our case the existing PK on table level on column (txn_id and txn_timestamp), > we should drop that and create a unique index on each partition level and also the same way > the foreign key also maps to the parent table partitions. And in that case , can we follow > this as best practices to not have the primary keys defined at the tabe level at all, but > on the partition level only, or there exist any down side to it too? Please suggest. You can keep the primary key defined on both columns if it is good enough for you. But it will give you lower guarantees of uniqueness: with that primary key, there could be two rows with a different timestamp, but the same "txn_id", and these two rows could be in the same partition... Also, if you need a foreign key pointing *to* the partitioned table, you cannot do without a primary key. But I recommend that you do *not* define such foreign keys: they will make it more difficult to detach a partition. If you partition two tables in the same way, you can use foreign keys between the partitions instead of foreign keys between the partitioned tables. Such foreign keys won't be a problem. > Also then what I see is, it will make the data load query fail which uses "insert on conflict" > to insert data into the table and that requires the primary key on both the columns to have > on table level. Yes, that is true. A disadvantage of not having a unique constraint on the partitioned table. > Also the partition maintenance job which uses partman extension uses the template table which > in turn uses table level properties for creating new partitions and they will not have these > unique indexes created for the new partitions as because the unique index property is not on > the table level but partition level. Can you share your thoughts on these? Don't use partman. Or if you do, create the primary key yourself, after partman has created the partition. I wouldn't let the limitations of a tool govern my design choices. Yours, Laurenz Albe
You can keep the primary key defined on both columns if it is good enough for you.
But it will give you lower guarantees of uniqueness: with that primary key, there could
be two rows with a different timestamp, but the same "txn_id", and these two rows could
be in the same partition...
Also, if you need a foreign key pointing *to* the partitioned table, you cannot do without
a primary key. But I recommend that you do *not* define such foreign keys: they will make
it more difficult to detach a partition.
If you partition two tables in the same way, you can use foreign keys between the partitions
instead of foreign keys between the partitioned tables. Such foreign keys won't be a problem.
On Tue, 2024-09-03 at 10:39 +0530, veem v wrote: > As you rightly said "they will make it more difficult to detach a partition." , > we are really seeing a longer time when detaching parent table partitions. > It runs forever sometimes. So do you mean it's because we have primary key > defined table level or it's because we have FK defined in table level > (for multiple child tables which are also partitioned)? I'd say it is because of the foreign key. If you have a foreign key that points to a partitioned table, and you detach a partition, PostgreSQL has to verify that that won't violate the constraint, so it has to scan the tables, which will take time if the partitions are large. > We were thinking it's because we have FK defined on tablelevel , so we were > planning to make the FK on partition level. Good move. > But as you just pointed now , even keeping the PK on table level will also > make the detach partition slow? I understand, for detaching partitions , > it may be scanning while child because of the FK defined on the table level. > but i am unable to understand how the table level PK impacts the detach > partition from parent here. No, a primary key on the partitioned table won't be a problem for performance. My concern was that if what you really would like is "id" to be unique, how does a primary key on (id, some_timestamp) benefit you? > My understanding is PK can only be created on table level but not on the > partition level. On the partition level we only can have a "unique index" > defined. Correct me if my understanding is wrong. No, you can define a primary key on the partition. That is, if you have no primary key on the partitioned table. A primary key on the partitioned table is a primary key on each partition, and a table can only have a single primary key, so adding another primary key on the partition would cause an error. Yours, Laurenz Albe