Re: How to do faster DML - Mailing list pgsql-general

From veem v
Subject Re: How to do faster DML
Date
Msg-id CAB+=1TUQ2OHRqfGupzJHGwNDAcWksR-F9fWzQBFpCU80iiwDTw@mail.gmail.com
In response to Re: How to do faster DML  (Lok P <loknath.73@gmail.com>)
List pgsql-general
I don't have much experience with Postgres at this scale, so others may have better suggestions. However, at this scale you will end up with a single table of ~30TB+, and someone earlier in this thread pointed to ~180 partitions per table as a good-to-have limit; if that holds, it works out to ~170GB per partition. That looks bulky for a system where read response time is expected to be in milliseconds.

On Mon, 5 Feb 2024 at 16:51, Lok P <loknath.73@gmail.com> wrote:


On Mon, 5 Feb, 2024, 2:21 am Lok P, <loknath.73@gmail.com> wrote:

On Sun, Feb 4, 2024 at 9:18 PM Greg Sabino Mullane <htamfids@gmail.com> wrote:
Partitioning will depend a lot on how you access the data. Is it always using that transaction_date column in the where clause? Can you share the table definition via \d? We've talked about this table quite a bit, but not actually seen what it looks like. We are flying blind a little bit. You mentioned your queries are slow, but can you share an explain analyze on one of these slow queries?
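
For example, something along these lines in psql (the table name and the filter values here are just placeholders):

    \d transaction_table

    EXPLAIN (ANALYZE, BUFFERS)
    SELECT *
      FROM transaction_table
     WHERE transaction_date >= DATE '2024-01-01'
       AND transaction_date <  DATE '2024-01-08';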

45 billion rows is quite manageable. How many rows are in each day? You may want to do a larger bucket than 24 hours per partition.

1) Even after partitioning the target table, to speed up the data load on this table, is there an option to disable the primary and foreign keys and re-enable them once the data load finishes? Would that be a good idea?

No.

3) As the size of the table and of each partition is going to be very large, and this will be a read-intensive application, compressing the historical partitions would help us save storage space and should also help read query performance.

I am not sure what this means. If you are not going to need the data anymore, dump the data to deep storage and drop the partition.
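
Roughly, for one old partition (table, database and file names below are made up):

    -- shell: dump the partition in custom format to cheap storage
    pg_dump -Fc -t transactions_2023_10 mydb > transactions_2023_10.dump

    -- psql: detach it from the partitioned table, then drop it
    ALTER TABLE transactions DETACH PARTITION transactions_2023_10;
    DROP TABLE transactions_2023_10;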

Cheers,
Greg
 

Thank you.

The table has ~127 columns of different data types: combinations of numeric, varchar, date, etc. Its current size is ~1TB, holding ~3 billion rows, and the row size is ~300 bytes.

Currently it holds a smaller volume, but in the future the daily transactions inserted into this table will be at most ~500 million rows/day. The plan is to persist at least ~3 months of transaction data, which will be around 45 billion rows in total. If all works well, we may need to persist ~6 months' worth of data in this database, which will be ~90 billion rows.

This table will always be queried with the transaction_date column as one of the filter criteria, but the query/search criteria can span anywhere from a day to a month's worth of transaction_date data.

When you said "You may want to do a larger bucket than 24 hours per partition", do you mean partition by week or so? As per the current math, ~3 billion rows take ~1TB of storage, so a daily range partition (holding ~500 million transactions/day) will be around ~100-150GB. Won't that be too much data for a single partition to operate on, and won't making the buckets even coarser (say, weekly) make the partitions bulkier still?
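
Just to make sure I understand the weekly idea, something like this? (The column names below are only placeholders, not our real DDL; I understand any primary key on a partitioned table has to include the partition key.)

    CREATE TABLE transactions (
        txn_id           bigint        NOT NULL,
        transaction_date date          NOT NULL,
        amount           numeric(12,2),
        -- ... the remaining ~120 columns ...
        PRIMARY KEY (txn_id, transaction_date)
    ) PARTITION BY RANGE (transaction_date);

    -- one partition per week instead of per day
    CREATE TABLE transactions_2024_02_05
        PARTITION OF transactions
        FOR VALUES FROM ('2024-02-05') TO ('2024-02-12');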

What I meant was, the users will definitely need the data for querying purposes, but just to keep the storage space under control (and to help the read queries), we were thinking of having the historical partitions compressed. Is there any compression strategy we should follow on Postgres for that?
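
For context on what we have looked at so far (please correct me if I have this wrong): as far as I understand, core Postgres has no transparent partition-level compression, and TOAST compression only applies to individual values wide enough to be compressed or stored out of line, which our ~300-byte rows mostly are not. So the options we see are the column-level compression setting on wide columns (PostgreSQL 14+), or moving cold partitions onto cheaper/compressed storage via a tablespace, e.g. (column, partition and tablespace names below are made up):

    -- PostgreSQL 14+: use lz4 for TOAST compression of a wide column;
    -- only affects new values that are large enough to be toasted
    ALTER TABLE transactions_2023_10 ALTER COLUMN payload SET COMPRESSION lz4;

    -- park a cold partition on a tablespace backed by cheaper/compressed storage
    -- ('archive_space' is a hypothetical tablespace)
    ALTER TABLE transactions_2023_10 SET TABLESPACE archive_space;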

With regard to loading data into the table faster, I am wondering why you said 'No' to the approach of loading the data first and enabling/creating the primary key and foreign key constraints later. This table is a child table and the parent already has data in it, so loading data into this table with the PK and FK in place makes it very slow, as they are validated for each batch of rows. So we were thinking that doing that validation at a later stage, in one shot, would be a faster approach. Please suggest.
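
To be concrete about what we had in mind (the constraint, column and table names below are made up, and this is only a sketch of the idea, not a recommendation):

    -- drop the FK before the bulk load so each batch is not checked row by row
    ALTER TABLE transactions DROP CONSTRAINT IF EXISTS transactions_parent_fk;

    -- ... bulk load via COPY / INSERT ...

    -- re-add the FK afterwards; all the loaded rows are validated here in one pass,
    -- and the statement fails if any row has no matching parent
    ALTER TABLE transactions
        ADD CONSTRAINT transactions_parent_fk
        FOREIGN KEY (parent_id) REFERENCES parent_table (parent_id);

We are aware Postgres has no Oracle-style ENABLE/DISABLE for PK/FK constraints, so this would mean actually dropping and re-creating them, and that any orphan rows (or, for the PK, duplicates) would only surface at the re-create step.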

I will try to collect some of the SELECT queries and post the EXPLAIN ANALYZE output. Currently we are trying to get rid of the duplicates.
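
For the duplicate cleanup, one pattern we are looking at is roughly this (assuming txn_id plus transaction_date is what identifies a duplicate; column names are placeholders); does this make sense at this row count?

    -- keep one physical copy of each duplicated row, delete the rest
    DELETE FROM transactions a
    USING transactions b
    WHERE a.ctid < b.ctid
      AND a.txn_id = b.txn_id
      AND a.transaction_date = b.transaction_date;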

Regards
Lok

Any thoughts, based on the above usage pattern?

I did the math based on the current stats with a limited data set, and the partition size comes out to 100-150GB per daily range partition, as I explained above. Should we make the partitions hourly then?

So I would like some expert views on whether this is going to work fine for a read-latency-sensitive application, or whether we should take a different strategy.
