34.4. Capacity Planning #

When deploying the built-in analytical platform, sufficient computational resources must be allocated for executing analytical queries. The amount of required resources is determined by the following storage parameters:

This section provides guidelines for estimating the computational resources required for different OLAP workloads. After deploying the built-in analytical platform, monitor the available computational resources and be prepared to allocate additional resources outside the scope of this estimation if required.

For capacity planning, you can also use the results of benchmark tests performed by Postgres Pro. For more information, see Benchmark Test Results.

34.4.1. Total Amount of OLAP Data in the Storage #

Sufficient space must be allocated in a local, network, or S3 storage for Parquet files used in analytical queries. Parquet files are usually stored in a compressed format, such as Snappy or zstd, and constitute approximately 30% of the primary unstructured text data or Postgres Pro table data.

The storage throughput must be sufficient to sustain all read and write analytical queries at peak workloads.

34.4.2. Amount of Data Processed by a Standard Analytical Query #

Certain analytical queries require a large amount of RAM to be executed effectively, particularly:

  • Joining two large tables

  • Aggregations over a large number of unique keys

  • Sorting large tables

If the pgpro_duckdb extension lacks RAM for executing an analytical query, the associated data is placed in temporary files, which significantly increases the execution time. To maximize the total Postgres Pro cluster throughput and the number of analytical queries executed per unit of time, process most analytical queries in RAM.

However, temporary files can be used when processing resource-intensive analytical queries with long execution time (hours).

Monitor the amount of temporary files created by the pgpro_duckdb extension every 24 hours and avoid peak values. Temporary files are located in the PGDATA/.tmp directory.

The amount of RAM required for a specific OLAP workload depends on the number and size of analytical queries in the Postgres Pro Enterprise instance. Monitor the amount of available RAM and allocate additional RAM if required. For the initial deployment, allocate approximately 6 GB of RAM per 1 CPU core reserved for executing analytical queries.

34.4.3. Required Analytical Query Execution Time #

Certain analytical queries must complete within a specific time limit. In particular, this applies to interactive queries and queries with results that must be available by a certain date and time, for example, at the start of business hours. If sufficient RAM is allocated for executing analytical queries, and temporary files are not created for them, the execution time is determined by the number of CPU cores reserved for executing analytical queries.

Use internal benchmark tests to estimate the required number of CPU cores. During these tests, time-sensitive analytical queries are executed to the test data, the size and structure of which mimic the real data that will be used.

For the initial deployment, reserve approximately 4 CPU cores per analytical query, i.e., per analyst.

34.4.4. Number of Analytical Queries Executed per Unit of Time #

After allocating the computational resources required for executing analytical queries, calculate the total workload from analytical queries executed in parallel. This workload tends to peak once during business hours or when performing resource-intensive ETL tasks.

Certain ETL tasks can be scheduled to be performed at a specific time of day, whereas others — such as ad-hoc analytical queries — cannot.

Avoid OLAP workload peaks by evenly distributing the execution of analytical queries throughout the day and leaving a 30% margin for unexpected tasks.

The pgpro_duckdb extension processes analytical queries linearly as long as it has sufficient I/O resources, RAM, and CPUs. Once these resources are exhausted, the performance drops.

Recommendations for avoiding resource exhaustion:

  • Eliminate bottlenecks in the Postgres Pro cluster.

  • Reduce OLAP workload peaks.

  • Optimize the execution of analytical queries.

  • Redistribute ETL tasks and ad-hoc analytical queries to a dedicated Postgres Pro server, for example asynchronous standby replica.