34.5. Storage for Analytical Files #

The pgpro_duckdb extension allows executing analytical queries on OLAP data stored in Postgres Pro tables or a standalone storage. In most cases, a standalone storage is used due to the following reasons:

  • OLAP data is saved to a standalone storage in the columnar format optimized for executing analytical queries.

  • Standalone storage allows isolating I/O of analytical queries from OLTP workload.

  • Network storages allow storing large amounts of historical data and provide sufficient performance for executing resource-intensive analytical queries.

  • Shared storages can be used by standby servers improving scalability and enabling workload isolation.

Note

Currently, it is not recommended to use local storages in the main deployment scenario since the pgpro_duckdb extension requires membership in a global role that grants access under the postgres user.

However, a local storage can be used in a single Postgres Pro Enterprise instance with a dedicated analyst, where data is easily restored.

34.5.1. Parquet Files #

The pgpro_duckdb extension stores OLAP data as Parquet files. This columnar format supports different compression methods and includes basic data statistics. The stored data statistics can be used for performing filter and projection pushdown to reduce the amount of data being read.

34.5.2. Directory Structure #

Any directory structure can be used for storing OLAP data, for example:

  • In a local storage:

            rootpath/db_name/schema_name/table_name
    
  • In an S3 storage:

            s3://bucket/db_name/schema_name/table_name
    

The pgpro_duckdb extension allows automatically exporting OLAP data to multiple Parquet files and adding a unique number to each file name. It is easier to store OLAP data as multiple Parquet files of the same size.

You can also use Hive partitioning to organize OLAP data by partition keys in a directory hierarchy:

    table_name
    ├── year=2024
    │    ├── month=1
    │    │   ├── file1.parquet
    │    │   └── file2.parquet
    │    └── month=2
    │        └── file3.parquet
    └── year=2025
        ├── month=11
        │   ├── file4.parquet
        │   └── file5.parquet
        └── month=12
            └── file6.parquet

This type of hierarchy can be useful for large historical tables when analytical queries require data associated with a specific subset of partition keys.

Performing filter pushdown on the path level is also supported. This allows skipping paths that do not contain required OLAP data when reading.