34.7. Using the Built-in Analytical Platform #

34.7.1. Getting Started #

To start working with the built-in analytical platform, perform the following steps:

  1. Read the information about considerations and limitations and ensure that the pgpro_duckdb extension suits the internal requirements of your organization and your analytical query workload profile.

  2. Define the optimal deployment scheme. For more information, see Deployment Recommendations.

  3. Install the pgpro_duckdb extension.

  4. Allocate the required resources. For more information, see Capacity Planning.

  5. Define the security model for accessing OLAP resources. For more information, see Configuring Security for Accessing OLAP Resources.

  6. Limit resources for analytical queries considering OLTP workload. For more information, see Configuring Computational Resource Limits.

  7. Provide pgpro_duckdb users with the required access rights.

  8. Export the required OLAP data to the columnar format and move it to the corresponding storage. For more information, see Exporting OLAP Data to a Storage.

  9. Create views for the exported Parquet files. For more information, see Creating a View for Accessing Parquet Files.

  10. Ensure that analytical queries can be executed based on the created views. A quick smoke test is shown below.

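For example, a quick way to confirm that pgpro_duckdb query execution works for the current role is to run a trivial query through the duckdb.query function (a minimal smoke test; it assumes the role has already been allowed to call pgpro_duckdb functions):

      postgres=> SELECT * FROM duckdb.query('SELECT 42 AS answer');
       answer
      --------
           42
      (1 row)
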
34.7.2. Configuring Security for Accessing OLAP Resources #

To configure security for accessing OLAP resources, configure role access and storages.

34.7.2.1. Configuring Role Access #

You can use the duckdb.postgres_role configuration parameter to specify the name of the role that can call pgpro_duckdb functions:

      postgres=# show duckdb.postgres_role;
       duckdb.postgres_role
      ----------------------

      (1 row)

By default, the duckdb.postgres_role configuration parameter is set to NULL, meaning only a superuser can call pgpro_duckdb functions:

      postgres=> SELECT usename, usesuper FROM pg_user WHERE usename = current_user;
       usename | usesuper
      ---------+----------
       user1   | f
      (1 row)

      postgres=> SELECT * FROM duckdb.query('SELECT * FROM a;');
      ERROR:  DuckDB execution is not allowed because you have not been granted the duckdb.postgres_role

To allow the members of a role to call pgpro_duckdb functions, specify the name of the required role and restart the server:

      postgres=# ALTER SYSTEM SET duckdb.postgres_role TO 'user1';
      ALTER SYSTEM
      -- after server restart --
      postgres=# show duckdb.postgres_role;
       duckdb.postgres_role
      ----------------------
       user1
      (1 row)
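All members of the specified role can then call pgpro_duckdb functions. For example, to allow an additional user, it is enough to grant it membership in that role (an illustrative example with hypothetical role names):

      postgres=# GRANT user1 TO analyst1;
      GRANT ROLE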

34.7.2.2. Configuring Storages #

You can use the duckdb.disabled_filesystems configuration parameter to specify the storages that will be disabled for the pgpro_duckdb extension.

Possible values:

  • LocalFileSystem: Local storage

  • S3FileSystem: S3 storage

  • HTTPFileSystem: Network storage

By default, the local storage is disabled:

      postgres=> show duckdb.disabled_filesystems;
       duckdb.disabled_filesystems
      -----------------------------
       LocalFileSystem
      (1 row)

Note

Before enabling a local storage or an S3 storage, see Considerations for Working with a Local Storage and Considerations for Working with an S3 Storage.

To specify multiple storages, separate their names with commas without spaces:

      postgres=# ALTER SYSTEM SET duckdb.disabled_filesystems TO 'LocalFileSystem,HTTPFileSystem';
      ALTER SYSTEM
      -- after server restart --
      postgres=# show duckdb.disabled_filesystems;
        duckdb.disabled_filesystems
      --------------------------------
       LocalFileSystem,HTTPFileSystem
      (1 row)
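To enable a storage again, remove its name from the parameter value; an empty value enables all storages. For example (review the considerations below before enabling the local or S3 storage):

      postgres=# ALTER SYSTEM SET duckdb.disabled_filesystems TO '';
      ALTER SYSTEM
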
34.7.2.2.1. Considerations for Working with a Local Storage #

The local storage is disabled by default, which prevents users from exporting files to it or reading files from it:

        postgres=> show duckdb.disabled_filesystems;
         duckdb.disabled_filesystems
        -----------------------------
         LocalFileSystem
        (1 row)

        postgres=> COPY test TO 'out.parquet';
        ERROR:  (PGDuckDB/DuckdbUtilityHook_Cpp) Permission Error: File system LocalFileSystem has been disabled by configuration
        postgres=> SELECT * FROM read_parquet('test.parquet');
        ERROR:  (PGDuckDB/CreatePlan) Prepared query returned an error: 'Permission Error: File system LocalFileSystem has been disabled by configuration

If the local storage is enabled, members of the role specified in the duckdb.postgres_role configuration parameter have the same privileges within this storage as the postgres user. This is required for reading from and writing to Parquet files:

        postgres=> show duckdb.disabled_filesystems;
         duckdb.disabled_filesystems
        -----------------------------

        (1 row)

        postgres=> COPY test TO 'out.parquet';
        COPY 7
        postgres=> SELECT * FROM read_parquet('out.parquet');
         a | b
        ---+---
         1 | 2
        ...
        (7 rows)

However, in this case, users can also access files in the PGDATA directory:

        postgres=> show duckdb.disabled_filesystems ;
         duckdb.disabled_filesystems
        -----------------------------

        (1 row)

        postgres=> SELECT pg_read_file('postgresql.auto.conf');
        ERROR:  permission denied for function pg_read_file

        postgres=> SELECT * FROM duckdb.query('SELECT content FROM read_text(''postgresql.auto.conf'')');
                              content
        -------------------------------------------------------
        # Do not edit this file manually!                    +
        # It will be overwritten by the ALTER SYSTEM command.+

However, if the local storage is disabled, this access is blocked:

        postgres=> show duckdb.disabled_filesystems ;
         duckdb.disabled_filesystems
        -----------------------------
         LocalFileSystem
        (1 row)

        postgres=> SELECT * FROM duckdb.query('SELECT content FROM read_text(''postgresql.auto.conf'')');
        ERROR:  (PGDuckDB/CreatePlan) Prepared query returned an error: 'Permission Error: File system LocalFileSystem has been disabled by configuration

Recommendations for using the pgpro_duckdb extension with a local storage:

  • Use the local storage only if you trust the users that are allowed to call pgpro_duckdb functions.

  • At the operating system level, isolate Postgres Pro processes from any files other than those in the PGDATA directory and the files that are intended to be read or written by pgpro_duckdb.

34.7.2.2.2. Considerations for Working with an S3 Storage #

To work with an S3 storage, specify the required connection information. For example, if you have an instance of the MinIO S3 storage deployed, perform the following steps under the postgres user:

  1. Create a server for the S3 storage:

                postgres=# CREATE SERVER my_secret1
                TYPE 's3'
                FOREIGN DATA WRAPPER duckdb
                OPTIONS (
                    endpoint '127.0.0.1:9000',
                    url_style 'path',
                    use_ssl 'FALSE'
                );
                CREATE SERVER
    
  2. Create a user mapping for the user that will work with the S3 storage:

                postgres=# CREATE USER MAPPING FOR user1 SERVER my_secret1
                OPTIONS (KEY_ID 'minioadmin', SECRET 'minioadmin');
                CREATE USER MAPPING
    

The specified user now has the required privileges for working with the S3 storage:

        postgres=> SELECT * FROM read_parquet('s3://bucket1/results.parquet');
         a | b
        ---+---
         1 | 2
         2 |
        (2 rows)

        postgres=> COPY a TO 's3://bucket1/a.parquet';
        COPY 8

If you create a user mapping for the user role specified in the duckdb.postgres_role configuration parameter, all members of this role will be able to work with the S3 storage.

To revoke a user's access to the S3 storage, delete the corresponding user mapping:

        postgres=# DROP USER MAPPING FOR user1 SERVER my_secret1;
        DROP USER MAPPING

To revoke access to the S3 storage for all users, disable it:

        postgres=> show duckdb.disabled_filesystems ;
          duckdb.disabled_filesystems
        ------------------------------
          LocalFileSystem,S3FileSystem
        (1 row)

        postgres=> SELECT * FROM read_parquet('s3://bucket1/results.parquet');
        ERROR:  (PGDuckDB/CreatePlan) Prepared query returned an error: 'Permission Error: File system S3FileSystem has been disabled by configuration
        postgres=> COPY a TO 's3://bucket1/a.parquet';
        ERROR:  (PGDuckDB/DuckdbUtilityHook_Cpp) Permission Error: File system S3FileSystem has been disabled by configuration
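For reference, the state shown above can be configured in the same way as described in Configuring Storages:

        postgres=# ALTER SYSTEM SET duckdb.disabled_filesystems TO 'LocalFileSystem,S3FileSystem';
        ALTER SYSTEM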

34.7.3. Exporting OLAP Data to a Storage #

Considerations for exporting OLAP data to a storage:

  • The zstd compression method with the default compression level provides an optimal ratio of file size to operation speed.

  • You can specify the file size using the FILE_SIZE_BYTES parameter. The value must be in the range from 100 MB to 10 GB.

  • You can leave the ROW_GROUP_SIZE parameter unchanged.

34.7.3.1. Exporting OLAP Data to the Columnar Format #

OLAP data is exported using the COPY command. You can export an entire Postgres Pro table:

      
        COPY tbl TO 'output.parquet' (FORMAT parquet);

Or only the required OLAP data:

      
        COPY (SELECT * FROM tbl) TO 'output.parquet' (FORMAT parquet);

You can also follow the recommended directory structure when exporting to Parquet files:

      
        COPY (SELECT * FROM tbl) TO '/rootpath/db_name/schema_name/table_name'
        (FORMAT parquet, COMPRESSION zstd, FILE_SIZE_BYTES 1g);

34.7.3.2. Exporting OLAP Data to an S3 Storage #

To export OLAP data to an S3 storage, specify the required connection information. For example, if you have an instance of the MinIO S3 storage deployed, perform the following steps under the postgres user:

  1. Create a server for the S3 storage:

              
                postgres=# CREATE SERVER my_secret1
                TYPE 's3'
                FOREIGN DATA WRAPPER duckdb
                OPTIONS (
                    endpoint '127.0.0.1:9000',
                    url_style 'path',
                    use_ssl 'FALSE'
                );
                CREATE SERVER

  2. Create a user mapping for the user that will work with the S3 storage:

              
                postgres=# CREATE USER MAPPING FOR user1 SERVER my_secret1
                OPTIONS (KEY_ID 'minioadmin', SECRET 'minioadmin');
                CREATE USER MAPPING

  3. Execute the COPY command under the required user:

              
                COPY (SELECT * FROM tbl) TO 's3://bucket/db_name/schema_name/table_name'
                (FORMAT parquet, COMPRESSION zstd, FILE_SIZE_BYTES 1g);

34.7.3.3. Partitioning Support #

To organize OLAP data by partition keys within the Postgres Pro table directory using Hive partitioning, specify the PARTITION_BY parameter when exporting data:

      
        COPY (SELECT * FROM tbl) TO 's3://bucket/db_name/schema_name/table_name'
        (FORMAT parquet, PARTITION_BY (year, month), COMPRESSION zstd);

Since writing a large number of Parquet files is a resource-intensive OLAP operation, select partition keys so that the size of each resulting partition is at least 100 MB. The FILE_SIZE_BYTES parameter cannot be used together with the PARTITION_BY parameter. The number of Parquet files created in each directory (partition) corresponds to the number of pgpro_duckdb threads.

For more information about the COPY command parameters and Parquet format, see the official DuckDB documentation on statements and Parquet files.

34.7.3.4. Transactions and Performance #

The COPY command is executed within a transaction, so the snapshot of the exported Postgres Pro table is consistent. To export multiple tables consistently, execute the COPY commands one after another within an explicit transaction.

Open transactions delay vacuum, which can negatively impact performance and increase the size of tables. To avoid this, consider the following approaches:

  • If you have available computational resources, execute the COPY command for each table in a separate session. In this case, attach all sessions to a single transaction snapshot using the SET TRANSACTION SNAPSHOT command (see the sketch after this list).

  • Append-only tables usually have a column with monotonically increasing values. You can split the values into ranges and export them in different transactions.
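A minimal sketch of exporting several tables from multiple sessions using a single snapshot (the snapshot identifier and table names are illustrative; the exporting session must keep its transaction open while the snapshot is in use):

        -- Session 1: export a snapshot and keep the transaction open
        BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
        SELECT pg_export_snapshot();   -- returns, for example, '00000003-0000001B-1'

        -- Session 2 (and further sessions): attach to the same snapshot and export a table
        BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
        SET TRANSACTION SNAPSHOT '00000003-0000001B-1';
        COPY (SELECT * FROM tbl1) TO 's3://bucket/db_name/schema_name/tbl1' (FORMAT parquet, COMPRESSION zstd);
        COMMIT;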

Note that you can specify the number of postgres processes for reading tables and the number of pgpro_duckdb threads for exporting OLAP data to Parquet files. However, the conversion from the columnar format to the row format is currently performed in a single thread.

34.7.3.5. Export Errors #

If an error occurs while exporting OLAP data to Parquet files, for example, due to insufficient disk space, partially exported files are not deleted. For such cases, set up automatic deletion of incomplete files and a procedure for resuming the export.

34.7.4. Creating a View for Accessing Parquet Files #

The pgpro_duckdb extension uses the read_parquet function for reading Parquet files. This function returns a special data type, duckdb.row, which provides compatibility between DuckDB and Postgres Pro. The read_parquet function uses a dedicated syntax for accessing columns:

    
      SELECT r['id'], r['name'] FROM read_parquet('file.parquet') r WHERE r['age'] > 21;

34.7.4.1. Creating a View for Parquet Files #

After exporting OLAP data to Parquet files, you can work with them as you would with standard Postgres Pro tables. To do this, create a view for Parquet files:

      
        CREATE VIEW parquet_table AS
        SELECT
                r['id']::BIGINT AS id,
                r['age']::SMALLINT AS age,
                r['name']::VARCHAR AS name
        FROM read_parquet('file.parquet') AS r;
      
     

In this example, the data types used in the Parquet file are specified explicitly. This is optional, but it allows you to cast columns to the required DuckDB types; otherwise, the metadata from the Parquet files is used. If data types are not specified explicitly, the view description includes the special duckdb.unresolved_type data type.
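For comparison, a view defined without explicit casts might look like this (an illustrative sketch; the file name is hypothetical):

        CREATE VIEW parquet_table_raw AS
        SELECT
                r['id'] AS id,
                r['age'] AS age,
                r['name'] AS name
        FROM read_parquet('file.parquet') AS r;

In this case, the id, age, and name columns appear in the view description with the duckdb.unresolved_type data type.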

Once the view is created, standard analytical queries can be executed:

      
        SELECT id, name FROM parquet_table r WHERE age > 21;
      
     

34.7.4.2. Accessing Parquet Files Using a Glob Pattern #

The read_parquet function can take a list of Parquet file names. It also supports the glob pattern syntax and lists of glob patterns. Glob patterns can be used to read a directory of Parquet files, or multiple directories if Hive partitioning is used.
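For illustration, several files or an entire export directory might be read as follows (hedged examples; the paths are hypothetical, and the last pattern assumes a Hive-partitioned directory layout):

      SELECT r['id'] FROM read_parquet(ARRAY['file1.parquet', 'file2.parquet']) r;
      SELECT r['id'] FROM read_parquet('/rootpath/db_name/schema_name/table_name/*.parquet') r;
      SELECT r['id'] FROM read_parquet('s3://bucket/db_name/schema_name/table_name/*/*.parquet') r;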

For more information, see the official DuckDB documentation on Parquet files.

34.7.4.3. Performance Optimization for Parquet Files #

The pgpro_duckdb extension includes the following performance optimizations for Parquet files:

  • Projection pushdown: Excludes columns that are not specified in analytical queries.

  • Filter pushdown: Uses Parquet file metadata to read only the required rows.

  • Hive partitioning: Reduces the amount of OLAP data being read. You can organize OLAP data by partition keys in a directory hierarchy and use partition keys in analytical queries. In this case, only the directories that correspond to the specified partition keys will be read.

For more information, see the official DuckDB documentation on partial reading and filter pushdown.

34.7.5. Configuring Computational Resource Limits #

To avoid negative impact on OLTP workloads, it is important to limit the consumption of computational resources by analytical sessions.

34.7.5.1. Limiting Threads and Memory Consumption #

The pgpro_duckdb extension allows you to limit the number of threads and the memory consumption of analytical sessions using dedicated configuration parameters.

The actual memory consumption observed at the operating system level can be higher; for example, the memory allocated for query results is not counted against the configured limits. For more information, see the official DuckDB documentation on resource management.
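For illustration, cluster-wide limits might be set as follows (a sketch that assumes the duckdb.memory_limit and duckdb.threads configuration parameters; check the parameter reference of your pgpro_duckdb version for the exact names and whether a restart is required):

      postgres=# ALTER SYSTEM SET duckdb.memory_limit TO '4GB';
      ALTER SYSTEM
      postgres=# ALTER SYSTEM SET duckdb.threads TO 4;
      ALTER SYSTEM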

34.7.5.2. Disk Spilling #

OLAP data is processed by the pgpro_duckdb extension in streams, i.e., part by part, avoiding full materialization when possible. If the pgpro_duckdb extension lacks RAM for executing an analytical query, the associated data is placed in temporary files, which significantly increases the execution time.

Temporary files are located in the PGDATA/.tmp directory and can occupy up to 90% of the available disk space. Monitor the volume of temporary files created by the pgpro_duckdb extension over each 24-hour period and avoid peak values.
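One way to check the current volume of these temporary files from SQL (a minimal sketch based on standard administration functions; it requires appropriate privileges and assumes the PGDATA/.tmp location described above):

      postgres=# SELECT pg_size_pretty(sum((pg_stat_file('.tmp/' || name)).size)) AS tmp_size
                 FROM pg_ls_dir('.tmp', true, false) AS name;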

34.7.5.3. Simultaneous OLAP and OLTP Workloads Synthetic Test Results #

The CH-benCHmark methodology was used to evaluate performance. It includes measuring the average number of transactions per second (TPS) during a specific time period, as well as the number of analytical queries executed in parallel. In this case, the nature of analytical queries is not important, as they are executed in a random order, and the execution time of a single analytical query is much less than the duration of the test run.

The OLTP workload was generated by the pgbench program, and analytical ClickBench queries were run in the background by a script. In the CH-benCHmark file, the analytical TPC-H queries were replaced with ClickBench queries. The citus-benchmark implementation was used as the starting point.

Test details:

  • Server with 64 vCPUs and 128 GB RAM

  • Data size: 14 GB each for pgbench and ClickBench datasets, processed in RAM

  • Test duration: 100 seconds

  • Metrics: TPS, number of analytical queries, and load average

34.7.5.3.1. Test 1 #

For this test, a single ClickBench process with 4 threads was initialized. The number of pgbench -S (select-only) connections varied for each test run.

Table 34.2. Test 1 Results

 Number of Connections | TPS / Analytical Queries (Load Average)
-----------------------+------------------------------------------
 25                    | 287/39 (23.8)
 50                    | 487/39 (42)
 100                   | 532/38 (68)
 200                   | 610/34 (158)


34.7.5.3.2. Test 2 #

For this test, 200 pgbench -S (select-only) connections were initialized. The number of ClickBench processes with 2 threads each varied for each test run.

Table 34.3. Test 2 Results

 Number of ClickBench Processes | TPS / Analytical Queries (Load Average)
--------------------------------+------------------------------------------
 2                              | 574/49 (147)
 4                              | 533/66 (156)
 8                              | 522/127 (170)


34.7.5.3.3. Test 3 #

For this test, 200 pgbench TPC-B connections were initialized. The number of threads varied for each test run.

Table 34.4. Test 3 Results

 Number of Threads | TPS / Analytical Queries (Load Average)
-------------------+------------------------------------------
 0                 | 53
 1                 | 42/17 (66)
 4                 | 42/39 (69)
 Not limited       | 42/108 (86)


34.7.5.4. Simultaneous OLAP and OLTP Workload #

Recommendations for simultaneous OLAP and OLTP workload:

  • If OLTP sessions use a large number of CPU cores, you can run a limited number of OLAP threads without negatively impacting the OLTP workload. This is ensured by the OS scheduler, which distributes CPU time evenly across threads.

  • Limit the number of threads and memory consumption by analytical sessions using pgpro_duckdb configuration parameters.

  • Store Parquet files and the temporary file directory on separate physical storage to avoid I/O contention.

34.7.6. Basic Scenario of an Analyst's Work #

This section describes a basic work scenario for an analyst, assuming the following prerequisites are met:

  • Postgres Pro is deployed with the pgpro_duckdb extension installed.

  • An instance of the MinIO S3 storage is deployed and contains Parquet files.

  • The DBeaver application is installed and connected to the database.

The basic scenario includes the following steps:

  1. Configuring the pgpro_duckdb extension with the MinIO S3 storage. The administrator creates a server for the S3 storage and a user mapping for the user that will work with the S3 storage. The commands are executed under the postgres user:

              postgres=# CREATE SERVER my_secret1
              TYPE 's3'
              FOREIGN DATA WRAPPER duckdb
              OPTIONS (
                  endpoint '127.0.0.1:9000',
                  url_style 'path',
                  use_ssl 'FALSE'
              );
              CREATE SERVER
    
              postgres=# CREATE USER MAPPING FOR user1 SERVER my_secret1
              OPTIONS (KEY_ID 'minioadmin', SECRET 'minioadmin');
              CREATE USER MAPPING
    
  2. Creating a view for accessing Parquet files:

              postgres=# CREATE VIEW orders_view AS
              SELECT
                  r['o_orderkey']::bigint AS o_orderkey,
                  r['o_custkey']::bigint AS o_custkey,
                  r['o_orderstatus']::char AS o_orderstatus,
                  r['o_totalprice']::double precision AS o_totalprice,
                  r['o_orderdate']::date AS o_orderdate,
                  r['o_orderpriority']::text AS o_orderpriority,
                  r['o_clerk']::text AS o_clerk,
                  r['o_shippriority']::bigint AS o_shippriority,
                  r['o_comment']::text AS o_comment
              FROM read_parquet('s3://bucket1/orders/orders.parquet') r;
              CREATE VIEW
    

    In this query, the required columns are selected from the Parquet file using the r['column_name'] syntax, and data types are specified with the r['column_name']::type cast. In addition, column aliases and the r alias for the read_parquet function call are defined for convenience.

    Now, the analyst can work with the created view:

              test=# SELECT o_orderkey, o_totalprice FROM orders_view LIMIT 3;
               o_orderkey | o_totalprice
              ------------+--------------
                        1 |    224560.83
                        2 |     75388.65
                        3 |    255287.36
              (3 rows)
    
              test=# SELECT AVG(o_totalprice) FROM orders_view ;
                      AVG
              -------------------
               151121.1229494574
              (1 row)
    
  3. Working with the created view using the DBeaver application. The analyst can connect to the database, access the created view, and execute analytical queries. The DBeaver application displays column types specified at view creation.