G.3. pgpro_duckdb — DuckDB columnar-vectorized analytics engine and features #

pgpro_duckdb is a Postgres Pro extension that embeds the DuckDB columnar-vectorized analytics engine and its features into Postgres Pro. pgpro_duckdb is the key component of the Built-in Analytical Platform, a Postgres Pro solution designed to handle OLAP workloads within a Postgres Pro instance.

Note

The pgpro_duckdb extension is currently in an experimental phase.

G.3.1. Installation #

The pgpro_duckdb extension is provided with Postgres Pro Enterprise as a separate pre-built package pgpro-duckdb-ent-17 (for the detailed installation instructions, see Chapter 17). To enable the pgpro_duckdb extension, create it using the following query:

CREATE EXTENSION pgpro_duckdb;

By default, known extensions are automatically installed and loaded when a DuckDB query depends on them. This behavior can be configured using the duckdb.autoinstall_known_extensions and duckdb.autoload_known_extensions settings.
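For example, a superuser could disable automatic installation for the current session while keeping automatic loading (a minimal sketch; the access levels of these settings are described in Section G.3.7.1):

SET duckdb.autoinstall_known_extensions = false;
SET duckdb.autoload_known_extensions = true;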

It is also possible to install an extension manually. This can be useful when the autoinstall/autoload behavior is disabled, or when DuckDB fails to recognize that an extension is necessary to execute the query. Installing an extension requires superuser rights.

SELECT duckdb.install_extension('extname');

Installing an extension causes it to be loaded and installed globally for any connection that uses DuckDB. The current list of installed extensions is maintained in the duckdb.extensions table. Superusers can use this table to view, disable, or uninstall extensions, as follows:

-- install an extension --
SELECT duckdb.install_extension('iceberg');
-- view currently installed extensions --
SELECT * FROM duckdb.extensions;
-- disable or enable an extension --
UPDATE duckdb.extensions SET enabled = (false|true) WHERE name = 'iceberg';
-- remove an extension --
DELETE FROM duckdb.extensions WHERE name = 'iceberg';
-- you can also install community extensions --
SELECT duckdb.install_extension('duckpgq', 'community');

There is currently no practical difference between a disabled and uninstalled extension.

G.3.2. Supported Extensions #

The following extensions are supported by pgpro_duckdb.

iceberg

iceberg support adds functions to read iceberg tables and metadata. For a list of iceberg functions, see Section G.3.8.

delta

delta support adds the ability to read Delta Lake files via delta_scan.

G.3.3. Secrets #

DuckDB secrets can be configured in the duckdb.secrets table. For example:

-- session token is optional --
INSERT INTO duckdb.secrets
(type, key_id, secret, session_token, region)
VALUES ('S3', 'access_key_id', 'secret_access_key', 'session_token', 'us-east-1');

G.3.3.1. Columns #

Name Type Required Description
name text no Automatically generated UUID (primary key)
type text yes One of the following: S3 for Amazon S3, GCS for Google Cloud Storage, R2 for Cloudflare R2, or Azure for Azure Blob Storage.
key_id text S3, GCS, R2 The ID portion of the secret
secret text S3, GCS, R2 The password portion of the secret
session_token text no The AWS S3 session token if required for your credential
region text S3 only For AWS S3, this specifies the region of your bucket
endpoint text no If using an S3-compatible service other than AWS, this specifies the endpoint of the service
r2_account_id text R2 only If using Cloudflare R2, the account ID for the credential
use_ssl boolean no true by default; false is principally for use with custom minio configurations
scope text no The URL prefix which applies to this credential. This is used to select between multiple credentials for the same service
connection_string text Azure only Connection string for Azure
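For example, a minimal sketch of a secret for an S3-compatible service such as MinIO; the endpoint, credentials, and scope below are hypothetical:

-- S3-compatible endpoint without SSL, scoped to a single bucket --
INSERT INTO duckdb.secrets (type, key_id, secret, endpoint, use_ssl, scope)
VALUES ('S3', 'minio_key', 'minio_secret', 'localhost:9000', false, 's3://my-bucket/');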

G.3.3.2. How It Works #

Secrets are stored in a Postgres Pro heap table. Each time a DuckDB instance is created by pgpro_duckdb, and when a secret is modified, the secrets are loaded into the DuckDB secrets manager as non-persistent secrets.

G.3.3.3. Caveats #

  • Only the secret types listed above are currently supported. As of DuckDB 1.1.3, Postgres Pro secrets are not supported.

  • Other authentication providers are not yet supported, e.g. CHAIN.

G.3.4. Transactions in pgpro_duckdb #

Multi-statement transactions are supported in pgpro_duckdb. There is one important restriction, though, which is currently necessary to ensure the expected ACID guarantees: you cannot write to both a Postgres Pro table and a DuckDB table in the same transaction.
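For example, a transaction like the following sketch is rejected with an error (assuming duckdb_table is a DuckDB table and pg_table is a regular Postgres Pro table):

BEGIN;
INSERT INTO duckdb_table VALUES (1);
-- error: the transaction would write to both a DuckDB table and a Postgres Pro table --
INSERT INTO pg_table VALUES (1);
ROLLBACK;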

Similarly, you are allowed to run DDL (like CREATE/DROP TABLE) on DuckDB tables inside a transaction, but you cannot combine such statements with DDL involving Postgres Pro objects.

Finally, it's possible to disable this restriction completely and allow writes to both DuckDB and Postgres Pro in the same transaction by setting duckdb.unsafe_allow_mixed_transactions to true, but doing so is not recommended: the transaction might be committed only in DuckDB, but not in Postgres Pro. This can lead to inconsistencies and even data loss. For example, the following code might result in deleting duckdb_table without copying its contents to pg_table:

BEGIN;
SET LOCAL duckdb.unsafe_allow_mixed_transactions TO true;
CREATE TABLE pg_table AS SELECT * FROM duckdb_table;
DROP TABLE duckdb_table;
COMMIT;

G.3.5. Types #

pgpro_duckdb can read many data types that exist in both Postgres Pro and DuckDB. The following data types are currently supported for use in queries:

  • Integer types (integer, bigint, etc.)

  • Floating point types (real, double precision)

  • numeric (might get converted to double precision internally, see Section G.3.5.1 limitations below for details)

  • text/varchar/bpchar

  • bit-related types, including both fixed-size and variable-size bit arrays

  • bytea/blob

  • timestamp/timestamptz/date/interval/timestamp_ns/timestamp_ms/timestamp_s

  • boolean

  • uuid

  • json/jsonb

  • domain

  • arrays for all of the above types, but see limitations below about multi-dimensional arrays

G.3.5.1. Known Limitations #

The type support in pgpro_duckdb is not yet complete. The following are known issues that you might run into:

  1. enum types are not supported.

  2. The DuckDB decimal type does not support the wide range of values that the Postgres Pro numeric type does. To avoid errors when converting between the two, numeric is converted to double precision internally if DuckDB does not support the required precision. This might cause precision loss of the values.

  3. The DuckDB STRUCT type is not supported.

  4. The DuckDB timestamp_ns type gets truncated to microseconds when it is converted to the Postgres Pro timestamp type, which loses precision in the output. Operations on a timestamp_ns value, such as sorting/grouping/comparing, will use the full precision.

  5. jsonb columns are converted to json columns when reading from DuckDB. This is because DuckDB does not have a jsonb type.

  6. Many Postgres Pro json and jsonb functions and operators are not implemented in DuckDB. Instead you can use DuckDB json functions and operators. See the DuckDB documentation for more information on these functions.

  7. The DuckDB tinyint type is converted to a char type in Postgres Pro. This is because Postgres Pro does not have a tinyint type. This causes it to be displayed as a hex code instead of a regular number.

  8. Conversion between Postgres Pro multi-dimensional arrays and nested DuckDB LISTs can run into various problems, because each database allows structures that the other does not. Specifically, Postgres Pro allows different arrays in the same column to have a different number of dimensions, e.g. [1] and [[1], [2]] can both occur in the same column; in DuckDB this is not allowed, i.e. the amount of nesting should always be the same. On the other hand, DuckDB allows different lists at the same nesting level to contain a different number of elements, e.g. [[1], [1, 2]]; this is not allowed in Postgres Pro. So conversion between these types is only possible when the arrays conform to the subset supported by both databases. Another possible problem is that pgpro_duckdb uses the Postgres Pro column metadata to determine the number of dimensions that an array has. Since Postgres Pro does not complain when you add arrays of different dimensions, the number of dimensions in the column metadata may not match the actual number of dimensions. To solve this, you need to alter the column type:

    -- this configures the column to be a 3-dimensional array of text --
    ALTER TABLE s ALTER COLUMN a SET DATA TYPE text[][][];
    
  9. For domain types, the domain constraint check during an INSERT operation is performed by Postgres Pro rather than DuckDB. When a SELECT operation queries a column whose type is a domain, the value is converted to the corresponding base type and handled by DuckDB, as shown in the sketch below.
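    A minimal sketch of this behavior, using hypothetical names and the duckdb.force_execution setting described in Section G.3.7:

    -- the CHECK constraint is enforced by Postgres Pro on INSERT --
    CREATE DOMAIN positive_int AS integer CHECK (VALUE > 0);
    CREATE TABLE measurements (v positive_int);
    INSERT INTO measurements VALUES (42);
    -- during DuckDB execution, v is handled as its base type integer --
    SET duckdb.force_execution = true;
    SELECT v FROM measurements;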

G.3.6. Special Types #

pgpro_duckdb introduces a few special Postgres Pro types. You should not create these types explicitly and normally you do not need to know about their existence, but they might show up in error messages from Postgres Pro. These are explained below:

duckdb.row #

The duckdb.row type is returned by functions like read_parquet, read_csv, iceberg_scan, etc. Depending on the arguments of these functions, they can return rows with different columns and types. Postgres Pro does not support such functions well at this point in time, so for now a custom type is returned. To get the actual columns out of these rows, you have to use the square bracket indexing syntax, similarly to how you would get a field from a JSON object:

SELECT r['id'], r['name'] FROM read_parquet('file.parquet') r WHERE r['age'] > 21;

Using SELECT * will result in the columns of this row being expanded, so your query result will never have a column that has duckdb.row as its type:

SELECT * FROM read_parquet('file.parquet');

Due to limitations in Postgres Pro, there are some restrictions when using a function that returns a duckdb.row in a CTE or subquery. The main problem is that pgpro_duckdb cannot automatically assign useful aliases to the columns selected from the row. So while this query without a CTE/subquery returns the r['company'] column as company:

SELECT r['company']
FROM duckdb.query($$ SELECT 'DuckDB Labs' company $$) r;
--    company
-- ─────────────
--  DuckDB Labs

The same query in a subquery or CTE will return the column simply as r:

WITH mycte AS (
SELECT r['company']
FROM duckdb.query($$ SELECT 'DuckDB Labs' company $$) r
)
SELECT * FROM mycte;
--       r
-- ─────────────
--  DuckDB Labs

This is easy to work around by adding an explicit alias to the column in the CTE/subquery:

WITH mycte AS (
SELECT r['company'] AS company
FROM duckdb.query($$ SELECT 'DuckDB Labs' company $$) r
)
SELECT * FROM mycte;
--    company
-- ─────────────
--  DuckDB Labs

Another limitation that can be similarly confusing: when you use SELECT * inside the CTE/subquery and want to reference a specific column outside of it, you still need to use the r['colname'] syntax instead of simply colname. So while this works as expected:

WITH mycte AS (
SELECT *
FROM duckdb.query($$ SELECT 'DuckDB Labs' company $$) r
)
SELECT * FROM mycte;
--    company
-- ─────────────
--  DuckDB Labs

The following query will throw an error:

WITH mycte AS (
SELECT *
FROM duckdb.query($$ SELECT 'DuckDB Labs' company $$) r
)
SELECT * FROM mycte WHERE company = 'DuckDB Labs';
-- ERROR:  42703: column "company" does not exist
-- LINE 5: SELECT * FROM mycte WHERE company = 'DuckDB Labs';
--                                   ^
-- HINT:  If you use DuckDB functions like read_parquet, you need to use the r['colname'] syntax to use columns. If you're already doing that, maybe you forgot to give the function the r alias.

This is easy to work around by using the r['colname'] syntax like so:

WITH mycte AS (
SELECT *
FROM duckdb.query($$ SELECT 'DuckDB Labs' company $$) r
)
SELECT * FROM mycte WHERE r['company'] = 'DuckDB Labs';
--    company
-- ─────────────
--  DuckDB Labs

duckdb.unresolved_type #

The duckdb.unresolved_type type is used to make Postgres Pro understand an expression whose type is not known at query parse time. This is the type of any column extracted from a duckdb.row using the r['mycol'] syntax. Many operators and aggregates return duckdb.unresolved_type when one of their arguments is of type duckdb.unresolved_type, for instance r['age'] + 10.

Once the query gets executed by DuckDB, the actual type will be filled in by DuckDB. So, a query result will never contain a column that has duckdb.unresolved_type as its type. And generally you should not even realize that this type exists.

You might get errors that say that functions or operators do not exist for the duckdb.unresolved_type, such as:

ERROR:  function some_func(duckdb.unresolved_type) does not exist
LINE 6:  some_func(r['somecol']) as somecol

In such cases, a simple workaround is often to add an explicit cast to the type that the function accepts, such as some_func(r['somecol']::text) as somecol.
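For example, a minimal sketch of this workaround (the file name and column are hypothetical):

SELECT upper(r['name']::text) AS name
FROM read_parquet('file.parquet') r;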

duckdb.json #

The duckdb.json type is used as the argument type of DuckDB JSON functions. This type exists so that these functions can accept values of the json, jsonb, and duckdb.unresolved_type types.

G.3.7. Configuration Parameters #

Many of these settings are used to configure specific DuckDB settings.

duckdb.force_execution #

Force queries to use DuckDB execution. This is only necessary when a query accesses Postgres Pro tables only. As soon as you use a DuckDB-only feature, DuckDB execution is used automatically. DuckDB-only features are reading from DuckDB tables, using DuckDB functions (like read_parquet), and COPY to remote storage (s3://, etc.).

Default: false
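For example, a minimal sketch that forces a query touching only a regular Postgres Pro table (hypothetical name) to be executed by DuckDB:

SET duckdb.force_execution = true;
SELECT count(*) FROM my_heap_table;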

G.3.7.1. Security Settings #

duckdb.postgres_role #

Which Postgres Pro role should be allowed to use DuckDB execution and use the secrets. Defaults to superusers only. If this is configured, but the role does not exist when running CREATE EXTENSION pgpro_duckdb, it will be created automatically. This role will have access to DuckDB secrets.

Default: ""

Access: Needs to be in the postgresql.conf file and requires a restart

duckdb.disabled_filesystems #

Disable access to specific file systems. This setting only applies to non-superusers; superusers can always access all file systems. Unless you completely trust the user in duckdb.postgres_role, it is recommended to disable LocalFileSystem. Otherwise, they can trivially read and write any file on the machine that the Postgres Pro process can access.

Default: "LocalFileSystem"

Access: Superuser-only

duckdb.autoinstall_known_extensions #

Whether known extensions are allowed to be automatically installed when a DuckDB query depends on them.

Default: true

Access: Superuser-only

duckdb.autoload_known_extensions #

Whether known extensions are allowed to be automatically loaded when a DuckDB query depends on them.

Default: true

duckdb.allow_community_extensions #

Disable installing community extensions.

Default: false

Access: Superuser-only

duckdb.enable_external_access #

Allow DuckDB to access external storage (e.g., HTTP, S3). This setting is not yet tested very well, and disabling it may unintentionally break pgpro_duckdb functionality.

Default: true

Access: Superuser-only

G.3.7.2. Resource Management Settings #

Since any connection that uses DuckDB has its own DuckDB instance, these settings are per-connection. When using pgpro_duckdb in many concurrent connections, it can be a good idea to set some of these more conservatively than their defaults.
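For example, a minimal sketch of more conservative per-connection limits (the values are illustrative; both settings are superuser-only, as noted below):

SET duckdb.memory_limit = '2GB';
SET duckdb.threads = 2;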

duckdb.max_memory / duckdb.memory_limit #

The maximum memory DuckDB can use within a single Postgres Pro connection. This is somewhat comparable to the Postgres Pro work_mem setting. When set to an empty string, DuckDB's normal default is used, which is 80% of RAM.

Default: "4GB"

Access: Superuser-only

duckdb.threads / duckdb.worker_threads #

Maximum number of DuckDB threads per Postgres Pro connection. -1 means to use DuckDB's default, which is the number of CPU cores on the machine.

Default: -1

Access: Superuser-only

duckdb.max_workers_per_postgres_scan #

Maximum number of Postgres Pro workers used for a single Postgres Pro scan. This is similar to the Postgres Pro max_parallel_workers_per_gather setting.

Default: 2

Access: General

G.3.7.3. Developer Settings #

duckdb.allow_unsigned_extensions #

Allow DuckDB to load extensions with invalid or missing signatures. Mostly useful for development of DuckDB extensions.

Default: false

Access: Superuser-only

G.3.8. Functions #

By default, the functions listed below that are not shown with a schema are installed into public. You can choose to install these functions to an alternate location by running CREATE EXTENSION pgpro_duckdb WITH SCHEMA schema.

ALTER EXTENSION pgpro_duckdb WITH SCHEMA schema is not currently supported.
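For example, a minimal sketch that installs the functions into a dedicated schema (the schema name is hypothetical):

CREATE SCHEMA duckdb_fn;
CREATE EXTENSION pgpro_duckdb WITH SCHEMA duckdb_fn;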

G.3.8.1. Data Lake Functions #

read_parquet(path TEXT or TEXT[], optional_parameters) returns SETOF duckdb.row #

Reads a Parquet file, either from a remote storage (via httpfs) or a local storage.

This returns DuckDB rows; you can expand them using * or select specific columns using the r['mycol'] syntax. If you want to select specific columns, you should give the function call an easy alias, like r. For example:

SELECT * FROM read_parquet('file.parquet');
SELECT r['id'], r['name'] FROM read_parquet('file.parquet') r WHERE r['age'] > 21;
SELECT COUNT(*) FROM read_parquet('file.parquet');

Further information:

The following arguments are required:

Name Type Description
path text or text[] The path, either to the remote httpfs storage or the local storage (if enabled) with the Parquet files to read. The path can be a glob or array of files to read.

Optional parameters mirror DuckDB read_parquet function. To specify optional parameters, use parameter := 'value'.
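For example, a minimal sketch passing an optional parameter by name (the path is hypothetical, and it is assumed that the DuckDB read_parquet option hive_partitioning is exposed):

SELECT r['id'] FROM read_parquet('s3://my-bucket/data/*.parquet', hive_partitioning := true) r;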

read_csv(path TEXT or TEXT[], optional_parameters) returns SETOF duckdb.row #

Reads a CSV file, either from a remote storage (via httpfs) or a local storage.

This returns DuckDB rows; you can expand them using * or select specific columns using the r['mycol'] syntax. If you want to select specific columns, you should give the function call an easy alias, like r. For example:

SELECT * FROM read_csv('file.csv');
SELECT r['id'], r['name'] FROM read_csv('file.csv') r WHERE r['age'] > 21;
SELECT COUNT(*) FROM read_csv('file.csv');

Further information:

The following arguments are required:

Name Type Description
path text or text[] The path, either to the remote httpfs storage or the local storage (if enabled) with the CSV files to read. The path can be a glob or array of files to read.

Optional parameters mirror DuckDB read_csv function. To specify optional parameters, use parameter := 'value'.

Compatibility notes:

  • columns is not currently supported.

  • nullstr must be an array (TEXT[]), as shown in the sketch below.
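For example, a minimal sketch passing nullstr as a text array (the file name and values are hypothetical):

SELECT r['id'] FROM read_csv('file.csv', nullstr := ARRAY['NULL', 'N/A']) r;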

read_json(path TEXT or TEXT[], optional_parameters) returns SETOF duckdb.row #

Reads a JSON file, either from a remote storage (via httpfs) or a local storage.

This returns DuckDB rows; you can expand them using * or select specific columns using the r['mycol'] syntax. If you want to select specific columns, you should give the function call an easy alias, like r. For example:

SELECT * FROM read_json('file.json');
SELECT r['id'], r['name'] FROM read_json('file.json') r WHERE r['age'] > 21;
SELECT COUNT(*) FROM read_json('file.json');

Further information:

The following arguments are required:

Name Type Description
path text or text[] The path, either to the remote httpfs storage or the local storage (if enabled) with the JSON files to read. The path can be a glob or array of files to read.

Optional parameters mirror DuckDB read_json function. To specify optional parameters, use parameter := 'value'.

Compatibility notes:

  • columns is not currently supported.

iceberg_scan(path TEXT, optional_parameters) returns SETOF duckdb.row #

Reads an Iceberg table, either from a remote storage (via httpfs) or a local storage.

To use iceberg_scan, you must enable the iceberg extension:

SELECT duckdb.install_extension('iceberg');

This returns DuckDB rows; you can expand them using * or select specific columns using the r['mycol'] syntax. If you want to select specific columns, you should give the function call an easy alias, like r. For example:

SELECT * FROM iceberg_scan('data/iceberg/table');
SELECT r['id'], r['name'] FROM iceberg_scan('data/iceberg/table') r WHERE r['age'] > 21;
SELECT COUNT(*) FROM iceberg_scan('data/iceberg/table');

Further information:

The following arguments are required:

Name Type Description
path text The path, either to a remote httpfs storage or a local storage (if enabled) with the Iceberg table to read.

Optional parameters mirror DuckDB iceberg_scan function. To specify optional parameters, use parameter := 'value'.

Name Type Default Description
allowed_moved_paths boolean false Ensures that some path resolution is performed, which allows scanning Iceberg tables that are moved.
mode text ''
metadata_compression_codec text 'none'
skip_schema_inference boolean false
version text 'version-hint.text'
version_name_format text 'v%s%s.metadata.json,%s%s.metadata.json'

iceberg_metadata(path TEXT, optional_parameters) returns SETOF iceberg_metadata_record #

To use iceberg_metadata, you must enable the iceberg extension:

SELECT duckdb.install_extension('iceberg');

Returns metadata about an Iceberg table. Data is returned as a set of iceberg_metadata_record, which is defined as follows:

CREATE TYPE duckdb.iceberg_metadata_record AS (
  manifest_path TEXT,
  manifest_sequence_number NUMERIC,
  manifest_content TEXT,
  status TEXT,
  content TEXT,
  file_path TEXT
);
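A minimal usage sketch (the path is hypothetical):

SELECT * FROM iceberg_metadata('data/iceberg/table');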

Further information:

The following arguments are required:

Name Type Description
path text The path, either to a remote httpfs storage or a local storage (if enabled) with the Iceberg table to read.

Optional parameters mirror DuckDB iceberg_metadata function. To specify optional parameters, use parameter := 'value'.

Name Type Default Description
allowed_moved_paths boolean false Ensures that some path resolution is performed, which allows scanning Iceberg tables that are moved.
metadata_compression_codec text 'none'
skip_schema_inference boolean false
version text 'version-hint.text'
version_name_format text 'v%s%s.metadata.json,%s%s.metadata.json'

delta_scan(path TEXT) returns SETOF duckdb.row #

Reads a Delta dataset, either from a remote (via httpfs) or a local storage.

To use delta_scan, you must enable the delta extension:

SELECT duckdb.install_extension('delta');

This returns DuckDB rows; you can expand them using * or select specific columns using the r['mycol'] syntax. If you want to select specific columns, you should give the function call an easy alias, like r. For example:

SELECT * FROM delta_scan('/path/to/delta/dataset');
SELECT r['id'], r['name'] FROM delta_scan('/path/to/delta/dataset') r WHERE r['age'] > 21;
SELECT COUNT(*) FROM delta_scan('/path/to/delta/dataset');

Further information:

The following arguments are required:

Name Type Description
path text The path, either to a remote httpfs storage or a local storage (if enabled) with the Delta dataset to read.

G.3.8.2. JSON Functions #

All of the DuckDB json functions and aggregates. Postgres Pro JSON/JSONB functions are not supported.
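For example, a minimal sketch that applies the DuckDB json_extract function to a jsonb column of a regular Postgres Pro table (the table and column names are hypothetical):

SELECT json_extract(payload, '$.user.id') AS user_id FROM events;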

G.3.8.3. DuckDB Administration Functions #

duckdb.install_extension(extension_name TEXT, repository TEXT DEFAULT 'core') returns bool #

Installs a DuckDB extension and configures it to be loaded automatically in every session that uses pgpro_duckdb.

SELECT duckdb.install_extension('iceberg');
SELECT duckdb.install_extension('avro', 'community');

Since this function can be used to install and download any extensions, it can only be executed by a superuser by default. To allow execution by some other admin user, such as my_admin, you can grant such a user the following permissions:

GRANT ALL ON FUNCTION duckdb.install_extension(TEXT, TEXT) TO my_admin;
GRANT ALL ON TABLE duckdb.extensions TO my_admin;
GRANT ALL ON SEQUENCE duckdb.extensions_table_seq TO my_admin;

The following arguments are required:

Name Type Description
extension_name text The name of the extension to install

duckdb.query(query TEXT) returns SETOF duckdb.row #

Executes the given SELECT query directly against DuckDB. This can be useful if DuckDB syntax makes the query easier to write, or if you want to use a function that is not exposed by pgpro_duckdb yet. For example, the query below puts FROM before SELECT and uses a list comprehension; neither feature is supported in Postgres Pro.

SELECT * FROM duckdb.query('FROM range(10) as a(a) SELECT [a for i in generate_series(0, a)] as arr');

duckdb.raw_query(query TEXT) returns void #

Runs an arbitrary query directly against DuckDB. Compared to duckdb.query, this function can execute any query, not just SELECT queries. The main downside is that it does not return its result as rows, but instead sends the query result to the logs. So the recommendation is to use duckdb.query when possible, but if you need to run, e.g., some DDL, you can use this function.
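For example, a minimal sketch; the statement is any valid DuckDB SQL, and its output appears in the server log rather than as a result set:

SELECT duckdb.raw_query('SHOW TABLES');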

duckdb.recycle_ddb() returns void #

pgpro_duckdb keeps the DuckDB instance open between transactions. This is done to preserve session-level state, such as manually executed SET commands. If you want to clear this session-level state for some reason, you can close the currently open DuckDB instance using the following query:

CALL duckdb.recycle_ddb();