G.3. pgpro_duckdb — DuckDB columnar-vectorized analytics engine and features #
pgpro_duckdb is a Postgres Pro extension that embeds the DuckDB columnar-vectorized analytics engine and its features into Postgres Pro. pgpro_duckdb is the key component of the Built-in Analytical Platform, a Postgres Pro solution designed to handle OLAP workloads within a Postgres Pro instance.
Note
The pgpro_duckdb extension is currently in an experimental phase.
G.3.1. Installation #
The pgpro_duckdb extension is provided with Postgres Pro Enterprise as a separate pre-built package pgpro-duckdb-ent-17 (for the detailed installation instructions, see Chapter 17). To enable the pgpro_duckdb extension, create it using the following query:
CREATE EXTENSION pgpro_duckdb;
By default, known extensions are automatically installed and loaded when a DuckDB query depends on them. This behavior can be configured using the duckdb.autoinstall_known_extensions and duckdb.autoload_known_extensions settings.
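For example, a superuser could disable automatic installation cluster-wide. A minimal sketch using standard Postgres Pro configuration commands:

ALTER SYSTEM SET duckdb.autoinstall_known_extensions = false;
SELECT pg_reload_conf();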
It is also possible to install an extension manually. This can be useful when the autoinstall/autoload behavior is disabled, or when DuckDB fails to realize that an extension is necessary to execute the query. Installing an extension requires superuser rights.

SELECT duckdb.install_extension('extname');
Installing an extension causes it to be loaded and installed globally for any connection that uses DuckDB. The current list of installed extensions is maintained in the duckdb.extensions table. Superusers can use this table to view, disable, or uninstall extensions, as follows:
-- install an extension
SELECT duckdb.install_extension('iceberg');

-- view currently installed extensions
SELECT * FROM duckdb.extensions;

-- disable or enable an extension
UPDATE duckdb.extensions SET enabled = (false|true) WHERE name = 'iceberg';

-- remove an extension
DELETE FROM duckdb.extensions WHERE name = 'iceberg';

-- you can also install community extensions
SELECT duckdb.install_extension('duckpgq', 'community');
There is currently no practical difference between a disabled and uninstalled extension.
G.3.2. Supported Extensions #
The following extensions are supported by pgpro_duckdb.
iceberg
iceberg support adds functions to read Iceberg tables and metadata. For a list of the Iceberg functions, see Section G.3.8.
delta
delta support adds the ability to read Delta Lake files via delta_scan.
G.3.3. Secrets #
DuckDB secrets can be configured in the duckdb.secrets table. For example:

-- session_token is optional
INSERT INTO duckdb.secrets (type, key_id, secret, session_token, region)
VALUES ('S3', 'access_key_id', 'secret_access_key', 'session_token', 'us-east-1');
G.3.3.1. Columns #
Name | Type | Required | Description
---|---|---|---
name | text | no | Automatically generated UUID (primary key)
type | text | yes | One of the following: S3 for Amazon S3, GCS for Google Cloud Storage, R2 for Cloudflare R2, or Azure for Azure Blob Storage
key_id | text | S3, GCS, R2 | The “ID” portion of the secret
secret | text | S3, GCS, R2 | The “password” portion of the secret
session_token | text | no | The AWS S3 session token, if required for your credential
region | text | S3 only | For AWS S3, the region of your bucket
endpoint | text | no | If using an S3-compatible service other than AWS, the endpoint of the service
r2_account_id | text | R2 only | If using Cloudflare R2, the account ID for the credential
use_ssl | boolean | no | true by default; false is principally for use with custom minio configurations
scope | text | no | The URL prefix that applies to this credential; used to select between multiple credentials for the same service
connection_string | text | Azure only | Connection string for Azure
G.3.3.2. How It Works #
Secrets are stored in a Postgres Pro heap table. Each time a DuckDB instance is created by pgpro_duckdb, and each time a secret is modified, the secrets are loaded into the DuckDB secrets manager as non-persistent secrets.
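Once a secret is stored, remote paths for that service can be used directly in DuckDB functions. A minimal sketch, assuming the S3 secret inserted above and a hypothetical bucket:

-- the bucket and path are hypothetical
SELECT count(*) FROM read_parquet('s3://my-bucket/events/*.parquet');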
G.3.3.3. Caveats #
Only the secret types listed above are currently supported. As of DuckDB 1.1.3, Postgres Pro secrets are not supported. Other authentication providers, e.g. CHAIN, are not yet supported.
G.3.4. Transactions in pgpro_duckdb #
Multi-statement transactions are supported in pgpro_duckdb. There is one important restriction though, which is currently necessary to ensure the expected ACID guarantees: you cannot write to both a Postgres Pro table and a DuckDB table in the same transaction.
Similarly, you are allowed to run DDL (like CREATE/DROP TABLE) on DuckDB tables inside a transaction, but it is not allowed to combine such statements with DDL involving Postgres Pro objects.
Finally, it is possible to disable this restriction completely and allow writes to both DuckDB and Postgres Pro in the same transaction by setting duckdb.unsafe_allow_mixed_transactions to true. This is not recommended, because the transaction might then be committed only in DuckDB but not in Postgres Pro, which can lead to inconsistencies and even data loss. For example, the following code might result in deleting duckdb_table without copying its contents to pg_table:
BEGIN;
SET LOCAL duckdb.unsafe_allow_mixed_transactions TO true;
CREATE TABLE pg_table AS SELECT * FROM duckdb_table;
DROP TABLE duckdb_table;
COMMIT;
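A safer pattern is usually to split the work into two transactions, so that each engine commits independently. A minimal sketch based on the example above:

-- first transaction: copy the data into a Postgres Pro table
BEGIN;
CREATE TABLE pg_table AS SELECT * FROM duckdb_table;
COMMIT;

-- second transaction: drop the DuckDB table only after the copy has committed
BEGIN;
DROP TABLE duckdb_table;
COMMIT;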
G.3.5. Types #
pgpro_duckdb can read many data types that exist in both Postgres Pro and DuckDB. The following data types are currently supported for use in queries:
- Integer types (integer, bigint, etc.)
- Floating-point types (real, double precision)
- numeric (might get converted to double precision internally; see the limitations in Section G.3.5.1 for details)
- text/varchar/bpchar
- bit-related types, including both fixed-size and variable-size bit arrays
- bytea/blob
- timestamp/timestamptz/date/interval/timestamp_ns/timestamp_ms/timestamp_s
- boolean
- uuid
- json/jsonb
- domain
- arrays of all of the above types, but see the limitations below about multi-dimensional arrays
G.3.5.1. Known Limitations #
The type support in pgpro_duckdb is not yet complete. The following are known issues that you might run into:
- enum types are not supported.
- The DuckDB decimal type does not support the wide range of values that the Postgres Pro numeric type does. To avoid errors when converting between the two, numeric is converted to double precision internally if DuckDB does not support the required precision. This might cause precision loss.
- The DuckDB STRUCT type is not supported.
- The DuckDB timestamp_ns type gets truncated to microseconds when it is converted to the Postgres Pro timestamp type, which loses precision in the output. Operations on a timestamp_ns value, such as sorting, grouping, and comparing, use the full precision.
- jsonb columns are converted to json columns when reading from DuckDB, because DuckDB does not have a jsonb type.
- Many Postgres Pro json and jsonb functions and operators are not implemented in DuckDB. You can use the DuckDB json functions and operators instead. See the DuckDB documentation for more information on these functions.
- The DuckDB tinyint type is converted to a char type in Postgres Pro, because Postgres Pro does not have a tinyint type. This causes it to be displayed as a hex code instead of a regular number.
- Conversion between Postgres Pro multi-dimensional arrays and nested DuckDB LISTs can run into various problems, because neither database supports exactly what the other does. Specifically, Postgres Pro allows different arrays in a column to have a different number of dimensions, e.g. [1] and [[1], [2]] can both occur in the same column; in DuckDB that is not allowed, i.e. the amount of nesting should always be the same. On the other hand, DuckDB allows different lists at the same nesting level to contain a different number of elements, e.g. [[1], [1, 2]], which is not allowed in Postgres Pro. Conversion between these types is therefore only possible when the arrays fall into the common subset. Another possible problem is that pgpro_duckdb uses the Postgres Pro column metadata to determine the number of dimensions of an array. Since Postgres Pro does not complain when you add arrays of different dimensions, the number of dimensions in the column metadata might not match the actual number of dimensions. To solve this, you need to alter the column type:

  -- this configures the column to be a 3-dimensional array of text
  ALTER TABLE s ALTER COLUMN a SET DATA TYPE text[][][];

- For domain types, the domain constraint check during an INSERT operation is performed by Postgres Pro rather than DuckDB. During a SELECT operation, if the type of a queried field is a domain, it is converted to the corresponding base type and handled by DuckDB. A sketch of this behavior is shown after this list.
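A minimal sketch of the domain behavior described in the last item (the domain and table names are hypothetical):

-- hypothetical domain and table
CREATE DOMAIN percentage AS integer CHECK (VALUE BETWEEN 0 AND 100);
CREATE TABLE scores (id int, score percentage);

-- the CHECK constraint is enforced by Postgres Pro on INSERT
INSERT INTO scores VALUES (1, 150);  -- fails with a domain constraint violation

-- on SELECT, the domain column is handled by DuckDB as its base type (integer)
SET duckdb.force_execution = true;
SELECT avg(score) FROM scores;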
G.3.6. Special Types #
pgpro_duckdb introduces a few special Postgres Pro types. You should not create these types explicitly and normally you do not need to know about their existence, but they might show up in error messages from Postgres Pro. These are explained below:
duckdb.row #

The duckdb.row type is returned by functions like read_parquet, read_csv, iceberg_scan, etc. Depending on the arguments of these functions, they can return rows with different columns and types. Postgres Pro does not support such functions well at this point in time, so for now a custom type is returned. To get the actual columns out of these rows, you have to use the “square bracket indexing” syntax, similar to how you would extract a field from a json value:

SELECT r['id'], r['name']
FROM read_parquet('file.parquet') r
WHERE r['age'] > 21;
Using SELECT * will result in the columns of this row being expanded, so your query result will never have a column that has duckdb.row as its type:

SELECT * FROM read_parquet('file.parquet');
Due to limitations in Postgres Pro, there are some restrictions when using a function that returns a duckdb.row in a CTE or subquery. The main problem is that pgpro_duckdb cannot automatically assign useful aliases to the columns selected from the row. So while this query without a CTE/subquery returns the r['company'] column as company:

SELECT r['company'] FROM duckdb.query($$ SELECT 'DuckDB Labs' company $$) r;
--    company
-- ─────────────
--  DuckDB Labs
The same query in a subquery or CTE will return the column simply as r:

WITH mycte AS (
    SELECT r['company'] FROM duckdb.query($$ SELECT 'DuckDB Labs' company $$) r
)
SELECT * FROM mycte;
--       r
-- ─────────────
--  DuckDB Labs
This is easy to work around by adding an explicit alias to the column in the CTE/subquery:

WITH mycte AS (
    SELECT r['company'] AS company FROM duckdb.query($$ SELECT 'DuckDB Labs' company $$) r
)
SELECT * FROM mycte;
--    company
-- ─────────────
--  DuckDB Labs
Another limitation that can be similarly confusing: if you use SELECT * inside the CTE/subquery and want to reference a specific column outside the CTE/subquery, you still need to use the r['colname'] syntax instead of simply colname. So while this works as expected:

WITH mycte AS (
    SELECT * FROM duckdb.query($$ SELECT 'DuckDB Labs' company $$) r
)
SELECT * FROM mycte;
--    company
-- ─────────────
--  DuckDB Labs
The following query will throw an error:

WITH mycte AS (
    SELECT * FROM duckdb.query($$ SELECT 'DuckDB Labs' company $$) r
)
SELECT * FROM mycte WHERE company = 'DuckDB Labs';
-- ERROR:  42703: column "company" does not exist
-- LINE 5: SELECT * FROM mycte WHERE company = 'DuckDB Labs';
--                                   ^
-- HINT:  If you use DuckDB functions like read_parquet, you need to use the r['colname'] syntax to use columns. If you're already doing that, maybe you forgot to give the function the r alias.
This is easy to work around by using the r['colname'] syntax like so:

WITH mycte AS (
    SELECT * FROM duckdb.query($$ SELECT 'DuckDB Labs' company $$) r
)
SELECT * FROM mycte WHERE r['company'] = 'DuckDB Labs';
--    company
-- ─────────────
--  DuckDB Labs
duckdb.unresolved_type #

The duckdb.unresolved_type type is used to make Postgres Pro understand an expression whose type is not known at query parse time. This is the type of any column extracted from a duckdb.row using the r['mycol'] syntax. Many operators and aggregates return a duckdb.unresolved_type when one of the sides of the operator is of the type duckdb.unresolved_type, for instance r['age'] + 10.

Once the query gets executed by DuckDB, the actual type is filled in by DuckDB. So a query result will never contain a column that has duckdb.unresolved_type as its type, and generally you should not even realize that this type exists.

You might get errors saying that functions or operators do not exist for the duckdb.unresolved_type, such as:

ERROR:  function some_func(duckdb.unresolved_type) does not exist
LINE 6:     some_func(r['somecol']) as somecol

In such cases, a simple workaround is often to add an explicit cast to the type that the function accepts, such as some_func(r['somecol']::text) as somecol.
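For example, a minimal sketch that casts an unresolved column to text so that a regular function can resolve (the file name is hypothetical):

-- cast the unresolved column to text so upper() can resolve it
SELECT upper(r['name']::text) AS name_upper
FROM read_csv('people.csv') r;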
duckdb.json #

The duckdb.json type is used for the arguments of DuckDB JSON functions. This type exists so that these functions can accept values of type json, jsonb, and duckdb.unresolved_type.
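For instance, a DuckDB JSON function can be called directly on a jsonb value; a minimal sketch, assuming json_extract is among the DuckDB JSON functions exposed (see Section G.3.8.2):

SELECT json_extract('{"user": {"id": 42}}'::jsonb, '$.user.id');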
G.3.7. Configuration Parameters #
Many of these parameters correspond directly to specific DuckDB settings.
duckdb.force_execution #

Force queries to use DuckDB execution. This is only necessary when a query accesses only Postgres Pro tables; as soon as you use a DuckDB-only feature, DuckDB execution is used automatically. DuckDB-only features are reading from DuckDB tables, using DuckDB functions (like read_parquet), and COPY to remote storage (s3://, etc.).

Default: false
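For example, to run a query that touches only a regular Postgres Pro table through the DuckDB engine (the table name is hypothetical):

SET duckdb.force_execution = true;
SELECT count(*) FROM mytable;  -- now executed by DuckDB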
G.3.7.1. Security Settings #
duckdb.postgres_role #

Which Postgres Pro role is allowed to use DuckDB execution and the secrets. Defaults to superusers only. If this is configured but the role does not exist when running CREATE EXTENSION pgpro_duckdb, it is created automatically. This role has access to DuckDB secrets.

Default: ""

Access: needs to be set in the postgresql.conf file and requires a restart

duckdb.disabled_filesystems #

Disable access to the specified file systems. This setting only applies to non-superusers; superusers can always access all file systems. Unless you completely trust the user in duckdb.postgres_role, it is recommended to disable LocalFileSystem. Otherwise, that user can trivially read and write any file on the machine that the Postgres Pro process can.

Default: "LocalFileSystem"

Access: Superuser-only
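For example, to let non-superusers access local files (only do this if you fully trust the role in duckdb.postgres_role), a superuser can clear the setting. A minimal sketch, assuming an empty list disables nothing:

-- allow non-superusers to access the local file system again
SET duckdb.disabled_filesystems = '';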
duckdb.autoinstall_known_extensions #

Whether known extensions are allowed to be automatically installed when a DuckDB query depends on them.

Default: true

Access: Superuser-only

duckdb.autoload_known_extensions #

Whether known extensions are allowed to be automatically loaded when a DuckDB query depends on them.

Default: true

duckdb.allow_community_extensions #

Whether community extensions are allowed to be installed.

Default: false

Access: Superuser-only

duckdb.enable_external_access #

Allow DuckDB to access external storage (e.g., HTTP, S3, etc.). This setting is not tested very well yet, and disabling it may break pgpro_duckdb functionality in unintended ways.

Default: true

Access: Superuser-only
G.3.7.2. Resource Management Settings #
Since any connection that uses DuckDB has its own DuckDB instance, these settings are per-connection. When using pgpro_duckdb in many concurrent connections, it can be a good idea to set some of these more conservatively than their defaults.
duckdb.max_memory / duckdb.memory_limit #

The maximum memory DuckDB can use within a single Postgres Pro connection. This is somewhat comparable to the Postgres Pro work_mem setting. When set to an empty string, DuckDB's normal default is used, which is 80% of RAM.

Default: "4GB"

Access: Superuser-only

duckdb.threads / duckdb.worker_threads #

Maximum number of DuckDB threads per Postgres Pro connection. -1 means to use DuckDB's default, which is the number of CPU cores on the machine.

Default: -1

Access: Superuser-only

duckdb.max_workers_per_postgres_scan #

Maximum number of Postgres Pro workers used for a single Postgres Pro scan. This is similar to the Postgres Pro max_parallel_workers_per_gather setting.

Default: 2

Access: General
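For example, with many concurrent analytical connections, a superuser might cap per-connection resources more conservatively. A minimal sketch:

-- limit each connection's DuckDB instance to 1GB of memory and 2 threads
ALTER SYSTEM SET duckdb.memory_limit = '1GB';
ALTER SYSTEM SET duckdb.threads = 2;
SELECT pg_reload_conf();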
G.3.8. Functions #
By default, the functions listed below without a schema prefix are installed into public. You can choose to install these functions into an alternate location by running CREATE EXTENSION pgpro_duckdb WITH SCHEMA schema. ALTER EXTENSION pgpro_duckdb WITH SCHEMA schema is not currently supported.
G.3.8.1. Data Lake Functions #
read_parquet(path TEXT or TEXT[], optional_parameters) returns SETOF duckdb.row #

Reads a Parquet file, either from a remote storage (via httpfs) or a local storage.

This returns DuckDB rows; you can expand them using * or select specific columns using the r['mycol'] syntax. If you want to select specific columns, you should give the function call an easy alias, like r. For example:

SELECT * FROM read_parquet('file.parquet');
SELECT r['id'], r['name'] FROM read_parquet('file.parquet') r WHERE r['age'] > 21;
SELECT COUNT(*) FROM read_parquet('file.parquet');
Further information:

The following arguments are required:

Name | Type | Description
---|---|---
path | text or text[] | The path, either to the remote httpfs storage or the local storage (if enabled), with the Parquet files to read. The path can be a glob or an array of files to read.

Optional parameters mirror the DuckDB read_parquet function. To specify optional parameters, use parameter := 'value'.
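For example, to pass an optional parameter (hive_partitioning is one of the standard DuckDB read_parquet parameters):

-- expose hive-style partition columns of a partitioned dataset
SELECT * FROM read_parquet('data/*/*.parquet', hive_partitioning := true);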
.-
read_csv(path TEXT or TEXT[],
#optional_parameters
) returns SETOFduckdb.row
Reads a CSV file, either from a remote storage (via httpfs) or a local storage.
This returns DuckDB rows, you can expand them using
*
or you can select specific columns using ther['mycol']
syntax. If you want to select specific columns, you should give the function call an easy alias, liker
. For example:SELECT * FROM read_csv('file.csv'); SELECT r['id'], r['name'] FROM read_csv('file.csv') r WHERE r['age'] > 21; SELECT COUNT(*) FROM read_csv('file.csv');
Further information:

The following arguments are required:

Name | Type | Description
---|---|---
path | text or text[] | The path, either to the remote httpfs storage or the local storage (if enabled), with the CSV files to read. The path can be a glob or an array of files to read.

Optional parameters mirror the DuckDB read_csv function. To specify optional parameters, use parameter := 'value'.

Compatibility notes:

- columns is not currently supported.
- nullstr must be an array (TEXT[]).
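For example, passing several optional parameters (delim and header are standard DuckDB read_csv parameters; nullstr is passed as an array, per the compatibility notes above):

SELECT * FROM read_csv('file.csv', delim := ';', header := true, nullstr := ARRAY['NULL', 'NA']);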
read_json(path TEXT or TEXT[], optional_parameters) returns SETOF duckdb.row #

Reads a JSON file, either from a remote storage (via httpfs) or a local storage.

This returns DuckDB rows; you can expand them using * or select specific columns using the r['mycol'] syntax. If you want to select specific columns, you should give the function call an easy alias, like r. For example:

SELECT * FROM read_json('file.json');
SELECT r['id'], r['name'] FROM read_json('file.json') r WHERE r['age'] > 21;
SELECT COUNT(*) FROM read_json('file.json');
Further information:

The following arguments are required:

Name | Type | Description
---|---|---
path | text or text[] | The path, either to the remote httpfs storage or the local storage (if enabled), with the JSON files to read. The path can be a glob or an array of files to read.

Optional parameters mirror the DuckDB read_json function. To specify optional parameters, use parameter := 'value'.

Compatibility notes:

- columns is not currently supported.
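For example, passing an optional parameter (format is one of the standard DuckDB read_json parameters):

-- read a file that contains one top-level JSON array of records
SELECT * FROM read_json('file.json', format := 'array');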
iceberg_scan(path TEXT, optional_parameters) returns SETOF duckdb.row #

Reads an Iceberg table, either from a remote storage (via httpfs) or a local storage.

To use iceberg_scan, you must enable the iceberg extension:

SELECT duckdb.install_extension('iceberg');

This returns DuckDB rows; you can expand them using * or select specific columns using the r['mycol'] syntax. If you want to select specific columns, you should give the function call an easy alias, like r. For example:

SELECT * FROM iceberg_scan('data/iceberg/table');
SELECT r['id'], r['name'] FROM iceberg_scan('data/iceberg/table') r WHERE r['age'] > 21;
SELECT COUNT(*) FROM iceberg_scan('data/iceberg/table');
Further information:

The following arguments are required:

Name | Type | Description
---|---|---
path | text | The path, either to a remote httpfs storage or a local storage (if enabled), with the Iceberg table to read.

Optional parameters mirror the DuckDB iceberg_scan function. To specify optional parameters, use parameter := 'value'.

Name | Type | Default | Description
---|---|---|---
allowed_moved_paths | boolean | false | Ensures that some path resolution is performed, which allows scanning Iceberg tables that have been moved
mode | text | '' |
metadata_compression_codec | text | 'none' |
skip_schema_inference | boolean | false |
version | text | 'version-hint.text' |
version_name_format | text | 'v%s%s.metadata.json,%s%s.metadata.json' |
iceberg_metadata(path TEXT, optional_parameters) returns SETOF iceberg_metadata_record #

Returns metadata about an Iceberg table.

To use iceberg_metadata, you must enable the iceberg extension:

SELECT duckdb.install_extension('iceberg');

Data is returned as a set of iceberg_metadata_record, which is defined as follows:

CREATE TYPE duckdb.iceberg_metadata_record AS (
    manifest_path TEXT,
    manifest_sequence_number NUMERIC,
    manifest_content TEXT,
    status TEXT,
    content TEXT,
    file_path TEXT
);
Further information:

The following arguments are required:

Name | Type | Description
---|---|---
path | text | The path, either to a remote httpfs storage or a local storage (if enabled), with the Iceberg table to read.

Optional parameters mirror the DuckDB iceberg_metadata function. To specify optional parameters, use parameter := 'value'.

Name | Type | Default | Description
---|---|---|---
allowed_moved_paths | boolean | false | Ensures that some path resolution is performed, which allows scanning Iceberg tables that have been moved
metadata_compression_codec | text | 'none' |
skip_schema_inference | boolean | false |
version | text | 'version-hint.text' |
version_name_format | text | 'v%s%s.metadata.json,%s%s.metadata.json' |
delta_scan(path TEXT) returns SETOF duckdb.row #

Reads a Delta dataset, either from a remote storage (via httpfs) or a local storage.

To use delta_scan, you must enable the delta extension:

SELECT duckdb.install_extension('delta');

This returns DuckDB rows; you can expand them using * or select specific columns using the r['mycol'] syntax. If you want to select specific columns, you should give the function call an easy alias, like r. For example:

SELECT * FROM delta_scan('/path/to/delta/dataset');
SELECT r['id'], r['name'] FROM delta_scan('/path/to/delta/dataset') r WHERE r['age'] > 21;
SELECT COUNT(*) FROM delta_scan('/path/to/delta/dataset');
Further information:

The following arguments are required:

Name | Type | Description
---|---|---
path | text | The path, either to a remote httpfs storage or a local storage (if enabled), with the Delta dataset to read.
G.3.8.2. JSON Functions #
All DuckDB json functions and aggregates are provided. Postgres Pro JSON/JSONB functions are not supported.
G.3.8.3. DuckDB Administration Functions #
duckdb.install_extension(extension_name TEXT, repository TEXT DEFAULT 'core') returns bool #

Installs a DuckDB extension and configures it to be loaded automatically in every session that uses pgpro_duckdb.

SELECT duckdb.install_extension('iceberg');
SELECT duckdb.install_extension('avro', 'community');

Since this function can be used to install and download any extension, it can only be executed by a superuser by default. To allow execution by some other admin user, such as my_admin, you can grant such a user the following permissions:

GRANT ALL ON FUNCTION duckdb.install_extension(TEXT, TEXT) TO my_admin;
GRANT ALL ON TABLE duckdb.extensions TO my_admin;
GRANT ALL ON SEQUENCE duckdb.extensions_table_seq TO my_admin;

The following arguments are required:

Name | Type | Description
---|---|---
extension_name | text | The name of the extension to install
duckdb.query(query TEXT) returns SETOF duckdb.row #

Executes the given SELECT query directly against DuckDB. This can be useful if DuckDB syntax makes the query easier to write, or if you want to use a function that is not yet exposed by pgpro_duckdb. For example, the query below puts FROM before SELECT and uses a list comprehension; neither feature is supported in Postgres Pro.

SELECT * FROM duckdb.query('FROM range(10) as a(a) SELECT [a for i in generate_series(0, a)] as arr');
duckdb.raw_query(query TEXT) returns void #

Runs an arbitrary query directly against DuckDB. Compared to duckdb.query, this function can execute any query, not just SELECT queries. The main downside is that it does not return its result as rows; instead it sends the query result to the logs. So the recommendation is to use duckdb.query when possible, but if you need to run e.g. some DDL, you can use this function.
when possible, but if you need to run e.g. some DDL you can use this function.-
duckdb.recycle_ddb() returns void
# pgpro_duckdb keeps the DuckDB instance open inbetween transactions. This is done to save session level state, such as manually done
SET
commands. If you want to clear this session level state for some reason you can close the currently open DuckDB instance using the following query:CALL duckdb.recycle_ddb();