5.2. Metadata Tables #

The metadata stored in metadata tables of pgpro_metastore is required for the following functionality:

  • Creating analytical views.

  • Querying analytical tables that use partitioning and indexes of Parquet files.

  • Supporting basic ETL operations:

    • performing ETL operations

    • scheduling ETL operations to be performed at a certain date and time, as well as sequentially (one after another)

    • limiting resources available for ETL operations to avoid impacting the OLTP workload

  • Working with the OLAP data:

    • creating Parquet files with the OLAP data from Postgres Pro tables and importing this data to analytical tables

    • exporting the OLAP data from analytical tables to Parquet, CSV, and JSON files

    • creating reports with filtered OLAP data using SQL queries

  • Making transaction updates to pgpro_metastore objects, as well as providing consistent updates of analytical tables when reading and writing to these tables.

  • Checking data integrity and restoring Parquet files from backups or by using S3 storage mirrors.

  • Monitoring pgpro_metastore and notifying administrators about events requiring their immediate attention.

  • Storing Parquet files in separate storages to avoid competition with the OLTP workload for I/O throughput.

The schema of metadata tables is shown in the diagram below.

Figure 5.1. Schema of Metadata Tables


For more information about each metadata table, refer to corresponding subsections of this section.

5.2.1. pga_snapshot Metadata Table #

This metadata table contains information about snapshots.

Column Name

Column Type

Constraint

Description

snapshot_id

BIGINT

Primary key

The snapshot ID.

The value is increased by 1 for each created snapshot.

This is a primary key referred to by most metadata tables.

snapshot_time

TIMESTAMP

NOT NULL

The snapshot creation date and time.

schema_version

BIGINT

NOT NULL

The pgpro_metastore catalog version.

The value is increased by 1 at each pgpro_metastore object update, for example, when a new analytical table is created.

next_catalog_id

BIGINT

NOT NULL

The ID assigned to the next pgpro_metastore object, such as an analytical table.

The value is increased each time an object is created.

next_file_id

BIGINT

NOT NULL

The ID assigned to the next created Parquet file.

The value is increased each time a file is created.

is_visible

BOOLEAN

NOT NULL

Specifies whether pgpro_metastore objects referring to the snapshot are visible.

Possible values:

  • true

  • false

When an ETL operation, such as add_table, creates multiple objects referring to the same snapshot, the is_visible value is set to false for this snapshot. New objects are not visible until the operation is executed.

After this, the is_visible value is set to true for the snapshot, and new objects become visible, i.e., they are in the SELECT command output.

You can use the is_visible column to temporarily hide an object without deleting it and then make it visible again later.

5.2.2. pga_snapshot_changes Metadata Table #

This metadata table contains information about snapshot changes. You can use it to resolve conflicts between different snapshots.

Column Name

Column Type

Constraint

Description

snapshot_id

BIGINT

Primary key, foreign key

The ID of the snapshot from the pga_snapshot metadata table.

changes_made

VARCHAR

-

A comma-separated list of snapshot changes in the following formats:

  • created_schema:schema_name: The snapshot is associated with creating an analytical schema.

  • created_table:table_name: The snapshot is associated with creating an analytical table.

  • created_view:view_name: The snapshot is associated with creating a view.

  • inserted_into_table:table_ID: The snapshot is associated with inserting the OLAP data in an analytical table.

  • deleted_from_table:table_ID: The snapshot is associated with deleting the OLAP data from an analytical table.

  • dropped_schema:schema_ID: The snapshot is associated with deleting an analytical schema.

  • dropped_table:table_ID: The snapshot is associated with deleting an analytical table.

  • dropped_view:view_ID: The snapshot is associated with deleting a view.

  • altered_table:table_ID: The snapshot is associated with updating metadata of an analytical table.

  • altered_view:view_ID: The snapshot is associated with updating metadata of a view.

5.2.3. pga_schema Metadata Table #

This metadata table contains information about analytical schemas.

Column Name

Column Type

Constraint

Description

schema_id

BIGINT

Primary key

The analytical schema ID assigned by increasing the next_catalog_id value of the pga_snapshot metadata table.

schema_uuid

UUID

NOT NULL, UNIQUE

The persistent analytical schema ID stored for compatibility with Postgres Pro schemas.

begin_snapshot

BIGINT

Foreign key, NOT NULL

The snapshot from the pga_snapshot metadata table associated with the analytical schema creation date and time.

end_snapshot

BIGINT

Foreign key

The snapshot from the pga_snapshot metadata table associated with the analytical schema deletion date and time.

If the value is NULL, the analytical schema is not deleted.

schema_name

VARCHAR

NOT NULL, UNIQUE

The unique analytical schema name, for example, my_schema.

5.2.4. pga_table Metadata Table #

This metadata table contains information about analytical tables.

Column Name

Column Type

Constraint

Description

table_id

BIGINT

Primary key

The analytical table ID assigned by increasing the next_catalog_id value of the pga_snapshot metadata table.

table_uuid

UUID

NOT NULL, UNIQUE

The persistent analytical table ID stored for compatibility with Postgres Pro tables.

begin_snapshot

BIGINT

Foreign key, NOT NULL

The snapshot from the pga_snapshot metadata table associated with the analytical table creation date and time.

end_snapshot

BIGINT

Foreign key

The snapshot from the pga_snapshot metadata table associated with the analytical table deletion date and time.

If the value is NULL, the analytical table is not deleted.

schema_id

BIGINT

NOT NULL, UNIQUE

The ID of the analytical schema from the pga_schema metadata table where the analytical table is created.

table_name

VARCHAR

NOT NULL

The analytical table name, for example, my_table.

table_name values are not UNIQUE because pgpro_metastore does not delete metadata of analytical tables. Deleted analytical tables are rather associated with end_snapshot values to indicate that these tables can no longer be used.

If table_name values were UNIQUE, you would not be able to create a new analytical table with the same name as a previously deleted table.

path

VARCHAR

NOT NULL, UNIQUE

The path to the shared directory relative to the uri value of the pga_storage metadata table.

The absolute path is generated by concatenating the uri value of the pga_storage metadata table and the path value of the pga_table metadata table.

storage_id

BIGINT

Foreign key, NOT NULL

The ID of the storage from the pga_storage metadata table where the analytical table is created.

5.2.5. pga_storage Metadata Table #

This metadata table contains information about storages.

Column Name

Column Type

Constraint

Description

storage_id

BIGINT

Primary key

The storage ID assigned by increasing the next_catalog_id value of the pga_snapshot metadata table.

begin_snapshot

BIGINT

Foreign key, NOT NULL

The snapshot from the pga_snapshot metadata table associated with the storage creation date and time.

end_snapshot

BIGINT

Foreign key

The snapshot from the pga_snapshot metadata table associated with the storage deletion date and time.

If the value is NULL, the storage is not deleted.

storage_name

VARCHAR

NOT NULL

The storage name, for example, my_storage.

storage_name values are not UNIQUE because pgpro_metastore does not delete metadata of storages. Deleted storages are rather associated with end_snapshot values to indicate that these storages can no longer be used.

If storage_name values were UNIQUE, you would not be able to create a new storage with the same name as a previously deleted storage.

uri

VARCHAR

NOT NULL, UNIQUE

The URI of the root storage directory.

Local storages have simple URIs, for example, file:///home/j.doe/workspace/axe/.

S3 storages have URIs with more complicated structure and parsing, for example, s3://premdb/team01.csv?versionId=w_B5qT8s5MkiT09.IRHay0lW.PycsHTS. These URIs are generated using third-party libraries.

tmp_uri

VARCHAR

NOT NULL, UNIQUE

The URI of the storage directory for temporary files.

Local storages have simple URIs, for example, file:///home/j.doe/workspace/axe/.

S3 storages have URIs with more complicated structure and parsing, for example, s3://premdb/team01.csv?versionId=w_B5qT8s5MkiT09.IRHay0lW.PycsHTS. These URIs are generated using third-party libraries.

storage_type

BIGINT

NOT NULL

The storage type.

Currently, local, network, and S3 storages are supported.

min_space

BIGINT

NOT NULL

The minimum free space for the storage, in bytes.

It is verified when executing the add_storage ETL operation or any operation that requires removing temporary files from the directory specified by the tmp_uri value.

uri_id

BIGINT

Foreign key, NOT NULL, UNIQUE

The ID of the URI from the pga_uri metadata table associated with the root storage directory.

tmp_uri_id

BIGINT

Foreign key, NOT NULL, UNIQUE

The ID of the URI from the pga_uri metadata table associated with the storage directory for temporary files.

5.2.6. pga_uri Metadata Table #

This metadata table contains system pgpro_metastore information, and it is not intended to be updated by users.

The uri column contains initial serialized URIs associated with uri or tmp_uri values of the pga_storage metadata table.

The path column contains paths to root directories of storages. This is the only relevant column for local storages, and its values usually match uri values, except for prefixes specifying the storage type:

  • file:// for local storages

  • s3:// for S3 storages

Remaining columns contain arbitrary URIs for S3 storages.

For more information, refer to RFS 3968 and to the official AWS documentation.

5.2.7. pga_folder Metadata Table #

This metadata table contains information about shared directories used for storing Parquet files. For example, you can use these directories as a source of Parquet files for the add_files ETL operation.

Column Name

Column Type

Constraint

Description

folder_id

BIGINT

Primary key

The shared directory ID assigned by increasing the next_catalog_id value of the pga_snapshot metadata table.

begin_snapshot

BIGINT

Foreign key, NOT NULL

The snapshot from the pga_snapshot metadata table associated with the shared directory creation date and time.

end_snapshot

BIGINT

Foreign key

The snapshot from the pga_snapshot metadata table associated with the shared directory deletion date and time.

If the value is NULL, the shared directory is not deleted.

path

VARCHAR

NOT NULL

The path to the shared directory relative to the root storage directory.

The absolute path is generated by concatenating the uri value of the pga_storage metadata table and the path value of the pga_folder metadata table.

folder_name

VARCHAR

NOT NULL

The shared directory name, for example, my_directory.

folder_name values are not UNIQUE because pgpro_metastore does not delete metadata of shared directories. Deleted directories are rather associated with end_snapshot values to indicate that these directories can no longer be used.

If folder_name values were UNIQUE, you would not be able to create a new shared directory with the same name as a previously deleted directory.

storage_id

BIGINT

Foreign key, NOT NULL

The ID of the storage from the pga_storage metadata table where the shared directory is created.

5.2.8. pga_column Metadata Table #

This metadata table contains information about columns of analytical tables.

Column Name

Column Type

Constraint

Description

column_id

BIGINT

Primary key

The column ID.

If the initial Parquet file contains the field_id column ID, this ID is used as the column_id value. The ID is not updated until the column is deleted.

column_id values must be unique.

begin_snapshot

BIGINT

Foreign key, NOT NULL

The snapshot from the pga_snapshot metadata table associated with the column creation date and time.

end_snapshot

BIGINT

Foreign key

The snapshot from the pga_snapshot metadata table associated with the column deletion date and time.

If the value is NULL, the column is not deleted.

table_id

BIGINT

Foreign key, NOT NULL

The ID of the analytical table from the pga_table metadata table where the column is created.

column_order

BIGINT

NOT NULL

The sequence number of the column.

column_order values must be unique. However, their sequence can be interrupted, i.e., empty values are possible.

column_name

VARCHAR

NOT NULL

The column name, for example, my_column.

column_name values are not UNIQUE because pgpro_metastore does not delete metadata of columns. Deleted columns are rather associated with end_snapshot values to indicate that these columns can no longer be used.

If column_name values were UNIQUE, you would not be able to create a new column with the same name as a previously deleted column.

column_type

VARCHAR

NOT NULL

The column type corresponding to one of the supported pgpro_metastore data types.

initial_default

VARCHAR

-

The initial column value after its creation, for example, after executing ALTER TABLE.

It can be NULL.

default_value

VARCHAR

-

The default column value used in upsert operations.

It can be NULL.

nulls_allowed

BOOLEAN

-

Specifies whether the column can contain NULL values.

Possible values:

  • true

  • false

is_uniq

BOOLEAN

-

Specifies whether the column can contain repeating values for different rows.

Possible values:

  • true

  • false

5.2.9. pga_data_file Metadata Table #

This metadata table contains information about Parquet files.

Column Name

Column Type

Constraint

Description

data_file_id

BIGINT

Primary key

The Parquet file ID assigned by increasing the next_file_id value of the pga_snapshot metadata table.

begin_snapshot

BIGINT

Foreign key, NOT NULL

The snapshot from the pga_snapshot metadata table associated with the Parquet file creation date and time.

end_snapshot

BIGINT

Foreign key

The snapshot from the pga_snapshot metadata table associated with the Parquet file deletion date and time.

If the value is NULL, the Parquet file is not deleted.

table_id

BIGINT

Foreign key, NOT NULL

The ID of the analytical table from the pga_table metadata table associated with the Parquet file.

file_order

BIGINT

NOT NULL

The vertical position of the Parquet file.

file_order values must be unique. However, their sequence can be interrupted, i.e., empty values are possible.

path

VARCHAR

NOT NULL, UNIQUE

The path to the Parquet file, for example, my_file.parquet.

path_is_relative

BOOLEAN

NOT NULL

Specifies whether the path to the Parquet file is relative to the path to the analytical table or absolute.

Possible values:

  • true

  • false

Default value: false.

file_format

VARCHAR

NOT NULL

The file format.

Currently, only the Parquet format is supported.

record_count

BIGINT

NOT NULL

The number of rows in the Parquet file.

file_size_bytes

BIGINT

NOT NULL

The Parquet file size, in bytes.

row_id_start

BIGINT

NOT NULL

The row_id value for the first row corresponding to the Parquet file. Indicates where rows corresponding to the Parquet file start.

checksum_crc32

BIGINT

NOT NULL

The current checksum value of the Parquet file.

It is used for data integrity checks.

5.2.10. pga_files_scheduled_for_deletion Metadata Table #

This metadata table contains information about Parquet files that are not associated with snapshots and are scheduled for deletion. Parquet files are disassociated from snapshots after executing the expire_snapshot ETL operation, which marks snapshots as expired.

Column Name

Column Type

Constraint

Description

data_file_id

BIGINT

Primary key, foreign key

The ID of the Parquet file from the pga_data_file metadata table.

path

VARCHAR

NOT NULL, UNIQUE

The path to the Parquet file, for example, my_file.parquet.

path_is_relative

BOOLEAN

NOT NULL

Specifies whether the path to the Parquet file is relative to the path to the analytical table or absolute.

Possible values:

  • true

  • false

Default value: false.

schedule_start

TIMESTAMPTZ

NOT NULL

The scheduled Parquet file deletion date and time.

5.2.11. pga_table_stats Metadata Table #

This metadata table contains statistics of analytical tables.

Column Name

Column Type

Constraint

Description

table_id

BIGINT

Primary key, foreign key

The ID of the analytical table from the pga_table metadata table.

record_count

BIGINT

NOT NULL

The number of rows in the analytical table.

next_row_id

BIGINT

NOT NULL

The row_id value for the next column inserted in the analytical table.

file_size_bytes

BIGINT

NOT NULL

The total size of all Parquet files associated with the analytical table, in bytes.

5.2.12. pga_table_column_stats Metadata Table #

This metadata table contains statistics of analytical table columns.

Column Name

Column Type

Constraint

Description

column_id

BIGINT

Primary key, foreign key

The ID of the column from the pga_column metadata table.

contains_null

BOOLEAN

NOT NULL

Specifies whether the column can contain NULL values.

Possible values:

  • true

  • false

contains_nan

BOOLEAN

-

Specifies whether the column can contain NaN values.

Possible values:

  • true

  • false

This is relevant only for FLOAT, DOUBLE, and NUMERIC data types.

min_value

VARCHAR

-

The minimum column value.

It is displayed as a string for consistency and must be converted to the current column data type when used.

max_value

VARCHAR

-

The maximum column value.

It is displayed as a string for consistency and must be converted to the current column data type when used.

5.2.13. pga_file_column_statistics Metadata Table #

This metadata table contains statistics of analytical table columns in the context of Parquet files.

Column Name

Column Type

Constraint

Description

data_file_id

BIGINT

Primary key, foreign key

The ID of the Parquet file from the pga_data_file metadata table.

column_id

BIGINT

Primary key, foreign key

The ID of the column from the pga_column metadata table.

column_size

BIGINT

NOT NULL

The column size, in bytes.

value_count

BIGINT

NOT NULL

The number of column elements in the Parquet file.

This is not the same as the number of Parquet file entries since there can be nested entries, such as structures and lists.

null_count

BIGINT

NOT NULL

The number of NULL values in the column.

min_value

VARCHAR

-

The minimum column value in the Parquet file.

You can use it for estimating the upper bound of the minimum column value.

max_value

VARCHAR

-

The maximum column value in the Parquet file.

You can use it for estimating the lower bound of the maximum column value.

5.2.14. pga_transaction_log Metadata Table #

This metadata table contains logging information about pgpro_metastore tasks analogous to standard DBMS transactions.

Column Name

Column Type

Constraint

Description

lsn

BIGSERIAL

Primary key

The transaction log sequence number (LSN).

txid

BIGINT

NOT NULL

The task ID.

op_id

INT

-

The ETL operation ID.

One task can consist of multiple ETL operations.

kind

VARCHAR

NOT NULL

The ETL operation type, such as add_storage, add_table, add_files, and create_view.

attr

JSONB

-

The initial and runtime data associated with the task.