F.51. pgpro_sfile — storage for large objects #

The pgpro_sfile module allows storing multiple large objects. It is similar to Oracle LOBs. pgpro_sfile provides an engine for storing large objects in a set of tables controlled by the extension. Each object consists of chunks, or blocks. Blocks are split into pages of ~8kB size. The maximum number of objects, number of blocks and object size in bytes are limited by the size of the bigint type (2^63 - 1). A pgpro_sfile large object is called sfile object.

F.51.1. Storage Engine #

sfile objects are stored in tables created and controlled by pgpro_sfile. Regular tables only store the identifier of each sfile object as a reference to tables created by pgpro_sfile. Once an sfile object is created, it can be loaded, read, truncated and deleted, and these operations do not affect regular tables where it is used.

The storage engine is a set of the following storage tables:

  • SF_DESCRIPTOR — a registry where descriptors of all sfile objects are stored.

  • SF_PARTITION — a registry of storage tables (partitions). Stores information about SF_PAGE_XX tables.

  • SF_BLOCK — a registry of sfile object blocks. Each block can include several pages of sfile objects.

  • SF_PAGE_XX — tables with sfile object data pages.

  • SF_OPTION — a table with sfile object options. Currently there is one option TABLESPACE, which allows setting the tablespace where sfile objects are stored.

All the tables, except SF_OPTION, are created in a special pgpro_sfile_data schema. The SF_OPTION table is created during the pgpro_sfile installation. SF_DESCRIPTOR, SF_PARTITION, and SF_BLOCK tables are created during the pgpro_sfile initialization, and SF_PAGE_XX tables are created when an sfile object is written to the database and are tracked in the SF_PARTITION registry.

Next sections describe the structure of the storage tables.

F.51.1.1. SF_OPTION #

The SF_OPTION table has the following fields:

opt_typesmallintOption type: GLOBAL = 0, TABLE = 1, or OBJECT = 2. Currently GLOBAL is only supported.
opt_namenameUnique option name.
opt_valuenameOption value.

F.51.1.2. SF_DESCRIPTOR #

The SF_DESCRIPTOR table has the following fields:

sf_idbigintUnique sfile object identifier.
sf_namenamesfile object name.
sf_persistencesmallintStorage type of the sfile object: RELPERSISTENCE_PERMANENT="p" or RELPERSISTENCE_UNLOGGED="u".
sf_statesmallintSet of bits that describe the sfile object state. For example: SF_STATE_DELETED=1.
sys_creation_datetimestampDate and time of the sfile object creation.
sys_update_datetimestampDate and time of the last update of the sfile object.
sf_json_optstextOptions of the sfile object in the text or JSON format.
sf_typetextManually set sfile object type.
tbs_identitytextName of the tablespace where the sfile object will be stored.

F.51.1.3. SF_PARTITION #

The SF_PARTITION table has the following fields:

part_idintUnique partition identifier.
part_data_sizebigintNumber of blocks that are currently stored in the partition.
part_persistencesmallintData storage type in the partition: RELPERSISTENCE_PERMANENT="p" or RELPERSISTENCE_UNLOGGED="u".
rel_identitytextName of the table that stores the partition data.
tbs_identitytextName of the tablespace that stores the partition data.

F.51.1.4. SF_BLOCK #

The SF_BLOCK table has the following fields:

sf_idbigintUnique sfile object identifier.
block_idbigintUnique identifier of the block in the sfile object. The value of the a_sf_index argument of sf_write is written here.
part_idintIdentifier of the partition where the block data is stored.
block_sizeintBlock size, in bytes.
block_start_pageintSequential number of the block first page, the block numbering starts at 1.
block_end_pageintSequential number of the block last page, the block numbering starts at 1.
sys_creation_datetimestampDate and time of the block creation .
sys_update_datetimestampDate and time of the block last update.

F.51.1.5. SF_PAGE_XX #

Each SF_PAGE_XX table has the following fields:

sf_idbigintUnique sfile object identifier.
block_idbigintUnique identifier of the block that includes the page.
page_nointSequential number of the page in the block.
databyteaThe buffer to store the sfile object data. The maximum data size to store in the data buffer is 8096 bytes.

F.51.2. Installation #

The pgpro_sfile extension is included in Postgres Pro Enterprise. To enable pgpro_sfile, create the extension using the following query:

CREATE EXTENSION pgpro_sfile;

F.51.3. Functions #

The pgpro_sfile functions allow performing various operations with sfile objects, such as creating, writing, reading, deleting and more. To access IDs of sfile objects, these functions use the sfile data type. To store object IDs, a new column of the sfile data type must be added to the table.

sf_initialize() returns void #

Initializes the storage of sfile objects. The function creates the pgpro_sfile_data schema and then creates the needed tables and sequences in this schema. Call this function right after installation of the extension.

sf_deinitialize() returns void #

Performs cascading delete of the pgpro_sfile_data schema. Call this function before deleting the extension to delete all the sfile object data. Superuser privileges are required to call this function.

sf_create(a_sf_nametext, a_sf_persistence text, a_sf_json_options text[, a_sf_tablespace text]) returns sfile #

Creates a new sfile object with the name a_sf_name and returns the ID of the object. The object will be logged or unlogged depending on the value of a_sf_persistence, which must be LOGGED or UNLOGGED, and the a_sf_json_options argument must contain JSON with options. The a_sf_tablespace argument allows you to specify the tablespace where the sfile object will be stored, otherwise, the default tablespace is used.

This function adds a descriptor of the sfile object with the zero size to the SF_DESCRIPTOR relation. So a call to sf_is_empty on a newly created object returns TRUE, as well as sf_is_valid. No blocks or partitions are created by sf_create.

sf_create_empty([a_sf_tablespace text]) returns sfile #

Creates a new empty sfile object with an autogenerated name (sf_gen_XX) and empty options and returns the ID of the object. The a_sf_tablespace argument allows you to specify the tablespace where the sfile object will be stored, otherwise, the default tablespace is used.

sf_write(a_sf sfile, a_sf_data bytea[, a_sf_index bigint]) returns integer #

Inserts the new data block a_sf_data with the index specified by the a_sf_index argument into the sfile object specified by its ID a_sf and returns the number of bytes added. The current timestamp is used as the default index.

This function first checks whether the sfile object was already written in order to figure out whether the partition table SF_PAGE_XX is available and has enough free space to write the block. If the partition is not available or it is the first write, all other partitions are checked and the first available is locked for writing. If no partition is available, a new one is created and added to the SF_PARTITION registry. Then a new block entry is initialized in SF_BLOCK, the block is divided into pages of ~8kB size, and these pages are written to the previously locked SF_PAGE_XX partition. When the block is written, the size of the data is adjusted in the part_data_size field of the SF_PARTITION table.

sf_read(a_sf sfile[, a_offset bigint, a_length integer]) returns bytea #

Reads the number of bytes specified by the a_length from the sfile object specified by its ID a_sf at the offset specified by the a_offset argument. Up to ~1 GB can be read due to the varlena limitation. The default is ~1 GB. The default offset is 0. Returns the buffer with the data read.

sf_size(a_sf sfile) returns bigint #

Retrieves the sfile object size.

sf_truncate(a_sf sfile) returns bigint #

Truncates the sfile object, that is, deletes all data (blocks) but leaves the object intact, and returns the new sfile object size, that is, 0. So a subsequent call to sf_is_valid for this object continues to return TRUE, and a call to sf_is_empty also returns TRUE.

sf_delete(a_sf sfile) returns bigint #

Deletes the sfile object and returns the amount of data deleted. The function deletes all the blocks of data and then removes the object descriptor from SF_DESCRIPTOR. Since then, the object is not available, so a call to sf_is_valid for this object returns FALSE.

sf_describe(a_sf sfile) returns cstring #

Retrieves the sfile object descriptor data and returns it as text.

sf_get_json_options(a_sf sfile) returns cstring #

Retrieves JSON with options provided when the sfile object was created (see sf_create).

sf_find(a_sf_name text) returns sfile #

Searches the sfile object by the specified name and returns the sfile object ID.

sf_set_option(a_opt_name text, [a_opt_value text, a_opt_type text]) returns void #

Sets the option (pgpro_sfile setting) specified by the a_opt_name argument in SF_OPTION. The only argument values that are currently supported are TABLESPACE for a_opt_name and GLOBAL for a_opt_type. The default for a_opt_value is NULL, and the default for a_opt_type is GLOBAL. The pgpro_sfile behavior with the TABLESPACE option, is to check whether the specified tablespace exists and if it does not, raise an error.

sf_get_option(a_opt_name text) returns cstring #

Retrieves the option (pgpro_sfile setting) with the specified name from SF_OPTION.

sf_delete_option(a_opt_name text) returns void #

Deletes the option (pgpro_sfile setting) with the specified name from SF_OPTION.

sf_is_valid(a_sf sfile) returns bool #

Returns TRUE if the argument specifies a valid sfile object and FALSE otherwise.

sf_is_empty(a_sf sfile) returns bool #

Returns TRUE if the argument specifies a valid and empty sfile object and FALSE otherwise.

sf_is_logged(a_sf sfile) returns bool #

Returns TRUE if the argument specifies a logged (permanent) sfile object and FALSE otherwise.

sf_trim(a_sf sfile, a_length bigint) returns bigint #

Trims the sfile object specified by the a_sf argument to the size specified by the a_length argument. The trimmed data is deleted. Returns the new object size. If the sfile object size is less than a_length, returns the actual object size.

sf_set_type(a_sf sfile, a_type text) returns void #

Sets the custom type specified in the a_type argument to the sfile object specified by a_sf.

sf_get_type(a_sf sfile) returns cstring #

Retrieves the custom object type of the specified sfile object as text.

sf_md5(a_sf sfile) returns text #

Calculates MD5 hash for the specified sfile object.

F.51.4. Parallel Processing #

To speed up interactions with large objects, read functions sf_read/sf_size and write function the sf_write can be performed in parallel. These functions take an AccessShareLock lock for the sfile object, which does not forbid reading the object. Each of sf_read/sf_size functions releases this lock right after execution of the operation, while sf_write holds the lock up to the end of the transaction to prevent deletion/trunction/trimming of the sfile object before the end of the transaction. In more detail, the changes made by sf_write to the SF_BLOCK and SF_PAGE_XX tables are not visible to a parallel session, and if the AccessShareLock lock is removed before the end of the transaction, the sfile object can be deleted by another session, which is unaware of the rows added to SF_BLOCK and SF_PAGE_XX. As a result, the sfile object will be deleted while these rows will remain intact.

While writing, blocks of one object can be written from several sessions, and the block write sequence depends on the time of the sf_write call. To control the block write sequence, pass the a_sf_index argument. If it is not passed, the call timestamp is used instead, which is the default. Each call to sf_write takes an AccessExclusiveLock for partitions where blocks are written, which means that it is not permitted to write blocks in the same partition from different sessions as it is not possible to simultaneously modify the part_data_size field of the SF_PARTITION table. However, writing sfile object blocks from several sessions into different partitions is permitted.

sf_trim, sf_truncate and sf_delete functions cannot be performed in parallel as they take an AccessExclusiveLock for the sfile object.

F.51.5. Example #

This example illustrates usage of pgpro_sfile:

-- Create pgpro_sfile extension
CREATE EXTENSION pgpro_sfile;

-- Initialize sfile object storage
SELECT sf_initialize();

-- Create test table
CREATE TABLE test_sfile (id int, l sfile);

-- Create and store sfile object
INSERT INTO test_sfile VALUES (1, sf_create('sf', 'LOGGED', NULL));

-- Write data to sfile object and check data written
SELECT sf_write(t.l, '1234567890'::bytea) FROM test_sfile t WHERE id = 1;
SELECT encode(b, 'escape'), length(b) FROM (SELECT sf_read(t.l, 0, NULL) b FROM test_sfile t WHERE id = 1) x;

-- Trim sfile object and check data
SELECT sf_trim(t.l, 5) FROM test_sfile t WHERE id = 1;
SELECT encode(b, 'escape'), length(b) FROM (SELECT sf_read(t.l, 0, NULL) b FROM test_sfile t WHERE id = 1) x;

-- Remove table
DROP TABLE test_sfile;

-- Delete data of all sfile objects
SET client_min_messages = WARNING;
SELECT sf_deinitialize();
RESET client_min_messages;

-- Delete pgpro_sfile extension
DROP EXTENSION pgpro_sfile;

This example produces the following output:

CREATE EXTENSION
 sf_initialize
---------------

(1 row)


CREATE TABLE
INSERT 0 1
 sf_write
----------
       10
(1 row)


   encode   | length
------------+--------
 1234567890 |     10
(1 row)


 sf_trim
---------
       5
(1 row)


 encode | length
--------+--------
 12345  |      5
(1 row)


DROP TABLE
SET
 sf_deinitialize
-----------------

(1 row)

F.51.6. Authors #

Postgres Professional, Moscow, Russia