F.54. pgpro_sfile — storage for large objects #
 The pgpro_sfile module allows storing multiple large objects. It is similar to Oracle LOBs. pgpro_sfile provides an engine for storing large objects in a set of tables controlled by the extension. Each object consists of chunks, or blocks. Blocks are split into pages of ~8kB size. The maximum number of objects, number of blocks and object size in bytes are limited by the size of the bigint type (2^63 - 1). A pgpro_sfile large object is called sfile object. 
F.54.1. Storage Engine #
sfile objects are stored in tables created and controlled by pgpro_sfile. Regular tables only store the identifier of each sfile object as a reference to tables created by pgpro_sfile. Once an sfile object is created, it can be loaded, read, truncated and deleted, and these operations do not affect regular tables where it is used. 
The storage engine is a set of the following storage tables:
- SF_DESCRIPTOR— a registry where descriptors of all- sfileobjects are stored.
- SF_PARTITION— a registry of storage tables (partitions). Stores information about- SF_PAGE_XXtables.
- SF_BLOCK— a registry of- sfileobject blocks. Each block can include several pages of- sfileobjects.
- SF_PAGE_XX— tables with- sfileobject data pages.
- SF_OPTION— a table with- sfileobject options. Currently there is one option- TABLESPACE, which allows setting the tablespace where- sfileobjects are stored.
 All the tables, except SF_OPTION, are created in a special pgpro_sfile_data schema. The SF_OPTION table is created during the pgpro_sfile installation. SF_DESCRIPTOR, SF_PARTITION, and SF_BLOCK tables are created during the pgpro_sfile initialization, and SF_PAGE_XX tables are created when an sfile object is written to the database and are tracked in the SF_PARTITION registry. 
Next sections describe the structure of the storage tables.
F.54.1.1. SF_OPTION #
 The SF_OPTION table has the following fields: 
| opt_type | smallint | Option type: GLOBAL= 0,TABLE= 1, orOBJECT= 2. CurrentlyGLOBALis only supported. | 
| opt_name | name | Unique option name. | 
| opt_value | name | Option value. | 
F.54.1.2. SF_DESCRIPTOR #
 The SF_DESCRIPTOR table has the following fields: 
| sf_id | bigint | Unique sfileobject identifier. | 
| sf_name | name | sfileobject name. | 
| sf_persistence | smallint | Storage type of the sfileobject:RELPERSISTENCE_PERMANENT="p" orRELPERSISTENCE_UNLOGGED="u". | 
| sf_state | smallint | Set of bits that describe the sfileobject state. For example:SF_STATE_DELETED=1. | 
| sys_creation_date | timestamp | Date and time of the sfileobject creation. | 
| sys_update_date | timestamp | Date and time of the last update of the sfileobject. | 
| sf_json_opts | text | Options of the sfileobject in the text or JSON format. | 
| sf_type | text | Manually set sfileobject type. | 
| tbs_identity | text | Name of the tablespace where the sfileobject will be stored. | 
F.54.1.3. SF_PARTITION #
 The SF_PARTITION table has the following fields: 
| part_id | int | Unique partition identifier. | 
| part_data_size | bigint | Number of blocks that are currently stored in the partition. | 
| part_persistence | smallint | Data storage type in the partition: RELPERSISTENCE_PERMANENT="p" orRELPERSISTENCE_UNLOGGED="u". | 
| rel_identity | text | Name of the table that stores the partition data. | 
| tbs_identity | text | Name of the tablespace that stores the partition data. | 
F.54.1.4. SF_BLOCK #
 The SF_BLOCK table has the following fields: 
| sf_id | bigint | Unique sfileobject identifier. | 
| block_id | bigint | Unique identifier of the block in the sfileobject. The value of thea_sf_indexargument of sf_write is written here. | 
| part_id | int | Identifier of the partition where the block data is stored. | 
| block_size | int | Block size, in bytes. | 
| block_start_page | int | Sequential number of the block first page, the block numbering starts at 1. | 
| block_end_page | int | Sequential number of the block last page, the block numbering starts at 1. | 
| sys_creation_date | timestamp | Date and time of the block creation . | 
| sys_update_date | timestamp | Date and time of the block last update. | 
F.54.1.5. SF_PAGE_XX #
 Each SF_PAGE_XX table has the following fields: 
| sf_id | bigint | Unique sfileobject identifier. | 
| block_id | bigint | Unique identifier of the block that includes the page. | 
| page_no | int | Sequential number of the page in the block. | 
| data | bytea | The buffer to store the sfileobject data. The maximum data size to store in thedatabuffer is 8096 bytes. | 
F.54.2. Installation #
 The pgpro_sfile extension is included in Postgres Pro Enterprise. To enable pgpro_sfile, create the extension using the following query: 
CREATE EXTENSION pgpro_sfile;
F.54.3. Functions #
 The pgpro_sfile functions allow performing various operations with sfile objects, such as creating, writing, reading, deleting and more. To access IDs of sfile objects, these functions use the sfile data type. To store object IDs, a new column of the sfile data type must be added to the table. 
-   sf_initialize()returnsvoid#
- Initializes the storage of - sfileobjects. The function creates the- pgpro_sfile_dataschema and then creates the needed tables and sequences in this schema. Call this function right after installation of the extension.
-   sf_deinitialize()returnsvoid#
- Performs cascading delete of the - pgpro_sfile_dataschema. Call this function before deleting the extension to delete all the- sfileobject data. Superuser privileges are required to call this function.
-   sf_create(returnsa_sf_nametext,a_sf_persistencetext,a_sf_json_optionstext[,a_sf_tablespacetext])sfile#
- Creates a new - sfileobject with the name- a_sf_nameand returns the ID of the object. The object will be logged or unlogged depending on the value of- a_sf_persistence, which must be- LOGGEDor- UNLOGGED, and the- a_sf_json_optionsargument must contain JSON with options. The- a_sf_tablespaceargument allows you to specify the tablespace where the- sfileobject will be stored, otherwise, the default tablespace is used.- This function adds a descriptor of the - sfileobject with the zero size to the SF_DESCRIPTOR relation. So a call to sf_is_empty on a newly created object returns- TRUE, as well as sf_is_valid. No blocks or partitions are created by- sf_create.
-   sf_create_empty([#a_sf_tablespacetext]) returnssfile
- Creates a new empty - sfileobject with an autogenerated name (- sf_gen_XX) and empty options and returns the ID of the object. The- a_sf_tablespaceargument allows you to specify the tablespace where the- sfileobject will be stored, otherwise, the default tablespace is used.
-   sf_write(returnsa_sfsfile,a_sf_databytea[,a_sf_indexbigint])integer#
- Inserts the new data block - a_sf_datawith the index specified by the- a_sf_indexargument into the- sfileobject specified by its ID- a_sfand returns the number of bytes added. The current timestamp is used as the default index. Passing- a_sf_index<= 0 results in an error.- This function first checks whether the - sfileobject was already written in order to figure out whether the partition table SF_PAGE_XX is available and has enough free space to write the block. If the partition is not available or it is the first write, all other partitions are checked and the first available is locked for writing. If no partition is available, a new one is created and added to the SF_PARTITION registry. Then a new block entry is initialized in SF_BLOCK, the block is divided into pages of ~8kB size, and these pages are written to the previously locked- SF_PAGE_XXpartition. When the block is written, the size of the data is adjusted in the- part_data_sizefield of the- SF_PARTITIONtable.
-   sf_read(returnsa_sfsfile[,a_offsetbigint,a_lengthinteger])bytea#
- Reads the number of bytes specified by the - a_lengthfrom the- sfileobject specified by its ID- a_sfat the offset specified by the- a_offsetargument. Up to ~1 GB can be read due to the- varlenalimitation. The default is ~1 GB. The default offset is 0. Returns the buffer with the data read.
-   sf_size(returnsa_sfsfile)bigint#
- Retrieves the - sfileobject size.
-   sf_truncate(returnsa_sfsfile)bigint#
- Truncates the - sfileobject, that is, deletes all data (blocks) but leaves the object intact, and returns the new- sfileobject size, that is, 0. So a subsequent call to sf_is_valid for this object continues to return- TRUE, and a call to sf_is_empty also returns- TRUE.
-   sf_delete(returnsa_sfsfile)bigint#
- Deletes the - sfileobject and returns the amount of data deleted. The function deletes all the blocks of data and then removes the object descriptor from SF_DESCRIPTOR. Since then, the object is not available, so a call to sf_is_valid for this object returns- FALSE.
-   sf_import(returnsa_filetext,a_sf_nametext,a_sf_persistencetext,a_sf_json_optionstext[,a_sf_tablespacetext])sfile
 sf_import(returnsa_filetext,a_sfsfile)sfile#
- The two-parameter function writes the contents of the - a_filefile at the end of the existing- a_sfsfile object. The function with more parameters imports the- a_filefile as a new- sfileobject with the name- a_sf_nameand returns the ID of the object. The object will be logged or unlogged depending on the value of- a_sf_persistence, which must be- LOGGEDor- UNLOGGED, and the- a_sf_json_optionsargument must contain JSON with options. The- a_sf_tablespaceargument allows you to specify the tablespace where the- sfileobject will be stored, otherwise, the default tablespace is used.
-   sf_export(returnsa_sfsfile,a_filetext)void#
- Dumps the - a_sf- sfileobject to a file named- a_file.
-   sf_describe(returnsa_sfsfile)cstring#
- Retrieves the - sfileobject descriptor data and returns it as text.
-   sf_get_json_options(returnsa_sfsfile)cstring#
- Retrieves JSON with options provided when the - sfileobject was created (see sf_create).
-   sf_find(returnsa_sf_nametext)sfile
 sf_find(returnsa_sf_idbigint)sfile#
- Searches the - sfileobject by the specified name or by the specified unique ID of the- biginttype and returns the- sfileobject ID. See also Example F.8 for more information.
-   sf_set_option(returnsa_opt_nametext, [a_opt_valuetext,a_opt_typetext])void#
- Sets the option (pgpro_sfile setting) specified by the - a_opt_nameargument in SF_OPTION. The only argument values that are currently supported are- TABLESPACEfor- a_opt_nameand- GLOBALfor- a_opt_type. The default for- a_opt_valueis NULL, and the default for- a_opt_typeis- GLOBAL. The pgpro_sfile behavior with the- TABLESPACEoption, is to check whether the specified tablespace exists and if it does not, raise an error.
-   sf_get_option(returnsa_opt_nametext)cstring#
- Retrieves the option (pgpro_sfile setting) with the specified name from SF_OPTION. 
-   sf_delete_option(returnsa_opt_nametext)void#
- Deletes the option (pgpro_sfile setting) with the specified name from SF_OPTION. 
-   sf_is_valid(returnsa_sfsfile)bool#
- Returns - TRUEif the argument specifies a valid- sfileobject and- FALSEotherwise.
-   sf_is_empty(returnsa_sfsfile)bool#
- Returns - TRUEif the argument specifies a valid and empty- sfileobject and- FALSEotherwise.
-   sf_is_logged(returnsa_sfsfile)bool#
- Returns - TRUEif the argument specifies a logged (permanent)- sfileobject and- FALSEotherwise.
-   sf_trim(returnsa_sfsfile,a_lengthbigint)bigint#
- Trims the - sfileobject specified by the- a_sfargument to the size specified by the- a_lengthargument. The trimmed data is deleted. Returns the new object size. If the- sfileobject size is less than- a_length, returns the actual object size.
-   sf_set_type(returnsa_sfsfile,a_typetext)void#
- Sets the custom type specified in the - a_typeargument to the- sfileobject specified by- a_sf.
-   sf_get_type(returnsa_sfsfile)cstring#
- Retrieves the custom object type of the specified - sfileobject as text.
-   sf_md5(returnsa_sfsfile)text#
- Calculates MD5 hash for the specified - sfileobject.
F.54.4. Client Functions #
 In addition to server functions for export and import of sfile objects, pgpro_sfile provides client functions for that. The differences between server-side and client-side functions for export and import are the same as described for large objects (see Section 36.4). The client functions are implemented as the following psql meta-commands: 
- \sf_export#- sf_id- filename
- Reads the - sfileobject specified by- sf_idfrom the database, writes it to- filename, and returns the size of the written data, in bytes.- Note that this is subtly different from the server function - sf_export, which acts with the permissions of the user that the database server runs as and on the server's file system.
- \sf_import#- filename- sf_id
- Stores the - filenamefile into the- sfileobject specified by- sf_idand returns the size of the stored data, in bytes.- Note that this command is subtly different from the server-side - sf_importbecause it acts as the local user on the local file system, rather than the server's user and file system.
- \sf_delete#- sf_id
- Deletes the - sfileobject specified by- sf_idfrom the database and returns the size of the deleted object.
F.54.5. Parallel Processing #
 To speed up interactions with large objects, read functions sf_read/sf_size and write function the sf_write can be performed in parallel. These functions take an AccessShareLock lock for the sfile object, which does not forbid reading the object. Each of sf_read/sf_size functions releases this lock right after execution of the operation, while sf_write holds the lock up to the end of the transaction to prevent deletion/trunction/trimming of the sfile object before the end of the transaction. In more detail, the changes made by sf_write to the SF_BLOCK and SF_PAGE_XX tables are not visible to a parallel session, and if the AccessShareLock lock is removed before the end of the transaction, the sfile object can be deleted by another session, which is unaware of the rows added to SF_BLOCK and SF_PAGE_XX. As a result, the sfile object will be deleted while these rows will remain intact. 
 While writing, blocks of one object can be written from several sessions, and the block write sequence depends on the time of the sf_write call. To control the block write sequence, pass the a_sf_index argument. If it is not passed, the call timestamp is used instead, which is the default. Each call to sf_write takes an AccessExclusiveLock for partitions where blocks are written, which means that it is not permitted to write blocks in the same partition from different sessions as it is not possible to simultaneously modify the part_data_size field of the SF_PARTITION table. However, writing sfile object blocks from several sessions into different partitions is permitted. 
sf_trim, sf_truncate and sf_delete functions cannot be performed in parallel as they take an AccessExclusiveLock for the sfile object. 
F.54.6. Examples #
Example F.7. Basic Usage of pgpro_sfile
-- Create pgpro_sfile extension
CREATE EXTENSION pgpro_sfile;
-- Initialize sfile object storage
SELECT sf_initialize();
-- Create test table
CREATE TABLE test_sfile (id int, l sfile);
-- Create and store sfile object
INSERT INTO test_sfile VALUES (1, sf_create('sf', 'LOGGED', NULL));
-- Write data to sfile object and check data written
SELECT sf_write(t.l, '1234567890'::bytea) FROM test_sfile t WHERE id = 1;
SELECT encode(b, 'escape'), length(b) FROM (SELECT sf_read(t.l, 0, NULL) b FROM test_sfile t WHERE id = 1) x;
-- Trim sfile object and check data
SELECT sf_trim(t.l, 5) FROM test_sfile t WHERE id = 1;
SELECT encode(b, 'escape'), length(b) FROM (SELECT sf_read(t.l, 0, NULL) b FROM test_sfile t WHERE id = 1) x;
-- Remove table
DROP TABLE test_sfile;
-- Delete data of all sfile objects
SET client_min_messages = WARNING;
SELECT sf_deinitialize();
RESET client_min_messages;
-- Delete pgpro_sfile extension
DROP EXTENSION pgpro_sfile;
This example produces the following output:
CREATE EXTENSION
 sf_initialize
---------------
(1 row)
CREATE TABLE
INSERT 0 1
 sf_write
----------
       10
(1 row)
   encode   | length
------------+--------
 1234567890 |     10
(1 row)
 sf_trim
---------
       5
(1 row)
 encode | length
--------+--------
 12345  |      5
(1 row)
DROP TABLE
SET
 sf_deinitialize
-----------------
(1 row)
Example F.8. Usage of Type Cast Instead of sf_find
-- Create pgpro_sfile extension
CREATE EXTENSION pgpro_sfile;
-- Initialize sfile object storage
SELECT sf_initialize();
-- Create sfile object
SELECT sf_create('test', 'LOGGED', NULL);
-- Write to sfile using sfile name
SELECT sf_write(sf_find('test'), '1234567890'::bytea);
-- Write to sfile using sfile identifier
SELECT sf_write(1::bigint::sfile, '1234567890'::bytea);
This example produces the following output:
CREATE EXTENSION
 sf_initialize
---------------
(1 row)
 sf_create
-----------
 1
(1 row)
 sf_write
----------
       10
(1 row)
 sf_write
----------
       10
(1 row)
F.54.7. Authors #
Postgres Professional, Moscow, Russia