F.54. pgpro_sfile — storage for large objects #
The pgpro_sfile module allows storing multiple large objects. It is similar to Oracle LOBs. pgpro_sfile provides an engine for storing large objects in a set of tables controlled by the extension. Each object consists of chunks, or blocks. Blocks are split into pages of ~8kB size. The maximum number of objects, number of blocks and object size in bytes are limited by the size of the bigint type (2^63 - 1). A pgpro_sfile large object is called sfile object.
F.54.1. Storage Engine #
sfile objects are stored in tables created and controlled by pgpro_sfile. Regular tables only store the identifier of each sfile object as a reference to tables created by pgpro_sfile. Once an sfile object is created, it can be loaded, read, truncated and deleted, and these operations do not affect regular tables where it is used.
The storage engine is a set of the following storage tables:
SF_DESCRIPTOR— a registry where descriptors of allsfileobjects are stored.SF_PARTITION— a registry of storage tables (partitions). Stores information aboutSF_PAGE_XXtables.SF_BLOCK— a registry ofsfileobject blocks. Each block can include several pages ofsfileobjects.SF_PAGE_XX— tables withsfileobject data pages.SF_OPTION— a table withsfileobject options. Currently there is one optionTABLESPACE, which allows setting the tablespace wheresfileobjects are stored.
All the tables, except SF_OPTION, are created in a special pgpro_sfile_data schema. The SF_OPTION table is created during the pgpro_sfile installation. SF_DESCRIPTOR, SF_PARTITION, and SF_BLOCK tables are created during the pgpro_sfile initialization, and SF_PAGE_XX tables are created when an sfile object is written to the database and are tracked in the SF_PARTITION registry.
Next sections describe the structure of the storage tables.
F.54.1.1. SF_OPTION #
The SF_OPTION table has the following fields:
opt_type | smallint | Option type: GLOBAL = 0, TABLE = 1, or OBJECT = 2. Currently GLOBAL is only supported. |
opt_name | name | Unique option name. |
opt_value | name | Option value. |
F.54.1.2. SF_DESCRIPTOR #
The SF_DESCRIPTOR table has the following fields:
sf_id | bigint | Unique sfile object identifier. |
sf_name | name | sfile object name. |
sf_persistence | smallint | Storage type of the sfile object: RELPERSISTENCE_PERMANENT="p" or RELPERSISTENCE_UNLOGGED="u". |
sf_state | smallint | Set of bits that describe the sfile object state. For example: SF_STATE_DELETED=1. |
sys_creation_date | timestamp | Date and time of the sfile object creation. |
sys_update_date | timestamp | Date and time of the last update of the sfile object. |
sf_json_opts | text | Options of the sfile object in the text or JSON format. |
sf_type | text | Manually set sfile object type. |
tbs_identity | text | Name of the tablespace where the sfile object will be stored. |
F.54.1.3. SF_PARTITION #
The SF_PARTITION table has the following fields:
part_id | int | Unique partition identifier. |
part_data_size | bigint | Number of blocks that are currently stored in the partition. |
part_persistence | smallint | Data storage type in the partition: RELPERSISTENCE_PERMANENT="p" or RELPERSISTENCE_UNLOGGED="u". |
rel_identity | text | Name of the table that stores the partition data. |
tbs_identity | text | Name of the tablespace that stores the partition data. |
F.54.1.4. SF_BLOCK #
The SF_BLOCK table has the following fields:
sf_id | bigint | Unique sfile object identifier. |
block_id | bigint | Unique identifier of the block in the sfile object. The value of the a_sf_index argument of sf_write is written here. |
part_id | int | Identifier of the partition where the block data is stored. |
block_size | int | Block size, in bytes. |
block_start_page | int | Sequential number of the block first page, the block numbering starts at 1. |
block_end_page | int | Sequential number of the block last page, the block numbering starts at 1. |
sys_creation_date | timestamp | Date and time of the block creation . |
sys_update_date | timestamp | Date and time of the block last update. |
F.54.1.5. SF_PAGE_XX #
Each SF_PAGE_XX table has the following fields:
sf_id | bigint | Unique sfile object identifier. |
block_id | bigint | Unique identifier of the block that includes the page. |
page_no | int | Sequential number of the page in the block. |
data | bytea | The buffer to store the sfile object data. The maximum data size to store in the data buffer is 8096 bytes. |
F.54.2. Installation #
The pgpro_sfile extension is included in Postgres Pro Enterprise. To enable pgpro_sfile, create the extension using the following query:
CREATE EXTENSION pgpro_sfile;
F.54.3. Functions #
The pgpro_sfile functions allow performing various operations with sfile objects, such as creating, writing, reading, deleting and more. To access IDs of sfile objects, these functions use the sfile data type. To store object IDs, a new column of the sfile data type must be added to the table.
-
sf_initialize()returnsvoid# Initializes the storage of
sfileobjects. The function creates thepgpro_sfile_dataschema and then creates the needed tables and sequences in this schema. Call this function right after installation of the extension.-
sf_deinitialize()returnsvoid# Performs cascading delete of the
pgpro_sfile_dataschema. Call this function before deleting the extension to delete all thesfileobject data. Superuser privileges are required to call this function.-
sf_create(returnsa_sf_nametext,a_sf_persistencetext,a_sf_json_optionstext[,a_sf_tablespacetext])sfile# Creates a new
sfileobject with the namea_sf_nameand returns the ID of the object. The object will be logged or unlogged depending on the value ofa_sf_persistence, which must beLOGGEDorUNLOGGED, and thea_sf_json_optionsargument must contain JSON with options. Thea_sf_tablespaceargument allows you to specify the tablespace where thesfileobject will be stored, otherwise, the default tablespace is used.This function adds a descriptor of the
sfileobject with the zero size to the SF_DESCRIPTOR relation. So a call to sf_is_empty on a newly created object returnsTRUE, as well as sf_is_valid. No blocks or partitions are created bysf_create.-
sf_create_empty([#a_sf_tablespacetext]) returnssfile Creates a new empty
sfileobject with an autogenerated name (sf_gen_XX) and empty options and returns the ID of the object. Thea_sf_tablespaceargument allows you to specify the tablespace where thesfileobject will be stored, otherwise, the default tablespace is used.-
sf_write(returnsa_sfsfile,a_sf_databytea[,a_sf_indexbigint])integer# Inserts the new data block
a_sf_datawith the index specified by thea_sf_indexargument into thesfileobject specified by its IDa_sfand returns the number of bytes added. The current timestamp is used as the default index. Passinga_sf_index<= 0 results in an error.This function first checks whether the
sfileobject was already written in order to figure out whether the partition table SF_PAGE_XX is available and has enough free space to write the block. If the partition is not available or it is the first write, all other partitions are checked and the first available is locked for writing. If no partition is available, a new one is created and added to the SF_PARTITION registry. Then a new block entry is initialized in SF_BLOCK, the block is divided into pages of ~8kB size, and these pages are written to the previously lockedSF_PAGE_XXpartition. When the block is written, the size of the data is adjusted in thepart_data_sizefield of theSF_PARTITIONtable.-
sf_read(returnsa_sfsfile[,a_offsetbigint,a_lengthinteger])bytea# Reads the number of bytes specified by the
a_lengthfrom thesfileobject specified by its IDa_sfat the offset specified by thea_offsetargument. Up to ~1 GB can be read due to thevarlenalimitation. The default is ~1 GB. The default offset is 0. Returns the buffer with the data read.-
sf_size(returnsa_sfsfile)bigint# Retrieves the
sfileobject size.-
sf_truncate(returnsa_sfsfile)bigint# Truncates the
sfileobject, that is, deletes all data (blocks) but leaves the object intact, and returns the newsfileobject size, that is, 0. So a subsequent call to sf_is_valid for this object continues to returnTRUE, and a call to sf_is_empty also returnsTRUE.-
sf_delete(returnsa_sfsfile)bigint# Deletes the
sfileobject and returns the amount of data deleted. The function deletes all the blocks of data and then removes the object descriptor from SF_DESCRIPTOR. Since then, the object is not available, so a call to sf_is_valid for this object returnsFALSE.-
sf_import(returnsa_filetext,a_sf_nametext,a_sf_persistencetext,a_sf_json_optionstext[,a_sf_tablespacetext])sfile
sf_import(returnsa_filetext,a_sfsfile)sfile# The two-parameter function writes the contents of the
a_filefile at the end of the existinga_sfsfile object. The function with more parameters imports thea_filefile as a newsfileobject with the namea_sf_nameand returns the ID of the object. The object will be logged or unlogged depending on the value ofa_sf_persistence, which must beLOGGEDorUNLOGGED, and thea_sf_json_optionsargument must contain JSON with options. Thea_sf_tablespaceargument allows you to specify the tablespace where thesfileobject will be stored, otherwise, the default tablespace is used.-
sf_export(returnsa_sfsfile,a_filetext)void# Dumps the
a_sfsfileobject to a file nameda_file.-
sf_describe(returnsa_sfsfile)cstring# Retrieves the
sfileobject descriptor data and returns it as text.-
sf_get_json_options(returnsa_sfsfile)cstring# Retrieves JSON with options provided when the
sfileobject was created (see sf_create).-
sf_find(returnsa_sf_nametext)sfile
sf_find(returnsa_sf_idbigint)sfile# Searches the
sfileobject by the specified name or by the specified unique ID of thebiginttype and returns thesfileobject ID. See also Example F.8 for more information.-
sf_set_option(returnsa_opt_nametext, [a_opt_valuetext,a_opt_typetext])void# Sets the option (pgpro_sfile setting) specified by the
a_opt_nameargument in SF_OPTION. The only argument values that are currently supported areTABLESPACEfora_opt_nameandGLOBALfora_opt_type. The default fora_opt_valueis NULL, and the default fora_opt_typeisGLOBAL. The pgpro_sfile behavior with theTABLESPACEoption, is to check whether the specified tablespace exists and if it does not, raise an error.-
sf_get_option(returnsa_opt_nametext)cstring# Retrieves the option (pgpro_sfile setting) with the specified name from SF_OPTION.
-
sf_delete_option(returnsa_opt_nametext)void# Deletes the option (pgpro_sfile setting) with the specified name from SF_OPTION.
-
sf_is_valid(returnsa_sfsfile)bool# Returns
TRUEif the argument specifies a validsfileobject andFALSEotherwise.-
sf_is_empty(returnsa_sfsfile)bool# Returns
TRUEif the argument specifies a valid and emptysfileobject andFALSEotherwise.-
sf_is_logged(returnsa_sfsfile)bool# Returns
TRUEif the argument specifies a logged (permanent)sfileobject andFALSEotherwise.-
sf_trim(returnsa_sfsfile,a_lengthbigint)bigint# Trims the
sfileobject specified by thea_sfargument to the size specified by thea_lengthargument. The trimmed data is deleted. Returns the new object size. If thesfileobject size is less thana_length, returns the actual object size.-
sf_set_type(returnsa_sfsfile,a_typetext)void# Sets the custom type specified in the
a_typeargument to thesfileobject specified bya_sf.-
sf_get_type(returnsa_sfsfile)cstring# Retrieves the custom object type of the specified
sfileobject as text.-
sf_md5(returnsa_sfsfile)text# Calculates MD5 hash for the specified
sfileobject.
F.54.4. Client Functions #
In addition to server functions for export and import of sfile objects, pgpro_sfile provides client functions for that. The differences between server-side and client-side functions for export and import are the same as described for large objects (see Section 36.4). The client functions are implemented as the following psql meta-commands:
\sf_export#sf_idfilenameReads the
sfileobject specified bysf_idfrom the database, writes it tofilename, and returns the size of the written data, in bytes.Note that this is subtly different from the server function
sf_export, which acts with the permissions of the user that the database server runs as and on the server's file system.\sf_import#filenamesf_idStores the
filenamefile into thesfileobject specified bysf_idand returns the size of the stored data, in bytes.Note that this command is subtly different from the server-side
sf_importbecause it acts as the local user on the local file system, rather than the server's user and file system.\sf_delete#sf_idDeletes the
sfileobject specified bysf_idfrom the database and returns the size of the deleted object.
F.54.5. Parallel Processing #
To speed up interactions with large objects, read functions sf_read/sf_size and write function the sf_write can be performed in parallel. These functions take an AccessShareLock lock for the sfile object, which does not forbid reading the object. Each of sf_read/sf_size functions releases this lock right after execution of the operation, while sf_write holds the lock up to the end of the transaction to prevent deletion/trunction/trimming of the sfile object before the end of the transaction. In more detail, the changes made by sf_write to the SF_BLOCK and SF_PAGE_XX tables are not visible to a parallel session, and if the AccessShareLock lock is removed before the end of the transaction, the sfile object can be deleted by another session, which is unaware of the rows added to SF_BLOCK and SF_PAGE_XX. As a result, the sfile object will be deleted while these rows will remain intact.
While writing, blocks of one object can be written from several sessions, and the block write sequence depends on the time of the sf_write call. To control the block write sequence, pass the a_sf_index argument. If it is not passed, the call timestamp is used instead, which is the default. Each call to sf_write takes an AccessExclusiveLock for partitions where blocks are written, which means that it is not permitted to write blocks in the same partition from different sessions as it is not possible to simultaneously modify the part_data_size field of the SF_PARTITION table. However, writing sfile object blocks from several sessions into different partitions is permitted.
sf_trim, sf_truncate and sf_delete functions cannot be performed in parallel as they take an AccessExclusiveLock for the sfile object.
F.54.6. Examples #
Example F.7. Basic Usage of pgpro_sfile
-- Create pgpro_sfile extension
CREATE EXTENSION pgpro_sfile;
-- Initialize sfile object storage
SELECT sf_initialize();
-- Create test table
CREATE TABLE test_sfile (id int, l sfile);
-- Create and store sfile object
INSERT INTO test_sfile VALUES (1, sf_create('sf', 'LOGGED', NULL));
-- Write data to sfile object and check data written
SELECT sf_write(t.l, '1234567890'::bytea) FROM test_sfile t WHERE id = 1;
SELECT encode(b, 'escape'), length(b) FROM (SELECT sf_read(t.l, 0, NULL) b FROM test_sfile t WHERE id = 1) x;
-- Trim sfile object and check data
SELECT sf_trim(t.l, 5) FROM test_sfile t WHERE id = 1;
SELECT encode(b, 'escape'), length(b) FROM (SELECT sf_read(t.l, 0, NULL) b FROM test_sfile t WHERE id = 1) x;
-- Remove table
DROP TABLE test_sfile;
-- Delete data of all sfile objects
SET client_min_messages = WARNING;
SELECT sf_deinitialize();
RESET client_min_messages;
-- Delete pgpro_sfile extension
DROP EXTENSION pgpro_sfile;
This example produces the following output:
CREATE EXTENSION
sf_initialize
---------------
(1 row)
CREATE TABLE
INSERT 0 1
sf_write
----------
10
(1 row)
encode | length
------------+--------
1234567890 | 10
(1 row)
sf_trim
---------
5
(1 row)
encode | length
--------+--------
12345 | 5
(1 row)
DROP TABLE
SET
sf_deinitialize
-----------------
(1 row)
Example F.8. Usage of Type Cast Instead of sf_find
-- Create pgpro_sfile extension
CREATE EXTENSION pgpro_sfile;
-- Initialize sfile object storage
SELECT sf_initialize();
-- Create sfile object
SELECT sf_create('test', 'LOGGED', NULL);
-- Write to sfile using sfile name
SELECT sf_write(sf_find('test'), '1234567890'::bytea);
-- Write to sfile using sfile identifier
SELECT sf_write(1::bigint::sfile, '1234567890'::bytea);
This example produces the following output:
CREATE EXTENSION
sf_initialize
---------------
(1 row)
sf_create
-----------
1
(1 row)
sf_write
----------
10
(1 row)
sf_write
----------
10
(1 row)
F.54.7. Authors #
Postgres Professional, Moscow, Russia