F.51. pgpro_sfile — storage for large objects #
The pgpro_sfile
module allows storing multiple large objects. It is similar to Oracle LOBs. pgpro_sfile provides an engine for storing large objects in a set of tables controlled by the extension. Each object consists of chunks, or blocks. Blocks are split into pages of ~8kB size. The maximum number of objects, number of blocks and object size in bytes are limited by the size of the bigint
type (2^63 - 1). A pgpro_sfile large object is called sfile
object.
F.51.1. Storage Engine #
sfile
objects are stored in tables created and controlled by pgpro_sfile. Regular tables only store the identifier of each sfile
object as a reference to tables created by pgpro_sfile. Once an sfile
object is created, it can be loaded, read, truncated and deleted, and these operations do not affect regular tables where it is used.
The storage engine is a set of the following storage tables:
SF_DESCRIPTOR
— a registry where descriptors of allsfile
objects are stored.SF_PARTITION
— a registry of storage tables (partitions). Stores information aboutSF_PAGE_XX
tables.SF_BLOCK
— a registry ofsfile
object blocks. Each block can include several pages ofsfile
objects.SF_PAGE_XX
— tables withsfile
object data pages.SF_OPTION
— a table withsfile
object options. Currently there is one optionTABLESPACE
, which allows setting the tablespace wheresfile
objects are stored.
All the tables, except SF_OPTION
, are created in a special pgpro_sfile_data
schema. The SF_OPTION
table is created during the pgpro_sfile installation. SF_DESCRIPTOR
, SF_PARTITION
, and SF_BLOCK
tables are created during the pgpro_sfile initialization, and SF_PAGE_XX
tables are created when an sfile
object is written to the database and are tracked in the SF_PARTITION
registry.
Next sections describe the structure of the storage tables.
F.51.1.1. SF_OPTION
#
The SF_OPTION
table has the following fields:
opt_type | smallint | Option type: GLOBAL = 0, TABLE = 1, or OBJECT = 2. Currently GLOBAL is only supported. |
opt_name | name | Unique option name. |
opt_value | name | Option value. |
F.51.1.2. SF_DESCRIPTOR
#
The SF_DESCRIPTOR
table has the following fields:
sf_id | bigint | Unique sfile object identifier. |
sf_name | name | sfile object name. |
sf_persistence | smallint | Storage type of the sfile object: RELPERSISTENCE_PERMANENT ="p" or RELPERSISTENCE_UNLOGGED ="u". |
sf_state | smallint | Set of bits that describe the sfile object state. For example: SF_STATE_DELETED =1. |
sys_creation_date | timestamp | Date and time of the sfile object creation. |
sys_update_date | timestamp | Date and time of the last update of the sfile object. |
sf_json_opts | text | Options of the sfile object in the text or JSON format. |
sf_type | text | Manually set sfile object type. |
tbs_identity | text | Name of the tablespace where the sfile object will be stored. |
F.51.1.3. SF_PARTITION
#
The SF_PARTITION
table has the following fields:
part_id | int | Unique partition identifier. |
part_data_size | bigint | Number of blocks that are currently stored in the partition. |
part_persistence | smallint | Data storage type in the partition: RELPERSISTENCE_PERMANENT ="p" or RELPERSISTENCE_UNLOGGED ="u". |
rel_identity | text | Name of the table that stores the partition data. |
tbs_identity | text | Name of the tablespace that stores the partition data. |
F.51.1.4. SF_BLOCK
#
The SF_BLOCK
table has the following fields:
sf_id | bigint | Unique sfile object identifier. |
block_id | bigint | Unique identifier of the block in the sfile object. The value of the a_sf_index argument of sf_write is written here. |
part_id | int | Identifier of the partition where the block data is stored. |
block_size | int | Block size, in bytes. |
block_start_page | int | Sequential number of the block first page, the block numbering starts at 1. |
block_end_page | int | Sequential number of the block last page, the block numbering starts at 1. |
sys_creation_date | timestamp | Date and time of the block creation . |
sys_update_date | timestamp | Date and time of the block last update. |
F.51.1.5. SF_PAGE_XX
#
Each SF_PAGE_XX
table has the following fields:
sf_id | bigint | Unique sfile object identifier. |
block_id | bigint | Unique identifier of the block that includes the page. |
page_no | int | Sequential number of the page in the block. |
data | bytea | The buffer to store the sfile object data. The maximum data size to store in the data buffer is 8096 bytes. |
F.51.2. Installation #
The pgpro_sfile
extension is included in Postgres Pro Enterprise. To enable pgpro_sfile, create the extension using the following query:
CREATE EXTENSION pgpro_sfile;
F.51.3. Functions #
The pgpro_sfile functions allow performing various operations with sfile
objects, such as creating, writing, reading, deleting and more. To access IDs of sfile
objects, these functions use the sfile
data type. To store object IDs, a new column of the sfile
data type must be added to the table.
-
sf_initialize()
returnsvoid
# Initializes the storage of
sfile
objects. The function creates thepgpro_sfile_data
schema and then creates the needed tables and sequences in this schema. Call this function right after installation of the extension.-
sf_deinitialize()
returnsvoid
# Performs cascading delete of the
pgpro_sfile_data
schema. Call this function before deleting the extension to delete all thesfile
object data. Superuser privileges are required to call this function.-
sf_create(
returnsa_sf_name
text
,a_sf_persistence
text
,a_sf_json_options
text
[,a_sf_tablespace
text
])sfile
# Creates a new
sfile
object with the namea_sf_name
and returns the ID of the object. The object will be logged or unlogged depending on the value ofa_sf_persistence
, which must beLOGGED
orUNLOGGED
, and thea_sf_json_options
argument must contain JSON with options. Thea_sf_tablespace
argument allows you to specify the tablespace where thesfile
object will be stored, otherwise, the default tablespace is used.This function adds a descriptor of the
sfile
object with the zero size to the SF_DESCRIPTOR relation. So a call to sf_is_empty on a newly created object returnsTRUE
, as well as sf_is_valid. No blocks or partitions are created bysf_create
.-
sf_create_empty([
#a_sf_tablespace
text
]) returnssfile
Creates a new empty
sfile
object with an autogenerated name (sf_gen_XX
) and empty options and returns the ID of the object. Thea_sf_tablespace
argument allows you to specify the tablespace where thesfile
object will be stored, otherwise, the default tablespace is used.-
sf_write(
returnsa_sf
sfile
,a_sf_data
bytea
[,a_sf_index
bigint
])integer
# Inserts the new data block
a_sf_data
with the index specified by thea_sf_index
argument into thesfile
object specified by its IDa_sf
and returns the number of bytes added. The current timestamp is used as the default index.This function first checks whether the
sfile
object was already written in order to figure out whether the partition table SF_PAGE_XX is available and has enough free space to write the block. If the partition is not available or it is the first write, all other partitions are checked and the first available is locked for writing. If no partition is available, a new one is created and added to the SF_PARTITION registry. Then a new block entry is initialized in SF_BLOCK, the block is divided into pages of ~8kB size, and these pages are written to the previously lockedSF_PAGE_XX
partition. When the block is written, the size of the data is adjusted in thepart_data_size
field of theSF_PARTITION
table.-
sf_read(
returnsa_sf
sfile
[,a_offset
bigint
,a_length
integer
])bytea
# Reads the number of bytes specified by the
a_length
from thesfile
object specified by its IDa_sf
at the offset specified by thea_offset
argument. Up to ~1 GB can be read due to thevarlena
limitation. The default is ~1 GB. The default offset is 0. Returns the buffer with the data read.-
sf_size(
returnsa_sf
sfile
)bigint
# Retrieves the
sfile
object size.-
sf_truncate(
returnsa_sf
sfile
)bigint
# Truncates the
sfile
object, that is, deletes all data (blocks) but leaves the object intact, and returns the newsfile
object size, that is, 0. So a subsequent call to sf_is_valid for this object continues to returnTRUE
, and a call to sf_is_empty also returnsTRUE
.-
sf_delete(
returnsa_sf
sfile
)bigint
# Deletes the
sfile
object and returns the amount of data deleted. The function deletes all the blocks of data and then removes the object descriptor from SF_DESCRIPTOR. Since then, the object is not available, so a call to sf_is_valid for this object returnsFALSE
.-
sf_describe(
returnsa_sf
sfile
)cstring
# Retrieves the
sfile
object descriptor data and returns it as text.-
sf_get_json_options(
returnsa_sf
sfile
)cstring
# Retrieves JSON with options provided when the
sfile
object was created (see sf_create).-
sf_find(
returnsa_sf_name
text
)sfile
# Searches the
sfile
object by the specified name and returns thesfile
object ID.-
sf_set_option(
returnsa_opt_name
text
, [a_opt_value
text
,a_opt_type
text
])void
# Sets the option (pgpro_sfile setting) specified by the
a_opt_name
argument in SF_OPTION. The only argument values that are currently supported areTABLESPACE
fora_opt_name
andGLOBAL
fora_opt_type
. The default fora_opt_value
is NULL, and the default fora_opt_type
isGLOBAL
. The pgpro_sfile behavior with theTABLESPACE
option, is to check whether the specified tablespace exists and if it does not, raise an error.-
sf_get_option(
returnsa_opt_name
text
)cstring
# Retrieves the option (pgpro_sfile setting) with the specified name from SF_OPTION.
-
sf_delete_option(
returnsa_opt_name
text
)void
# Deletes the option (pgpro_sfile setting) with the specified name from SF_OPTION.
-
sf_is_valid(
returnsa_sf
sfile
)bool
# Returns
TRUE
if the argument specifies a validsfile
object andFALSE
otherwise.-
sf_is_empty(
returnsa_sf
sfile
)bool
# Returns
TRUE
if the argument specifies a valid and emptysfile
object andFALSE
otherwise.-
sf_is_logged(
returnsa_sf
sfile
)bool
# Returns
TRUE
if the argument specifies a logged (permanent)sfile
object andFALSE
otherwise.-
sf_trim(
returnsa_sf
sfile
,a_length
bigint
)bigint
# Trims the
sfile
object specified by thea_sf
argument to the size specified by thea_length
argument. The trimmed data is deleted. Returns the new object size. If thesfile
object size is less thana_length
, returns the actual object size.-
sf_set_type(
returnsa_sf
sfile
,a_type
text
)void
# Sets the custom type specified in the
a_type
argument to thesfile
object specified bya_sf
.-
sf_get_type(
returnsa_sf
sfile
)cstring
# Retrieves the custom object type of the specified
sfile
object as text.-
sf_md5(
returnsa_sf
sfile
)text
# Calculates MD5 hash for the specified
sfile
object.
F.51.4. Parallel Processing #
To speed up interactions with large objects, read functions sf_read/sf_size and write function the sf_write can be performed in parallel. These functions take an AccessShareLock
lock for the sfile
object, which does not forbid reading the object. Each of sf_read
/sf_size
functions releases this lock right after execution of the operation, while sf_write
holds the lock up to the end of the transaction to prevent deletion/trunction/trimming of the sfile
object before the end of the transaction. In more detail, the changes made by sf_write
to the SF_BLOCK and SF_PAGE_XX tables are not visible to a parallel session, and if the AccessShareLock
lock is removed before the end of the transaction, the sfile
object can be deleted by another session, which is unaware of the rows added to SF_BLOCK
and SF_PAGE_XX
. As a result, the sfile
object will be deleted while these rows will remain intact.
While writing, blocks of one object can be written from several sessions, and the block write sequence depends on the time of the sf_write
call. To control the block write sequence, pass the a_sf_index
argument. If it is not passed, the call timestamp is used instead, which is the default. Each call to sf_write
takes an AccessExclusiveLock
for partitions where blocks are written, which means that it is not permitted to write blocks in the same partition from different sessions as it is not possible to simultaneously modify the part_data_size
field of the SF_PARTITION table. However, writing sfile
object blocks from several sessions into different partitions is permitted.
sf_trim, sf_truncate and sf_delete functions cannot be performed in parallel as they take an AccessExclusiveLock
for the sfile
object.
F.51.5. Example #
This example illustrates usage of pgpro_sfile:
-- Create pgpro_sfile extension CREATE EXTENSION pgpro_sfile; -- Initialize sfile object storage SELECT sf_initialize(); -- Create test table CREATE TABLE test_sfile (id int, l sfile); -- Create and store sfile object INSERT INTO test_sfile VALUES (1, sf_create('sf', 'LOGGED', NULL)); -- Write data to sfile object and check data written SELECT sf_write(t.l, '1234567890'::bytea) FROM test_sfile t WHERE id = 1; SELECT encode(b, 'escape'), length(b) FROM (SELECT sf_read(t.l, 0, NULL) b FROM test_sfile t WHERE id = 1) x; -- Trim sfile object and check data SELECT sf_trim(t.l, 5) FROM test_sfile t WHERE id = 1; SELECT encode(b, 'escape'), length(b) FROM (SELECT sf_read(t.l, 0, NULL) b FROM test_sfile t WHERE id = 1) x; -- Remove table DROP TABLE test_sfile; -- Delete data of all sfile objects SET client_min_messages = WARNING; SELECT sf_deinitialize(); RESET client_min_messages; -- Delete pgpro_sfile extension DROP EXTENSION pgpro_sfile;
This example produces the following output:
CREATE EXTENSION sf_initialize --------------- (1 row) CREATE TABLE INSERT 0 1 sf_write ---------- 10 (1 row) encode | length ------------+-------- 1234567890 | 10 (1 row) sf_trim --------- 5 (1 row) encode | length --------+-------- 12345 | 5 (1 row) DROP TABLE SET sf_deinitialize ----------------- (1 row)
F.51.6. Authors #
Postgres Professional, Moscow, Russia