MVCC in PostgreSQL — 3. Row Versions
Well, we've already discussed isolation and made a digression regarding the low-level data structure. And we've finally reached the most fascinating thing, that is, row versions (tuples).
Tuple header
As already mentioned, several versions of each row can be simultaneously available in the database. And we need to somehow distinguish one version from another one. To this end, each version is labeled with its effective "time" (xmin
) and expiration "time" (xmax
). Quotation marks denote that a special incrementing counter is used rather than the time itself. And this counter is the transaction identifier.
(As usual, in reality this is more complicated: the transaction ID cannot always increment due to a limited bit depth of the counter. But we will explore more details of this when our discussion reaches freezing.)
When a row is created, the value of xmin
is set equal to the ID of the transaction that performed the INSERT command, while xmax
is not filled in.
When a row is deleted, the xmax
value of the current version is labeled with the ID of the transaction that performed DELETE.
An UPDATE command actually performs two subsequent operations: DELETE and INSERT. In the current version of the row, xmax
is set equal to the ID of the transaction that performed UPDATE. Then a new version of the same row is created, in which the value of xmin
is the same as xmax
of the previous version.
xmin
and xmax
fields are included in the header of a row version. In addition to these fields, the tuple header contains other ones, such as:
infomask
-- several bits that determine the properties of a given tuple. There are quite a few of them, and we will discuss each over time.ctid
-- a reference to the next, more recent, version of the same row.ctid
of the newest, up-to-date, row version references that very version. The number is in the(x,y)
form, wherex
is the number of the page andy
is the order number of the pointer in the array.- The NULLs bitmap, which marks those columns of a given version that contain a NULL. NULL is not a regular value of data types, and therefore, we have to store this characteristic separately.
As a result, the header appears pretty large: 23 bytes per each tuple at a minimum, but usually larger because of the NULLs bitmap. If a table is "narrow" (that is, it contains few columns), the overhead bytes can occupy more space that the useful information.
Insert
Let's look in more detail at how the operations on rows are performed at a low level, and we start with an insert.
To experiment, we will create a new table with two columns and an index on one of them:
=> CREATE TABLE t(
id serial,
s text
);
=> CREATE INDEX ON t(s);
We start a transaction to insert a row.
=> BEGIN;
=> INSERT INTO t(s) VALUES ('FOO');
This is the ID of our current transaction:
=> SELECT txid_current();
txid_current
--------------
3664
(1 row)
Let's look into the contents of the page. The heap_page_items
function from the "pageinspect" extension enables us to get information on the pointers and row versions:
=> SELECT * FROM heap_page_items(get_raw_page('t',0)) \gx
-[ RECORD 1 ]-------------------
lp | 1
lp_off | 8160
lp_flags | 1
lp_len | 32
t_xmin | 3664
t_xmax | 0
t_field3 | 0
t_ctid | (0,1)
t_infomask2 | 2
t_infomask | 2050
t_hoff | 24
t_bits |
t_oid |
t_data | \x0100000009464f4f
Note that the word "heap" in PostgreSQL denotes tables. This is one more weird usage of a term: a heap is a known data structure, which has nothing to do with a table. This word is used here in the sense that "all is heaped up", unlike in ordered indexes.
This function shows the data "as is", in a format that is difficult to comprehend. To clarify the things, we leave only part of the information and interpret it:
=> SELECT '(0,'||lp||')' AS ctid,
CASE lp_flags
WHEN 0 THEN 'unused'
WHEN 1 THEN 'normal'
WHEN 2 THEN 'redirect to '||lp_off
WHEN 3 THEN 'dead'
END AS state,
t_xmin as xmin,
t_xmax as xmax,
(t_infomask & 256) > 0 AS xmin_commited,
(t_infomask & 512) > 0 AS xmin_aborted,
(t_infomask & 1024) > 0 AS xmax_commited,
(t_infomask & 2048) > 0 AS xmax_aborted,
t_ctid
FROM heap_page_items(get_raw_page('t',0)) \gx
-[ RECORD 1 ]-+-------
ctid | (0,1)
state | normal
xmin | 3664
xmax | 0
xmin_commited | f
xmin_aborted | f
xmax_commited | f
xmax_aborted | t
t_ctid | (0,1)
We did the following:
- Added a zero to the pointer number to make it look like a
t_ctid
: (page number, pointer number). - Interpreted the status of the
lp_flags
pointer. It is "normal" here, which means that the pointer actually references a row version. We will discuss other values later. - Of all information bits, we selected only two pairs so far.
xmin_committed
andxmin_aborted
bits show whether the transaction with the IDxmin
is committed (rolled back). A pair of similar bits relates to the transaction with the IDxmax
.
What do we observe? When a row is inserted, in the table page a pointer appears that has number 1 and references the first and the only version of the row.
The xmin
field in the tuple is filled with the ID of the current transaction. Because the transaction is still active, both xmin_committed
and xmin_aborted
bits are unset.
The ctid
field of the row version references the same row. It means that no newer version is available.
The xmax
field is filled with the conventional number 0 since the tuple is not deleted, that is, up-to-date. Transactions will ignore this number because of the xmax_aborted
bit set.
Let's move one more step to improving the readability by appending information bits to transaction IDs. And let's create the function since we will need the query more than once:
=> CREATE FUNCTION heap_page(relname text, pageno integer)
RETURNS TABLE(ctid tid, state text, xmin text, xmax text, t_ctid tid)
AS $$
SELECT (pageno,lp)::text::tid AS ctid,
CASE lp_flags
WHEN 0 THEN 'unused'
WHEN 1 THEN 'normal'
WHEN 2 THEN 'redirect to '||lp_off
WHEN 3 THEN 'dead'
END AS state,
t_xmin || CASE
WHEN (t_infomask & 256) > 0 THEN ' (c)'
WHEN (t_infomask & 512) > 0 THEN ' (a)'
ELSE ''
END AS xmin,
t_xmax || CASE
WHEN (t_infomask & 1024) > 0 THEN ' (c)'
WHEN (t_infomask & 2048) > 0 THEN ' (a)'
ELSE ''
END AS xmax,
t_ctid
FROM heap_page_items(get_raw_page(relname,pageno))
ORDER BY lp;
$$ LANGUAGE SQL;
What is happening in the header of the row version it is much clearer in this form:
=> SELECT * FROM heap_page('t',0);
ctid | state | xmin | xmax | t_ctid
-------+--------+------+-------+--------
(0,1) | normal | 3664 | 0 (a) | (0,1)
(1 row)
We can get similar information, but far less detailed, from the table itself by using xmin
and xmax
pseudo-columns:
=> SELECT xmin, xmax, * FROM t;
xmin | xmax | id | s
------+------+----+-----
3664 | 0 | 1 | FOO
(1 row)
Commit
When a transaction is successful, its status must be remembered, that is, the transaction must be marked as committed. To this end, the XACT structure is used. (Before version 10 it was called CLOG (commit log), and you are still likely to come across this name.)
XACT is not a table of the system catalog, but files in the PGDATA/pg_xact directory. Two bits are allocated in these files for each transaction — "committed" and "aborted" — exactly the same way as in the tuple header. This information is spread across several files only for convenience; we will get back to this when we discuss freezing. PostgreSQL works with these files page by page, as with all others.
So, when a transaction is committed, the "committed" bit is set for this transaction in XACT. And this is all that happens when the transaction is committed (although we do not mention the write-ahead log yet).
When some other transaction accesses the table page we were just looking at, the former will have to answer a few questions.
- Was the transaction
xmin
completed? If not, the created tuple must not be visible.
This is checked by looking through another structure, which is located in the shared memory of the instance and called ProcArray. This structure holds a list of all active processes, along with the ID of the current (active) transaction for each. - If the transaction was completed, then was it committed or rolled back? If it was rolled back, the tuple must not be visible either.
This is just what XACT is needed for. But it is expensive to check XACT each time, although last pages of XACT are stored in buffers in the shared memory. Therefore, once figured out, the transaction status is written toxmin_committed
andxmin_aborted
bits of the tuple. If any of these bits is set, the transaction status is treated as known and the next transaction will not need to check XACT.
Why does not the transaction that performs the insert set these bits? When an insert is being performed, the transaction is yet unaware of whether it will be completed successfully. And at the commit time it's already unclear which rows and in which pages were changed. There can be a lot of such pages, and it is impractical to keep track of them. Besides, some of the pages can be evicted to disk from the buffer cache; to read them again in order to change the bits would mean a considerable slowdown of the commit.
The reverse side of the cost saving is that after the updates, any transaction (even the one performing SELECT) can begin changing data pages in the buffer cache.
So, we commit the change.
=> COMMIT;
Nothing has changed in the page (but we know that the transactions status is already written to XACT):
=> SELECT * FROM heap_page('t',0);
ctid | state | xmin | xmax | t_ctid
-------+--------+------+-------+--------
(0,1) | normal | 3664 | 0 (a) | (0,1)
(1 row)
Now a transaction that first accesses the page will need to determine the status of the transaction xmin
and will write it to the information bits:
=> SELECT * FROM t;
id | s
----+-----
1 | FOO
(1 row)
=> SELECT * FROM heap_page('t',0);
ctid | state | xmin | xmax | t_ctid
-------+--------+----------+-------+--------
(0,1) | normal | 3664 (c) | 0 (a) | (0,1)
(1 row)
Delete
When a row is deleted, the ID of the current deleting transaction is written to the xmax
field of the up-to-date version and the xmax_aborted
bit is reset.
Note that the value of xmax
corresponding to the active transaction works as a row lock. If another transaction is going to update or delete this row, it will have to wait until the xmax
transaction completes. We will talk about locks in more detail later. At this point, only note that the number of row locks is not limited at all. They do not occupy memory, and the system performance is not affected by that number. However, long lasting transactions have other drawbacks, which will also be discussed later.
Let's delete a row.
=> BEGIN;
=> DELETE FROM t;
=> SELECT txid_current();
txid_current
--------------
3665
(1 row)
We see that the transaction ID is written to the xmax
field, but information bits are unset:
=> SELECT * FROM heap_page('t',0);
ctid | state | xmin | xmax | t_ctid
-------+--------+----------+------+--------
(0,1) | normal | 3664 (c) | 3665 | (0,1)
(1 row)
Abort
Abort of a transaction works similarly to commit, except that the "aborted" bit is set in XACT. An abort is done as fast as a commit. Although the command is called ROLLBACK, the changes are not rolled back: everything that the transaction has already changed, remains untouched.
=> ROLLBACK;
=> SELECT * FROM heap_page('t',0);
ctid | state | xmin | xmax | t_ctid
-------+--------+----------+------+--------
(0,1) | normal | 3664 (c) | 3665 | (0,1)
(1 row)
When accessing the page, the status will be checked and the hint bit xmax_aborted
will be set. Although the number xmax
itself will be still in the page, it will not be looked at.
=> SELECT * FROM t;
id | s
----+-----
1 | FOO
(1 row)
=> SELECT * FROM heap_page('t',0);
ctid | state | xmin | xmax | t_ctid
-------+--------+----------+----------+--------
(0,1) | normal | 3664 (c) | 3665 (a) | (0,1)
(1 row)
Update
An update works as if the current version is deleted first and then a new one is inserted.
=> BEGIN;
=> UPDATE t SET s = 'BAR';
=> SELECT txid_current();
txid_current
--------------
3666
(1 row)
The query returns one row (the new version):
=> SELECT * FROM t;
id | s
----+-----
1 | BAR
(1 row)
But we can see both versions in the page:
=> SELECT * FROM heap_page('t',0);
ctid | state | xmin | xmax | t_ctid
-------+--------+----------+-------+--------
(0,1) | normal | 3664 (c) | 3666 | (0,2)
(0,2) | normal | 3666 | 0 (a) | (0,2)
(2 rows)
The deleted version is labeled with the ID of the current transaction in the xmax
field. Moreover, this value has overwritten the old one since the previous transaction was rolled back. And the xmax_aborted
bit is reset since the status of the current transaction is unknown yet.
The first version of the row is now referencing the second, as a newer one.
The index page now contains the second pointer and second row, which references the second version in the table page.
The same way as for a delete, the value of xmax
in the first version indicates that the row is locked.
Lastly, we commit the transaction.
=> COMMIT;
Indexes
We were talking only about table pages so far. But what happens inside indexes?
Information in index pages highly depends on the specific index type. Moreover, even one type of indexes can have different kinds of pages. For example: a B-tree has the metadata page and "normal" pages.
Nevertheless, an index page usually has an array of pointers to the rows and rows themselves (just like table pages). Besides, some space at the end of a page is allocated for special data.
Rows in indexes can also have different structures depending on the index type. For example: in an B-tree, rows pertinent to leaf pages contain the value of the indexing key and a reference (ctid
) to the appropriate table row. In general, an index can be structured quite a different way.
The main point is that in indexes of any type there are no row versions. Or we can consider each row to be represented by only one version. In other words, the header of the index row does not contain the xmin
and xmax
fields. For now we can assume that references from the index point to all versions of table rows. So to make out which of the row versions are visible to a transaction, PostgreSQL needs to look into the table. (As usual, this is not the whole story. Sometimes the visibility map enables optimizing the process, but we will discuss this later.)
Here, in the index page, we find pointers to both versions: the up-to-date and previous:
=> SELECT itemoffset, ctid FROM bt_page_items('t_s_idx',1);
itemoffset | ctid
------------+-------
1 | (0,2)
2 | (0,1)
(2 rows)
Virtual transactions
In practice, PostgreSQL takes advantage of an optimization that permits to "sparingly" expends transaction IDs.
If a transaction only reads data, it does not affect the visibility of tuple at all. Therefore, first the backend process assigns a virtual ID (virtual xid) to the transaction. This ID consists of the process identifier and a sequential number.
Assignment of this virtual ID does not require synchronization between all the processes and is therefore performed very quickly. We will learn another reason of using virtual IDs when we discuss freezing.
Data snapshots do not take into account virtual ID at all.
At different points in time, the system can have virtual transactions with IDs that were already used, and this is fine. But this ID cannot be written to data pages since when the page is accessed next time, the ID can become meaningless.
=> BEGIN;
=> SELECT txid_current_if_assigned();
txid_current_if_assigned
--------------------------
(1 row)
But if a transaction begins to change data, it receives a true, unique, transaction ID.
=> UPDATE accounts SET amount = amount - 1.00;
=> SELECT txid_current_if_assigned();
txid_current_if_assigned
--------------------------
3667
(1 row)
=> COMMIT;
Subtransactions
Savepoints
In SQL, savepoints are defined, which permit rolling back some operations of the transaction without its complete abortion. But this is incompatible with the above model since the transaction status is one for all the changes and no data is physically rolled back.
To implement this functionality, a transaction with a savepoint is divided into several separate subtransactions whose statuses can be managed separately.
Subtrabsactions have their own IDs (greater than the ID of the main transaction). The statuses of subtransactions are written to XACT in a usual way, but the final status depends on the status of the main transaction: if it is rolled back, all subtransactions are rolled back as well.
Information about subtransactions nesting is stored in files of the PGDATA/pg_subtrans directory. These files are accessed through buffers in the shared memory of the instance, which are structured the same way as XACT buffers.
Do not confuse subtransactions with autonomous transactions. Autonomous transactions in no way depend on one another, while subtransactions do depend. There are no autonomous transactions in the regular PostgreSQL, which is, perhaps, for the better: they are actually needed extremely rarely, and their availability in other DBMS invites abuse, which everyone suffers.
Let's clear the table, start a transaction and insert a row:
=> TRUNCATE TABLE t;
=> BEGIN;
=> INSERT INTO t(s) VALUES ('FOO');
=> SELECT txid_current();
txid_current
--------------
3669
(1 row)
=> SELECT xmin, xmax, * FROM t;
xmin | xmax | id | s
------+------+----+-----
3669 | 0 | 2 | FOO
(1 row)
=> SELECT * FROM heap_page('t',0);
ctid | state | xmin | xmax | t_ctid
-------+--------+------+-------+--------
(0,1) | normal | 3669 | 0 (a) | (0,1)
(1 row)
Now we establish a savepoint and insert another row.
=> SAVEPOINT sp;
=> INSERT INTO t(s) VALUES ('XYZ');
=> SELECT txid_current();
txid_current
--------------
3669
(1 row)
Note that the txid_current
function returns the ID of the main transaction rather than of the subtransaction.
=> SELECT xmin, xmax, * FROM t;
xmin | xmax | id | s
------+------+----+-----
3669 | 0 | 2 | FOO
3670 | 0 | 3 | XYZ
(2 rows)
=> SELECT * FROM heap_page('t',0);
ctid | state | xmin | xmax | t_ctid
-------+--------+------+-------+--------
(0,1) | normal | 3669 | 0 (a) | (0,1)
(0,2) | normal | 3670 | 0 (a) | (0,2)
(2 rows)
Let's rollback to the savepoint and insert the third row.
=> ROLLBACK TO sp;
=> INSERT INTO t VALUES ('BAR');
=> SELECT xmin, xmax, * FROM t;
xmin | xmax | id | s
------+------+----+-----
3669 | 0 | 2 | FOO
3671 | 0 | 4 | BAR
(2 rows)
=> SELECT * FROM heap_page('t',0);
ctid | state | xmin | xmax | t_ctid
-------+--------+----------+-------+--------
(0,1) | normal | 3669 | 0 (a) | (0,1)
(0,2) | normal | 3670 (a) | 0 (a) | (0,2)
(0,3) | normal | 3671 | 0 (a) | (0,3)
(3 rows)
In the page, we continue to see the row that was added by the rolled back subtransaction.
Committing the changes.
=> COMMIT;
=> SELECT xmin, xmax, * FROM t;
xmin | xmax | id | s
------+------+----+-----
3669 | 0 | 2 | FOO
3671 | 0 | 4 | BAR
(2 rows)
=> SELECT * FROM heap_page('t',0);
ctid | state | xmin | xmax | t_ctid
-------+--------+----------+-------+--------
(0,1) | normal | 3669 (c) | 0 (a) | (0,1)
(0,2) | normal | 3670 (a) | 0 (a) | (0,2)
(0,3) | normal | 3671 (c) | 0 (a) | (0,3)
(3 rows)
It is clearly seen now that each subtransaction has its own status.
Note that SQL does not permit explicit use of subtransactions, that is, you cannot start a new transaction before you complete the current one. This technique gets implicitly involved when savepoints are used and also when handling PL/pgSQL exceptions, as well as in some other, more exotic situations.
=> BEGIN;
BEGIN
=> BEGIN;
WARNING: there is already a transaction in progress
BEGIN
=> COMMIT;
COMMIT
=> COMMIT;
WARNING: there is no transaction in progress
COMMIT
Errors and operation atomicity
What happens if an error occurs while the operation is being performed? For example, like this:
=> BEGIN;
=> SELECT * FROM t;
id | s
----+-----
2 | FOO
4 | BAR
(2 rows)
=> UPDATE t SET s = repeat('X', 1/(id-4));
ERROR: division by zero
An error occurred. Now the transaction is treated as aborted and no operations are permitted in it:
=> SELECT * FROM t;
ERROR: current transaction is aborted, commands ignored until end of transaction block
And even if we try to commit the changes, PostgreSQL will report the rollback:
=> COMMIT;
ROLLBACK
Why is it impossible to continue execution of the transaction after a failure? The thing is that the error could occur so that we would get access to part of the changes, that is, the atomicity would be broken not only for the transaction, but even for a single operator. For instance, in our example the operator could have updated one row before the error occurred:
=> SELECT * FROM heap_page('t',0);
ctid | state | xmin | xmax | t_ctid
-------+--------+----------+-------+--------
(0,1) | normal | 3669 (c) | 3672 | (0,4)
(0,2) | normal | 3670 (a) | 0 (a) | (0,2)
(0,3) | normal | 3671 (c) | 0 (a) | (0,3)
(0,4) | normal | 3672 | 0 (a) | (0,4)
(4 rows)
It's worth noting that psql has a mode that allows continuing the transaction after failure, as if the effects of the erroneous operator were rolled back.
=> \set ON_ERROR_ROLLBACK on
=> BEGIN;
=> SELECT * FROM t;
id | s
----+-----
2 | FOO
4 | BAR
(2 rows)
=> UPDATE t SET s = repeat('X', 1/(id-4));
ERROR: division by zero
=> SELECT * FROM t;
id | s
----+-----
2 | FOO
4 | BAR
(2 rows)
=> COMMIT;
It's easy to figure out that in this mode, psql actually establishes an implicit savepoint before each command and initiates a rollback to it in the event of failure. This mode is not used by default since establishing savepoints (even without a rollback to them) entails a significant overhead.