From: Matthias van de Meent
Subject: Re: RFC: WAL infrastructure issues, updates and improvements
Msg-id: CAEze2Whf=fwAj7rosf6aDM9t+7MU1w-bJn28HFWYGkz+ics-hg@mail.gmail.com
Hi,

Various users and threads on -hackers (incl. [0] specifically) have
complained about the overhead of our WAL format. One example of this
is that we write at least 44 bytes to register any change to a single
relation's page: a 24-byte WAL record header, a 4-byte block header,
12 bytes for the RelFileLocator, and 4 bytes for the BlockNumber. This
is a very significant overhead, which threatens to grow even larger as
we consider moving to larger identifiers
in PostgreSQL: 56-bit relfilenodes, 64-bit XIDs, etc. I think we can
do significantly better than that in most, if not all, cases.
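
For reference, this is roughly where those 44 bytes come from; the
layouts below paraphrase the current definitions in access/xlogrecord.h
and storage/relfilelocator.h:

    typedef struct XLogRecord            /* 24 bytes, incl. padding */
    {
        uint32        xl_tot_len;        /* total length of the record */
        TransactionId xl_xid;            /* xact ID (4 bytes) */
        XLogRecPtr    xl_prev;           /* previous record (8 bytes) */
        uint8         xl_info;           /* flag bits */
        RmgrId        xl_rmid;           /* resource manager (1 byte) */
        /* 2 bytes of padding here */
        pg_crc32c     xl_crc;            /* CRC of this record (4 bytes) */
    } XLogRecord;

    typedef struct XLogRecordBlockHeader /* 4 bytes */
    {
        uint8         id;                /* block reference ID */
        uint8         fork_flags;        /* fork within the rel, and flags */
        uint16        data_length;       /* payload bytes for this block */
    } XLogRecordBlockHeader;

    /*
     * Unless BKPBLOCK_SAME_REL is set, the block header is followed by a
     * RelFileLocator (3 x 4-byte fields: spcOid, dbOid, relNumber) and a
     * 4-byte BlockNumber: 24 + 4 + 12 + 4 = 44 bytes before any payload.
     */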

For PostgreSQL 17, I'd like to propose an effort to improve our WAL
infrastructure. The tail of this mail lists the areas for improvement
I have identified, plus a feasibility analysis for each.
There are probably more issues and potential improvements, but these
are the ones I was aware of.

Note that I'm mentioning PostgreSQL 17 because I don't have a lot of
time for 16, and I think changing WAL formats and behavior all at once
reduces the total effort for keeping external tooling compatible (as
opposed to piecemeal changes); "to rip off the band-aid". I'm sending
this mail now to collect community feedback and to make the thread
discoverable in the archives - I can't find any other threads in the
-hackers archives that were created specifically to cover people's
gripes with WAL and solutions to those gripes.

Kind regards,

Matthias van de Meent

CC-ed Heikki, Robert, Andres and Dilip for their interest in the topic
shown in the 56-bit relfilenumber thread, and Koichi for his work on
parallel recovery.
(sorry for the duplicate mail, I misconfigured my sender address)

------------

I am aware of the following ideas that exist, several of which came up
at the FOSDEM Developer Meeting last Thursday [1][2]:

Reducing the size of the XLog record header:
============

We currently store an XID in every XLog record header, while only some
records need one. E.g. most index records need no XID for recovery,
nor for conflict resolution.

The length field is 4 bytes, but many records are shorter than
UINT8_MAX bytes, and most are shorter than UINT16_MAX bytes. Putting
the total record length in a variable-length field could reduce the
average record overhead even more.

There are currently 2 bytes of alignment losses in the XLog header.

We may not need to store the xl_prev pointer to the previous record
explicitly; folding an implicit reference to it into the checksum
_may_ be enough.

Together, these updates can reduce the header size by several bytes in
the common case.
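
As a rough illustration only (hypothetical names and layout, not a
proposal for the exact format): a reduced header could keep a small
fixed core and gate the XID, xl_prev, and the wider length encodings
behind per-record flag bits, written unaligned on disk:

    /* Hypothetical reduced record header; optional fields follow only
     * when the corresponding xlh_flags bit is set. */
    #define XLH_HAS_XID      0x01    /* 4-byte TransactionId follows */
    #define XLH_HAS_XL_PREV  0x02    /* 8-byte xl_prev follows */
    #define XLH_LEN_MASK     0x0C    /* 1-, 2-, or 4-byte total length */

    typedef struct XLogRecordHeaderShort
    {
        uint8       xlh_flags;       /* which optional fields follow */
        uint8       xlh_rmid;        /* resource manager */
        uint8       xlh_info;        /* rmgr-specific flag bits */
        pg_crc32c   xlh_crc;         /* CRC; could also cover an implicit
                                      * xl_prev instead of storing it */
        /* then: 1/2/4 bytes of length, optional XID, optional xl_prev */
    } XLogRecordHeaderShort;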


Reducing the size of the XLog block header
============
The block header has a 2-byte length field. Not all registered blocks
have associated data, and many that do have only little data (fewer
than UINT8_MAX bytes). It would make sense to make this field
variable-length, too.
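
For illustration, a minimal sketch of such a variable-length field
(hypothetical encoding, assuming PostgreSQL's uint8/uint16 typedefs):
lengths below 255 take one byte, anything larger takes an escape byte
plus the full 2-byte length:

    static inline int
    encode_block_data_length(uint8 *out, uint16 len)
    {
        if (len < 0xFF)
        {
            out[0] = (uint8) len;       /* common case: 1 byte */
            return 1;
        }
        out[0] = 0xFF;                  /* escape: full length follows */
        memcpy(&out[1], &len, sizeof(uint16));
        return 1 + sizeof(uint16);
    }

    static inline int
    decode_block_data_length(const uint8 *in, uint16 *len)
    {
        if (in[0] < 0xFF)
        {
            *len = in[0];
            return 1;
        }
        memcpy(len, &in[1], sizeof(uint16));
        return 1 + sizeof(uint16);
    }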


Reducing the size of RelFileLocator, BlockNumber
============
Most of the time, the values in RelFileLocator's fields (and
BlockNumber) are small enough to fit in 2 or 3 bytes. We could shave
off some more bytes here, if the values are low enough and if we have
the bits to spare.
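
One way to picture that (again purely hypothetical, assuming a
little-endian on-disk layout): spend two spare bits per field as a
width tag and write only the low-order bytes that are actually needed:

    /* Write 'value' using the fewest bytes possible; '*tag' (0..3)
     * records how many bytes were used minus one, to be stored in
     * spare flag bits. */
    static inline int
    encode_u32_tagged(uint8 *out, uint32 value, uint8 *tag)
    {
        int     nbytes = 1;

        while (nbytes < 4 && value >= ((uint32) 1 << (8 * nbytes)))
            nbytes++;
        *tag = (uint8) (nbytes - 1);
        memcpy(out, &value, nbytes);    /* low-order bytes only (LE) */
        return nbytes;
    }

    /* Decoding reads tag + 1 bytes back into a zero-initialized uint32,
     * so a small dbOid/relNumber/BlockNumber costs 1 byte + 2 tag bits. */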


Reducing the need for FPIs in some cases
============
We log FPIs when we modify a page for the first time after a
checkpoint, to make sure we can recover from torn writes. However, as
long as the data we change is at known offsets in the page (e.g. the
page's checksum field), we could also choose to _not_ emit an FPI, and
instead mark the page as 'dirty + partial FPI emitted'. Redo would
_always_ have to replay these partial FPI images, but overall WAL
volume could decrease dramatically on checksum-enabled systems, where
hint-bit update WAL records can account for a large portion of the WAL
volume.

We'd need to update the rules on when we emit an FPI, but I think this
is not impossible and indeed quite feasible.

Some ideas about this "partial FPI":

As long as the line pointer array is not being modified, we can use
this partial FPI for updates to an item's header flag bits (assuming
the xlog record tells us what was modified): the items stay at their
locations in the page, so even with torn writes the line pointers
still point at the same offsets, allowing us to locate each item's
header and overwrite the affected bytes. We can definitely use this
partial FPI for visibility hint bits, and use it for VM-bit update
logging as well.

The buffer descriptor (and/or page header bits) would need indicator
flags not only for 'is dirty', but also for 'only a partial FPI has
been emitted this checkpoint cycle', so that when the page is later
modified in a way that is incompatible with the "partial FPI" rules we
don't forget to emit a full FPI.

See also: [2]
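
To make the hint-bit case concrete, redo of such a partial-FPI record
could look roughly like this (hypothetical record layout and function;
the real thing would go through XLogReadBufferForRedo and the
dirty/partial-FPI bookkeeping described above):

    /* Hypothetical "partial FPI" payload: which item, which hint bits. */
    typedef struct xl_partial_hintbits
    {
        OffsetNumber offnum;            /* line pointer of the tuple */
        uint16       infomask_set;      /* bits to OR into t_infomask */
    } xl_partial_hintbits;

    static void
    partial_hintbits_redo(Page page, xl_partial_hintbits *xlrec)
    {
        ItemId          lp = PageGetItemId(page, xlrec->offnum);
        HeapTupleHeader htup = (HeapTupleHeader) PageGetItem(page, lp);

        /*
         * The line pointer array was not modified, so lp_off still
         * locates the tuple header even if the on-disk page was torn;
         * overwriting just these bytes is safe and idempotent.
         */
        htup->t_infomask |= xlrec->infomask_set;
        /* caller marks the buffer dirty and bumps the page LSN as usual */
    }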


Reducing the overhead of reading large WAL records
============
Right now, WAL records are first fully read into a sequential buffer,
then checksum-validated, and then parsed. For small records that fit
on a page, we don't need to copy the data to calculate the checksum,
but for records spanning multiple pages, we write the whole record
into a locally allocated buffer, and then copy all that data again
when parsing the record's data.

See for example the records generated while building a GIN index on a
table with data: a single GIN record containing several FPIs of the
newly built index spans several xlog pages. During redo this currently
requires two allocations of roughly that record's size - one for a
temporary buffer to hold the raw data, and one for the
DecodedXLogRecord struct that will be used by the redo machinery.

This feels extremely wasteful. I think we should be able to parse and
checksum a record's data in the same pass: all but the first 861 bytes
of a record are data sections, which can be copied and aligned
directly into the relevant sections of the DecodedXLogRecord.
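
A sketch of that single pass (hypothetical helper; INIT/COMP/FIN_CRC32C
are the existing CRC macros, and the per-page segments stand in for
chunks pulled off successive WAL pages):

    /* Copy one record's on-page segments straight into their final
     * destination while accumulating the CRC in the same pass, instead
     * of staging them in a separate full-record buffer first. */
    static pg_crc32c
    copy_and_checksum_segments(char *dest, const char *const *segs,
                               const Size *seglens, int nsegs)
    {
        pg_crc32c   crc;

        INIT_CRC32C(crc);
        for (int i = 0; i < nsegs; i++)
        {
            COMP_CRC32C(crc, segs[i], seglens[i]);  /* checksum... */
            memcpy(dest, segs[i], seglens[i]);      /* ...and place, once */
            dest += seglens[i];
        }
        FIN_CRC32C(crc);
        return crc;                 /* compare against xl_crc afterwards */
    }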


Parallelizing recovery
============
In a system that is not in use (i.e. crash recovery), each block's WAL
needs to be replayed in order, but there is no need to replay all
records in one global sequence.
We could distribute WAL records across threads for replay, with some
extra coordination required for multi-block (such as HEAP/UPDATE) and
extra-relational records (such as XACT/COMMIT).
This may extend to replicas, but I'm not confident that all apply
orders are OK as long as each block's WAL is applied linearly, with a
prime counterexample being index insertions: An index tuple may only
be inserted after the heap tuple is inserted, or the index would
report that it is corrupted because it can't find the heap tuples it
is pointing to.
I think the XLog records of different databases could be replayed in
parallel but I'm not super confident about this; as long as we
synchronize on shared-information records (like updates in the
catalogs, or commits) I *think* we should be fine.
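
A very rough sketch of the dispatch rule this implies (hypothetical
helpers; it ignores batching and the hot-standby ordering questions
above): hash single-block records to a worker, and drain all workers
before records that touch multiple blocks or no block at all:

    /* record_is_extra_relational, send_to_worker, wait_for_all_workers
     * and apply_record_inline are placeholders for illustration. */
    static void
    dispatch_record(DecodedXLogRecord *record, int nworkers)
    {
        if (record->max_block_id == 0 &&
            !record_is_extra_relational(record))
        {
            /* one block reference: replay order only matters per block */
            DecodedBkpBlock *blk = &record->blocks[0];
            uint32      h;

            h = hash_bytes((const unsigned char *) &blk->rlocator,
                           sizeof(RelFileLocator));
            h = hash_combine(h, blk->blkno);
            send_to_worker(h % nworkers, record);
        }
        else
        {
            /* multi-block, or no block at all: serialize around it */
            wait_for_all_workers();
            apply_record_inline(record);
        }
    }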

------------------------------

Feasibility:

"Reducing the size of the XLog block header" is implementable in the
near term, as it requires very few changes in the XLog machinery. I am
planning on providing a patch for PostgreSQL 17.

"Reducing the size of the XLog record header" takes more effort, but I
think this too is possible in the near future. I'm planning on
providing this in a patch for 17 as well.

"Reducing the size of RelFileLocator, BlockNumber" has been proposed
by Andres Freund in [0], and showed some nice reduction in WAL size in
his testing. However, I am concerned about the potentially
significantly different performance between new clusters and older
clusters due to the significantly different compressible database IDs
/ relfilenumber / blocknumber in those systems.
I do think it is worth looking into improving it, but I think we'd
need changes to how relation and database IDs are generated first:
sharing a counter with TOAST (and other things that could generate
apparent gaps in a database's (or cluster's) assigned ids) is in my
opinion extremely unfavorable for the efficiency of the feature.

I think that "Reducing the need for FPIs in some cases" might be
somewhat controversial, and would need to be considered separately
from any other changes. I only (re?)discovered this idea last week, so
I haven't yet mapped out all the pros and cons. But, because it could
potentially reduce the IO overhead of maintenance operations by up to
50% (still one dirty page write, but that's down from a full page in
WAL plus a dirty page write), I think it is definitely worth looking
into.

"Reducing the overhead of reading large WAL records" requires another
update in the WAL reader. I don't think there are many controversial
components to this, though, as it is mostly refactoring and making
sure we don't lose performance.

"Parallelizing recovery" is not impossible, but requires significant
work regarding ordering of operations before it would be usable on hot
standby nodes. I did find a wiki page on the topic [3], and it
contains some further reading and links to a git branch with work on
this topic (latest based on 14.6).


[0] "problems with making relfilenodes 56-bits":
https://www.postgresql.org/message-id/flat/CA%2BTgmoaa9Yc9O-FP4vS_xTKf8Wgy8TzHpjnjN56_ShKE%3DjrP-Q%40mail.gmail.com
[1] https://wiki.postgresql.org/wiki/FOSDEM/PGDay_2023_Developer_Meeting#XLog_Format
[2] https://wiki.postgresql.org/wiki/FOSDEM/PGDay_2023_Developer_Meeting#Page_Format
[3] https://wiki.postgresql.org/wiki/Parallel_Recovery


