Re: RFC: WAL infrastructure issues, updates and improvements - Mailing list pgsql-hackers

From:    Matthias van de Meent
Subject: Re: RFC: WAL infrastructure issues, updates and improvements
Msg-id:  CAEze2Whf=fwAj7rosf6aDM9t+7MU1w-bJn28HFWYGkz+ics-hg@mail.gmail.com
List:    pgsql-hackers

Hi,

Various users and threads on -hackers (incl. [0] specifically) have complained about the overhead of our WAL format. One example of this is that we write at least 44 bytes to register any change to a single relation's page: a 24-byte WAL record header, a 4-byte block record header, 12 bytes for the RelFileLocator, plus 4 bytes for the BlockNumber. This is a very significant overhead, which is threatening to grow ever larger as we consider moving to larger identifiers in PostgreSQL: 56-bit relfilenodes, 64-bit XIDs, etc. I think we can do significantly better than that in most, if not all, cases.

For PostgreSQL 17, I'd like to propose an effort to improve our WAL infrastructure. The tail of this mail contains the areas for improvement I have identified, with a feasibility analysis for each. There are probably more issues and potential improvements, but these are the ones I was aware of.

Note that I'm mentioning PostgreSQL 17 because I don't have a lot of time for 16, and I think changing WAL formats and behavior all at once reduces the total effort of keeping external tooling compatible (as opposed to piecemeal changes); "to rip off the band-aid". I'm sending this mail now to collect community feedback and to make the thread discoverable in the archives - I can't find any other threads in the -hackers archives that were created specifically to cover people's gripes with WAL and solutions to those gripes.

Kind regards,

Matthias van de Meent

CC-ed Heikki, Robert, Andres and Dilip for their interest in the topic shown in the 56-bit relfilenumber thread, and Koichi for his work on parallel recovery.

(sorry for the duplicate mail, I misconfigured my sender address)

------------

I am aware of the following ideas, several of which came up at the FOSDEM Developer Meeting last Thursday [1][2]:

Reducing the size of the XLog record header
============
We currently store XIDs in every XLog record header, while only some records need them. E.g. most index records do not need XIDs for recovery, nor for conflict resolution.
The length field is 4 bytes, but many records are shorter than UINT8_MAX bytes, and most are shorter than UINT16_MAX bytes. Storing the total record length in a variable-length field could reduce the average record overhead even more.
There are currently 2 bytes of alignment losses in the XLog header.
We may not need to store the xl_prev pointer to the previous record; an implicit reference contained only in the checksum _may_ be enough.
Together, these updates could reduce the header size by several bytes in the common case.

Reducing the size of the XLog block header
============
The block header has a 2-byte length field. Not all registered blocks have associated data, and many records carry only a little data (< UINT8_MAX bytes). It would make sense to make this field variable-length, too.

Reducing the size of RelFileLocator, BlockNumber
============
Most of the time, the values in RelFileLocator's fields (and the BlockNumber) are small enough to fit in 2 or 3 bytes. We could shave off some more bytes here, if the values are low enough and if we have the bits to spare.

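To make the "variable-length field" idea a bit more concrete, here is a minimal LEB128-style sketch of the kind of encoding that could serve both the length fields above and small RelFileLocator / BlockNumber values. The helpers varint_encode_u32 / varint_decode_u32 are made up for this mail, not existing PostgreSQL functions: values below 2^7 take one byte, values below 2^14 take two, and a full 32-bit value takes at most five.

/*
 * Sketch only, not PostgreSQL code: an unsigned LEB128-style encoding
 * for the small integers in WAL headers (record/block data lengths,
 * BlockNumber, RelFileLocator fields).  Each byte carries 7 data bits;
 * the high bit signals that another byte follows.
 */
#include <stddef.h>
#include <stdint.h>

static size_t
varint_encode_u32(uint8_t *buf, uint32_t value)
{
    size_t      len = 0;

    while (value >= 0x80)
    {
        buf[len++] = (uint8_t) (value | 0x80);  /* low 7 bits + continuation bit */
        value >>= 7;
    }
    buf[len++] = (uint8_t) value;   /* final byte, continuation bit clear */
    return len;
}

static size_t
varint_decode_u32(const uint8_t *buf, uint32_t *value)
{
    size_t      len = 0;
    int         shift = 0;
    uint32_t    result = 0;

    do
    {
        result |= (uint32_t) (buf[len] & 0x7F) << shift;
        shift += 7;
    } while (buf[len++] & 0x80);

    *value = result;
    return len;
}

With something like this, a block reference whose relfilenumber and block number are small would shrink from the fixed 4 + 12 bytes to a handful of bytes, at the cost of a slightly branchier encode/decode path. Whether the IDs in a real cluster actually stay small is, of course, exactly the concern raised in the feasibility notes below.
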
Reducing the need for FPIs in some cases
============
We log FPIs when we modify a page for the first time after a checkpoint, to make sure we can recover from torn writes. However, as long as the data we change is at known offsets in the page (e.g. in the page's checksum field), we could also choose to _not_ emit an FPI, and instead mark the page as 'dirty + partial FPI emitted'. Redo would _always_ have to replay these partial FPI images, but the overall WAL volume should decrease dramatically for checksum-enabled systems, where hint bit update WAL records can account for a large portion of the WAL volume. We'd need to update the rules on when we emit an FPI, but I think this is not impossible and indeed quite feasible.

Some ideas about this "partial FPI":
- As long as the line pointer array is not being modified, we can use a partial FPI for updates to the header flag bits of a page's items (assuming we know from the xlog record what was modified): the items remain at their locations in the page, so even with torn writes the references on the page don't change, allowing us to use them to locate an item's header and overwrite its bytes.
- We can definitely use this partial FPI for visibility hint bits, and use it for VM-bit update logging as well.
- The buffer descriptor (and/or page header bits) would need indicator flags not only for 'is dirty', but also for indicating that the page has only had a partial FPI this checkpoint, so that when the page is modified again in a way that is incompatible with the "partial FPI" rules we don't forget to emit a full FPI.

See also: [2]

Reducing the overhead of reading large WAL records
============
Right now, WAL records are first read in full into a sequential buffer, then checksum-validated, and then parsed. For small records that fit on a page, we don't need to copy the data to calculate the checksum, but for records spanning multiple pages we write the whole record into a locally allocated buffer, and then copy all that data again when parsing the record's data. See for example the index build records generated during GIN index build on a table with data - a single GIN record containing several FPIs of the newly built index spans several xlog pages. During redo this currently requires two allocations of that record's size: one for a temporary buffer to hold the data, and one for the DecodedXLogRecord struct that will be used in the redo machinery. This feels extremely wasteful. I think we should be able to parse and checksum a record's data in the same pass: all but the first 861 bytes of a record are data sections, which can be copied and aligned directly into the relevant sections of the DecodedXLogRecord.

Parallelizing recovery
============
In a system that is not in use (i.e. crash recovery), each block's WAL needs to be replayed sequentially, but there is no need to replay the WAL as a whole sequentially. We could distribute WAL records across threads for replay, with some extra coordination required for multi-block records (such as HEAP/UPDATE) and extra-relational records (such as XACT/COMMIT).
This may extend to replicas, but I'm not confident that all apply orders are OK as long as each block's WAL is applied linearly. A prime counterexample is index insertions: an index tuple may only be inserted after the heap tuple it points to, or the index would report that it is corrupted because it can't find the heap tuples it is pointing to.
I think the XLog records of different databases could be replayed in parallel, but I'm not super confident about this; as long as we synchronize on shared-information records (like updates to the catalogs, or commits), I *think* we should be fine.

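To sketch what "distribute WAL records across threads" could look like, here is a rough routing function. The types and names below are simplified stand-ins invented for this mail (not actual recovery code): single-block records are assigned to a worker by hashing their block reference, so all WAL for a given block is replayed in order by the same worker, while multi-block and extra-relational records are treated as barriers that every worker must reach before they are applied.

/*
 * Sketch only: hypothetical routing of decoded WAL records across
 * N replay workers.  The structs are simplified stand-ins for
 * PostgreSQL's RelFileLocator and decoded record; barrier handling,
 * queueing and error paths are all elided.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct BlockRef
{
    uint32_t    spcOid;         /* tablespace OID */
    uint32_t    dbOid;          /* database OID */
    uint32_t    relNumber;      /* relation file number */
    uint32_t    blockNum;       /* block within the relation */
} BlockRef;

typedef struct RecordRef
{
    int         nblocks;        /* number of registered blocks */
    BlockRef    blocks[4];      /* simplified; real records allow more */
    bool        extra_relational;   /* e.g. XACT/COMMIT, catalog changes */
} RecordRef;

/* FNV-1a hash, so that all WAL for one block lands on the same worker. */
static inline uint32_t
block_hash(const BlockRef *b)
{
    const uint8_t *p = (const uint8_t *) b;
    uint32_t    h = 2166136261u;

    for (size_t i = 0; i < sizeof(*b); i++)
        h = (h ^ p[i]) * 16777619u;
    return h;
}

/*
 * Return the worker that should replay this record, or -1 when the
 * record must act as a barrier: all workers drain their queues before
 * it is applied (multi-block records such as heap UPDATE, and
 * extra-relational records such as commits or catalog updates).
 */
static int
choose_replay_worker(const RecordRef *rec, int nworkers)
{
    if (rec->extra_relational || rec->nblocks != 1)
        return -1;
    return (int) (block_hash(&rec->blocks[0]) % nworkers);
}

Hot-standby replay would need additional ordering/dependency tracking on top of something like this, per the index-vs-heap insertion concern above.
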
"Reducing the size of the XLog record header" takes more effort, but I think this too is possible in the near future. I'm planning on providing this in a patch for 17 as well. "Reducing the size of RelFileLocator, BlockNumber" has been proposed by Andres Freund in [0], and showed some nice reduction in WAL size in his testing. However, I am concerned about the potentially significantly different performance between new clusters and older clusters due to the significantly different compressible database IDs / relfilenumber / blocknumber in those systems. I do think it is worth looking into improving it, but I think we'd need changes to how relation and database IDs are generated first: sharing a counter with TOAST (and other things that could generate apparent gaps in a database's (or cluster's) assigned ids) is in my opinion extremely unfavorable for the efficiency of the feature. I think that "Reducing the need for FPIs in some cases" might be somewhat controversial, and would need to be considered separate from any other changes. I only (re?)discovered this idea last week, so I haven't yet discovered all the pros and cons. But, because it could potentially reduce the IO overhead of maintenance operations by up to 50% (still one dirty page write, but that's down from a full page in WAL + dirty page write), I think it is definitely worth looking into. "Reducing the overhead of reading large WAL records" requires another update in the WAL reader. I don't think there are many controversial components to this, though, as it is mostly refactoring and making sure we don't lose performance. "Parallelizing recovery" is not impossible, but requires significant work regarding ordering of operations before it would be usable on hot standby nodes. I did find a wiki page on the topic [3], and it contains some further reading and links to a git branch with work on this topic (latest based on 14.6). [0] "problems with making relfilenodes 56-bits": https://www.postgresql.org/message-id/flat/CA%2BTgmoaa9Yc9O-FP4vS_xTKf8Wgy8TzHpjnjN56_ShKE%3DjrP-Q%40mail.gmail.com [1] https://wiki.postgresql.org/wiki/FOSDEM/PGDay_2023_Developer_Meeting#XLog_Format [2] https://wiki.postgresql.org/wiki/FOSDEM/PGDay_2023_Developer_Meeting#Page_Format [3] https://wiki.postgresql.org/wiki/Parallel_Recovery