Hi,
As discussed on list, over IM and in person at pgcon I want to make
multixact truncations be WAL logged to address various bugs.
Since that's a comparatively large and invasive change I thought it'd be
a good idea to start a new thread instead of burying it in an already
long thread.
Here's the commit message which hopefully explains what's being changed
and why:
Rework the way multixact truncations work.
The fact that multixact truncations are not WAL logged has caused a fair
share of problems. Amongst others, it requires doing computations during
recovery while the database is not in a consistent state, delaying
truncations until checkpoints, and handling the case where members have
been truncated but offsets have not.
We tried to put bandaids on lots of these issues over the last few years,
but it seems time to change course. Thus this patch introduces WAL
logging for truncation, even in the back branches.
This allows:
1) to perform the truncation directly during VACUUM, instead of delaying it
to the checkpoint.
2) to avoid looking at the offsets SLRU for truncation during recovery;
we can just use the master's values.
3) to simplify a fair amount of the logic that keeps the in-memory limits
straight; this has gotten much easier.
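To make point 2 concrete, here is a toy model of the idea, not PostgreSQL
code. The record layout (TruncateRecord), the Node class and the function
names are all made up for illustration, and wraparound is ignored. It only
shows that a standby replaying a logged truncation can apply the primary's
cutoffs without consulting its own offsets data.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TruncateRecord:
    # Hypothetical payload: half-open ranges [start, end) to remove.
    start_multi: int
    end_multi: int
    start_member: int
    end_member: int


@dataclass
class Node:
    offsets: dict = field(default_factory=dict)  # multixact id -> first member offset
    members: dict = field(default_factory=dict)  # member offset -> xid
    oldest_multi: int = 1
    oldest_member: int = 0

    def apply_truncation(self, rec):
        # Uses only the values carried in the record.
        for multi in range(rec.start_multi, rec.end_multi):
            self.offsets.pop(multi, None)
        for off in range(rec.start_member, rec.end_member):
            self.members.pop(off, None)
        self.oldest_multi = rec.end_multi
        self.oldest_member = rec.end_member


def vacuum_truncate(primary, wal, new_oldest_multi):
    # The primary is consistent, so it can safely look up the member cutoff.
    rec = TruncateRecord(primary.oldest_multi, new_oldest_multi,
                         primary.oldest_member,
                         primary.offsets[new_oldest_multi])
    wal.append(rec)                # log the truncation first
    primary.apply_truncation(rec)  # then perform it
    return rec


def replay(standby, wal):
    for rec in wal:
        standby.apply_truncation(rec)
```

In this sketch the standby never reads its own offsets map to decide where
to cut, which is the property the WAL record is meant to provide.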
During the course of fixing this a bunch of bugs had to be fixed:
1) Data was not purged from the members SLRU's in-memory buffers before
deleting segments. This happened to be hard or impossible to hit due to
the interlock between checkpoints and truncation.
2) find_multixact_start() relied on SimpleLruDoesPhysicalPageExist(), but
that doesn't work for offsets that haven't yet been flushed to
disk. Flushing before running the check fixes that. Not pretty, but it
feels slightly safer to only make decisions based on on-disk state.
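Both bugs can be illustrated with a toy SLRU, again not PostgreSQL code.
ToySlru and its methods are hypothetical, and pages stand in for segments.
It shows why an on-disk existence check misses pages that are still only
in memory, and why cached pages have to be dropped before the files are
removed.

```python
class ToySlru:
    def __init__(self):
        self.disk = {}     # page number -> contents "on disk"
        self.buffers = {}  # page number -> (contents, dirty flag) in memory

    def write_page(self, pageno, data):
        # New data lands in memory only; it is dirty until flushed.
        self.buffers[pageno] = (data, True)

    def flush(self):
        # Write every dirty buffer out and mark it clean.
        for pageno, (data, dirty) in list(self.buffers.items()):
            if dirty:
                self.disk[pageno] = data
                self.buffers[pageno] = (data, False)

    def physical_page_exists(self, pageno):
        # Looks at on-disk state only, so unflushed pages are invisible.
        return pageno in self.disk

    def truncate_before(self, cutoff):
        # Purge cached pages first; otherwise a later flush could write a
        # stale dirty page back and recreate data that was just removed.
        for pageno in [p for p in self.buffers if p < cutoff]:
            del self.buffers[pageno]
        for pageno in [p for p in self.disk if p < cutoff]:
            del self.disk[pageno]
```

Calling flush() before physical_page_exists() makes the check reliable;
calling flush() after truncate_before() does not bring old pages back.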
To handle the case of an updated standby replaying WAL from a not-yet
upgraded primary we have to recognize that situation and use "old style"
truncation (i.e. looking at the SLRUs) during WAL replay. In contrast to
before this now happens in the startup process, when replaying a
checkpoint record, instead of in the checkpointer. Doing this in the
restartpoint was incorrect: restartpoints can happen much later than the
original checkpoint, thereby leading to wraparound. It's also more in
line with how the WAL logging now works.
To avoid "multixact_redo: unknown op code 48" errors, standbys should be
upgraded before primaries. This needs to be expressed clearly in the
release notes.
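The reason for that ordering can be shown with a toy redo dispatcher, which
is not the real multixact_redo(). The handler tables and the operation
descriptions are placeholders; only the value 48 (0x30) is taken from the
error message quoted above. A standby that doesn't know the new record type
fails when an upgraded primary sends it, while an upgraded standby handles
both old and new records.

```python
TRUNCATE_OP = 0x30  # decimal 48, the op code from the quoted error message

# Placeholder tables: what a not-yet-upgraded and an upgraded standby know.
OLD_HANDLERS = {
    0x00: "zero offsets page",
    0x10: "zero members page",
    0x20: "create multixact",
}
NEW_HANDLERS = dict(OLD_HANDLERS)
NEW_HANDLERS[TRUNCATE_OP] = "truncate"


def toy_multixact_redo(handlers, op_code):
    # Replay stops with an error on a record type it does not recognize.
    if op_code not in handlers:
        raise RuntimeError(f"multixact_redo: unknown op code {op_code}")
    return handlers[op_code]
```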
Backpatch to 9.3, where the use of multixacts was expanded. Arguably
this could be backpatched further, but there doesn't seem to be
sufficient benefit to outweigh the risk of applying a significantly
different patch there.
I've tested this a bunch, including using a newer standby against an
older master and such. What I have yet to test is that the concurrency
protections against multiple backends truncating at the same time are
correct.
It'd be very welcome to see some wider testing and review on this.
I've attached three commits:
0001: Add functions to burn through multixacts - that should get its own file.
0002: Lower the lower bound limits for *_freeze_max_age - I think we should
just do that. There really is no reason for the current limits
and they make testing hard and force space wastage.
0003: The actual truncation patch.
Greetings,
Andres Freund