Rework the way multixact truncations work - Mailing list pgsql-hackers

From Andres Freund
Subject Rework the way multixact truncations work
Msg-id 20150621192409.GA4797@alap3.anarazel.de
List pgsql-hackers
Hi,

As discussed on list, over IM, and in person at PGCon, I want to make
multixact truncations WAL-logged to address various bugs.

Since that's a comparatively large and invasive change, I thought it'd be
a good idea to start a new thread instead of burying it in an already
long thread.

Here's the commit message which hopefully explains what's being changed
and why:

    Rework the way multixact truncations work.

    The fact that multixact truncations are not WAL logged has caused a fair
    share of problems. Among others, it requires doing computations during
    recovery while the database is not in a consistent state, delaying
    truncations until checkpoints, and handling members being truncated
    while offsets are not.

    We tried to put bandaids on lots of these issues over the last years,
    but it seems time to change course. Thus this patch introduces WAL
    logging for truncation, even in the back branches.

    This allows:
    1) to perform the truncation directly during VACUUM, instead of delaying it
       to the checkpoint.
    2) to avoid looking at the offsets SLRU for truncation during recovery;
       we can just use the master's values.
    3) to simplify a fair amount of the logic that keeps the in-memory
       limits straight.

    During the course of fixing this a bunch of bugs had to be fixed:
    1) Data in the members SLRU was not purged from memory before deleting
       segments. This happened to be hard or impossible to hit due to the
       interlock between checkpoints and truncation.
    2) find_multixact_start() relied on SimpleLruDoesPhysicalPageExist, but
       that doesn't work for offsets that haven't yet been flushed to
       disk. The fix is to flush them out before running it. Not pretty,
       but it feels slightly safer to only make decisions based on on-disk
       state.

    To handle the case of an updated standby replaying WAL from a not-yet
    upgraded primary we have to recognize that situation and use "old style"
    truncation (i.e. looking at the SLRUs) during WAL replay. In contrast to
    before, this now happens in the startup process, when replaying a
    checkpoint record, instead of in the checkpointer. Doing this in the
    restartpoint was incorrect: restartpoints can happen much later than the
    original checkpoint, thereby leading to wraparound. It's also more in
    line with how the WAL logging now works.

    To avoid "multixact_redo: unknown op code 48" errors, standbys should be
    upgraded before primaries. This needs to be expressed clearly in the
    release notes.

    Backpatch to 9.3, where the use of multixacts was expanded. Arguably
    this could be backpatched further, but there doesn't seem to be
    sufficient benefit to outweigh the risk of applying a significantly
    different patch there.
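
To illustrate the upgrade-order point from the commit message, here is a
rough toy model in Python, not the actual C code from the patch. The op code
48 and the error text come from the commit message; the constant names, the
dict used as a record, and the Standby class are assumptions made up for this
sketch. It shows an upgraded standby applying the cutoffs shipped in the
truncation record (rather than computing them from its own SLRUs), and a
not-yet-upgraded standby failing on the op code it does not know.

```python
from dataclasses import dataclass, field

# Op codes for the toy resource manager. The value 48 (0x30) is taken from the
# error message in the commit message; the names are illustrative assumptions.
OP_CREATE_ID = 0x20
OP_TRUNCATE = 0x30


@dataclass
class Standby:
    """Toy standby that tracks which SLRU segment numbers still exist."""
    knows_truncate: bool
    offsets_segments: set = field(default_factory=lambda: set(range(8)))
    members_segments: set = field(default_factory=lambda: set(range(8)))

    def redo(self, record: dict) -> None:
        op = record["op"]
        if op == OP_CREATE_ID:
            return  # replay of multixact creation is not modelled here
        if op == OP_TRUNCATE and self.knows_truncate:
            # Apply the cutoffs computed on the primary; the standby does not
            # inspect its own offsets SLRU to decide what to remove.
            self.offsets_segments = {
                s for s in self.offsets_segments if s >= record["offsets_cutoff"]
            }
            self.members_segments = {
                s for s in self.members_segments if s >= record["members_cutoff"]
            }
            return
        raise RuntimeError(f"multixact_redo: unknown op code {op}")


# A truncation record as an upgraded primary might emit it (hypothetical fields).
truncate = {"op": OP_TRUNCATE, "offsets_cutoff": 3, "members_cutoff": 5}

upgraded = Standby(knows_truncate=True)
upgraded.redo(truncate)  # segments below the cutoffs are dropped

not_upgraded = Standby(knows_truncate=False)
try:
    not_upgraded.redo(truncate)
    message = None
except RuntimeError as exc:
    message = str(exc)
```

In this sketch the not-yet-upgraded standby stops at the first truncation
record, which is why the standbys have to be upgraded first.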


I've tested this a bunch, including using a newer standby against an
older master and such. What I have yet to test is that the concurrency
protections against multiple backends truncating at the same time are
correct.

It'd be very welcome to see some wider testing and review on this.

I've attached three commits:
0001: Add functions to burn through multixacts - that should get its own file.
0002: Lower the lower bound limits for *_freeze_max_age - I think we should
      just do that. There really is no reason for the current limits
      and they make testing hard and force space wastage.
0003: The actual truncation patch.

Greetings,

Andres Freund

