Thread: Rework the way multixact truncations work

Rework the way multixact truncations work

From
Andres Freund
Date:
Hi,

As discussed on list, over IM and in person at pgcon I want to make
multixact truncations be WAL logged to address various bugs.

Since that's a comparatively large and invasive change I thought it'd be
a good idea to start a new thread instead of burying it in a already
long thread.

Here's the commit message which hopefully explains what's being changed
and why:

    Rework the way multixact truncations work.

    The fact that multixact truncations are not WAL logged has caused a fair
    share of problems. Amongst others it requires to do computations during
    recovery while the database is not in a consistent state, delaying
    truncations till checkpoints, and handling members being truncated, but
    offset not.

    We tried to put bandaids on lots of these issues over the last years,
    but it seems time to change course. Thus this patch introduces WAL
    logging for truncation, even in the back branches.

    This allows:
    1) to perform the truncation directly during VACUUM, instead of delaying it
       to the checkpoint.
    2) to avoid looking at the offsets SLRU for truncation during recovery,
       we can just use the master's values.
    3) simplify a fair amount of logic to keep in memory limits straight,
       this has gotten much easier

    During the course of fixing this a bunch of bugs had to be fixed:
    1) Data was not purged from memory the member's slru before deleting
       segments. This happend to be hard or impossible to hit due to the
       interlock between checkpoints and truncation.
    2) find_multixact_start() relied on SimpleLruDoesPhysicalPageExist - but
       that doesn't work for offsets that haven't yet been flushed to
       disk. Flush out before running to fix. Not pretty, but it feels
       slightly safer to only make decisions based on on-disk state.

    To handle the case of an updated standby replaying WAL from a not-yet
    upgraded primary we have to recognize that situation and use "old style"
    truncation (i.e. looking at the SLRUs) during WAL replay. In contrast to
    before this now happens in the startup process, when replaying a
    checkpoint record, instead of the checkpointer. Doing this in the
    restartpoint was incorrect, they can happen much later than the original
    checkpoint, thereby leading to wraparound. It's also more in line to how
    the WAL logging now works.

    To avoid "multixact_redo: unknown op code 48" errors standbys should be
    upgraded before primaries. This needs to be expressed clearly in the
    release notes.

    Backpatch to 9.3, where the use of multixacts was expanded. Arguably
    this could be backpatched further, but there doesn't seem to be
    sufficient benefit to outweigh the risk of applying a significantly
    different patch there.


I've tested this a bunch, including using a newer standby against a
older master and such. What I have yet to test is that the concurrency
protections against multiple backends truncating at the same time are
correct.

It'd be very welcome to see some wider testing and review on this.

I've attached three commits:
0001: Add functions to burn through multixacts - that should get its own file.
0002: Lower the lower bound limits for *_freeze_max_age - I think we should
      just do that. There really is no reason for the current limits
      and they make testing hard and force space wastage.
0003: The actual truncation patch.

Greetings,

Andres Freund

Attachment

Re: Rework the way multixact truncations work

From
Alvaro Herrera
Date:
Andres Freund wrote:

>     Rework the way multixact truncations work.

I spent some time this morning reviewing this patch and had some
comments that I relayed over IM to Andres.  The vast majority is
cosmetic, but there are two larger things:

1. I think this part of PerformMembersTruncation() is very confusing:
    /* verify whether the current segment is to be deleted */    if (segment != startsegment && segment != endsegment)
     SlruDeleteSegment(MultiXactMemberCtl, segment);
 

I think this works correctly in that it preserves both endpoint files,
but the files in between are removed ... which is a confusing interface,
IMO.  I think this merits a longer explanation.


2. We set PGXACT->delayChkpt while the truncation is executed.  This
seems reasonable, and there's a good reason for it, but all the other
users of this facility only do small operations with this thing grabbed,
while the multixact truncation could take a long time because a large
number of files might be deleted.  Maybe it's not a problem to have
checkpoints be delayed by several seconds, or who knows maybe even a
minute in a busy system.  (We will have checkpointer sleeping in 10ms
intervals until the truncation is complete).

Maybe this is fine, not sure.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Rework the way multixact truncations work

From
Andres Freund
Date:
On 2015-06-26 14:48:35 -0300, Alvaro Herrera wrote:
> Andres Freund wrote:
> 
> >     Rework the way multixact truncations work.
> 
> I spent some time this morning reviewing this patch and had some
> comments that I relayed over IM to Andres.

Thanks for that!

> 2. We set PGXACT->delayChkpt while the truncation is executed.  This
> seems reasonable, and there's a good reason for it, but all the other
> users of this facility only do small operations with this thing grabbed,
> while the multixact truncation could take a long time because a large
> number of files might be deleted.  Maybe it's not a problem to have
> checkpoints be delayed by several seconds, or who knows maybe even a
> minute in a busy system.  (We will have checkpointer sleeping in 10ms
> intervals until the truncation is complete).

I don't think this is a problem. Consider that we're doing all this in
the checkpointer today, blocking much more than just the actual xlog
insertion. That's a bigger problem, as we'll not do the paced writing
during that and such. The worst thatthis can cause is a bunch of sleeps,
that seems fairly harmless.

Greetings,

Andres Freund



Re: Rework the way multixact truncations work

From
Thomas Munro
Date:
On Mon, Jun 22, 2015 at 7:24 AM, Andres Freund <andres@anarazel.de> wrote:
> It'd be very welcome to see some wider testing and review on this.

I started looking at this and doing some testing.  Here is some
initial feedback:

Perhaps vac_truncate_clog needs a new name now that it does more,
maybe vac_truncate_transaction_logs?

MultiXactState->sawTruncationCkptCyle: is 'Cyle' supposed to be 'Cycle'?

In the struct xl_multixact_truncate, and also the function
WriteMTruncateXlogRec and other places, I think you have confused the
typedefs MultiXactOffset and MultiXactId.  If I'm not mistaken,
startMemb and endMemb have the correct type, but startOff and endOff
should be of type MultiXactId despite their names (the *values* stored
inside pg_multixact/offsets are indeed offsets (into
pg_multixact/members), but their *location* is what a multixact ID
represents).

I was trying to understand if there could be any problem caused by
setting latest_page_number to the pageno that holds (or will hold)
xlrec.endOff in multixact_redo.  The only thing that jumps out at me
is that the next call to SlruSelectLRUPage will no longer be prevented
from evicting the *real* latest page (the most recently created page).

In SlruDeleteSegment, is it OK to call unlink() while holding the SLRU
control lock?

In find_multixact_start you call SimpleLruFlush before calling
SimpleLruDoesPhysicalPageExist.  Should we do something like this
instead?  https://gist.github.com/macdice/8e5b2f0fe3827fdf3d5a

I think saw some extra autovacuum activity that I didn't expect in a
simple scenario, but I'm not sure and will continue looking tomorrow.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: Rework the way multixact truncations work

From
Andres Freund
Date:
On 2015-06-29 23:54:40 +1200, Thomas Munro wrote:
> On Mon, Jun 22, 2015 at 7:24 AM, Andres Freund <andres@anarazel.de> wrote:
> > It'd be very welcome to see some wider testing and review on this.
> 
> I started looking at this and doing some testing.  Here is some
> initial feedback:
> 
> Perhaps vac_truncate_clog needs a new name now that it does more,
> maybe vac_truncate_transaction_logs?

It has done more before, so I don't really see a connection to this
patch...

> MultiXactState->sawTruncationCkptCyle: is 'Cyle' supposed to be 'Cycle'?

Oops.

> In the struct xl_multixact_truncate, and also the function
> WriteMTruncateXlogRec and other places, I think you have confused the
> typedefs MultiXactOffset and MultiXactId.  If I'm not mistaken,
> startMemb and endMemb have the correct type, but startOff and endOff
> should be of type MultiXactId despite their names (the *values* stored
> inside pg_multixact/offsets are indeed offsets (into
> pg_multixact/members), but their *location* is what a multixact ID
> represents).

IIRC I did it that way to make clear this is just 'byte' type offsets,
without other meaning. Wasn't a good idea.

> I was trying to understand if there could be any problem caused by
> setting latest_page_number to the pageno that holds (or will hold)
> xlrec.endOff in multixact_redo.  The only thing that jumps out at me
> is that the next call to SlruSelectLRUPage will no longer be prevented
> from evicting the *real* latest page (the most recently created page).

That hasn't changed unless I miss something?

> In SlruDeleteSegment, is it OK to call unlink() while holding the SLRU
> control lock?

I think it's safer than not doing it, but don't particularly care.

> In find_multixact_start you call SimpleLruFlush before calling
> SimpleLruDoesPhysicalPageExist.  Should we do something like this
> instead?  https://gist.github.com/macdice/8e5b2f0fe3827fdf3d5a

I'm currently slightly inclined to do it "my way". They way these
functions are used it doesn't seem like a bad property to ensure things
are on disk.

> I think saw some extra autovacuum activity that I didn't expect in a
> simple scenario, but I'm not sure and will continue looking tomorrow.

Cool, thanks!

Greetings,

Andres Freund



Re: Rework the way multixact truncations work

From
Andres Freund
Date:
Hi,

I'm still working on this patch. I found a bunch of issues. Non of them
super critical and some pre-existing; but nonetheless I don't feel like
it's ready to push yet. So we'll have the first alpha without those
fixes :(

The biggest issues so far are:

* There's, both in the posted patch and as-is in all the branches, a lot
  of places that aren't actually safe against a concurrent truncation. A
  bunch of places grab oldestMultiXactId with an lwlock held, release
  it, and then make decisions based on that.

  A bunch of places (including the find_multixact_start callsites) is
  actually vulnerable to that, for a bunch of others it's less likely to
  be a problem.  All callers of GetMultiXactIdMembers() are vulnerable,
  but with the exception of pg_get_multixact_members() they'll never
  pass in a value that's older than the new oldest member value.

  That's a problem for the current branches. Afaics that can lead to a
  useless round of emergency autovacs via
  SetMultiXactIdLimit()->SetOffsetVacuumLimit().

  SetOffsetVacuumLimit() can protect easily agains that by taking the
  new MultiXactTruncationLock lock. We could do the same for
  pg_get_multixact_members() - afaics the only caller that'll look up too
  old values otherwise - but I don't think it matters, you'll get a
  slightly obscure error if you access a too old xact that's just being
  truncated away and that is that.

* There was no update of the in-memory oldest* values when replaying a
  truncation. That's problematic if a standby is promoted after
  replaying a truncation record, but before a following checkpoint
  record. This would be fixed by an emergency autovacuum, but that's
  obviously not nice. Trivial to fix.

* The in-memory oldest values were updated *after* the truncation
  happened. It's unlikely to matter in reality, but it's safer to
  update them before, so a concurrent GetMultiXactIdMembers() of stuff
  from before the truncation will get the proper error.

* PerformMembersTruncation() probably confused Alvaro because it wasn't
  actually correct - there's no need to have the segment containing old
  oldestOffset (in contrast to oldestOffsetAlive) survive. Except
  leaking a segment that's harmless, but obviously not desirable.

Additionally I'm changing some stuff, some requested by review:
* xl_multixact_truncate's members are now called
  (start|end)Trunc(Off|Memb)
* (start|end)TruncOff have the appropriate type now
* typo fixes
* comment improvements
* pgindent

New version attached.

Greetings,

Andres Freund

Attachment

Re: Rework the way multixact truncations work

From
Robert Haas
Date:
On Mon, Jun 29, 2015 at 3:48 PM, Andres Freund <andres@anarazel.de> wrote:
> New version attached.

0002 looks good, but the commit message should perhaps mention the
comment fix.  Or commit that separately.

Will look at 0003 next.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Rework the way multixact truncations work

From
Robert Haas
Date:
On Thu, Jul 2, 2015 at 11:52 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> Will look at 0003 next.

+        appendStringInfo(buf, "offsets [%u, %u), members [%u, %u)",

I don't think we typically use this style for notating intervals.
        case XLOG_MULTIXACT_CREATE_ID:            id = "CREATE_ID";            break;
+        case XLOG_MULTIXACT_TRUNCATE_ID:
+            id = "TRUNCATE";
+            break;

If XLOG_MULTIXACT_CREATE_ID -> "CREATE_ID", then why not
XLOG_MULTIXACT_TRUNCATE_ID -> "TRUNCATE_ID"?

+     * too old to general truncation records.

s/general/generate/

+    MultiXactId oldestMXactDB;

Data type should be OID.

+     * Recompute limits as we are now fully started, we now can correctly
+     * compute how far a members wraparound is away.

s/,/:/ or something.  This isn't particularly good English as written.

+     * Computing the actual limits is only possible once the data directory is
+     * in a consistent state. There's no need to compute the limits while
+     * still replaying WAL as no new multis can be created anyway. So we'll
+     * only do further checks after TrimMultiXact() has been called.

Multis can be and are created during replay.  What this should really
say is that we have no choice about whether to create them or not: we
just have to replay whatever's there.

+                    (errmsg("performing legacy multixact truncation,
upgrade master")));

This message needs work.  I'm not sure exactly what it should say, but
I'm pretty sure that's not clear enough.

I seriously, seriously doubt that it is a good idea to perform the
legacy truncation from MultiXactAdvanceOldest() rather than
TruncateMultiXact().  The checkpoint hasn't really happened at that
point yet; you might truncate away stuff, then crash before the
checkpoint is complete, and then we you restart recovery, you've got
trouble.  I think you should be very, very cautious about rejiggering
the order of operations here.  The current situation is not good, but
casually rejiggering it can make things much worse.

-     * If no multixacts exist, then oldestMultiXactId will be the next
-     * multixact that will be created, rather than an existing multixact.
+     * Determine the offset of the oldest multixact.  Normally, we can read
+     * the offset from the multixact itself, but there's an important special
+     * case: if there are no multixacts in existence at all, oldestMXact
+     * obviously can't point to one.  It will instead point to the multixact
+     * ID that will be assigned the next time one is needed.

There's no need to change this; it means the same thing either way.

Generally, I think you've weakened the logic in SetOffsetVacuumLimit()
appreciably here.  The existing code is careful never to set
oldestOffsetKnown false when it was previously true.  Your rewrite
removes that property.  Generally, I see no need for this function to
be overhauled to the extent that you have, and would suggest reverting
the changes that aren't absolutely required.

I don't particularly like the fact that find_multixact_start() calls
SimpleLruFlush().  I think that's really a POLA violation: you don't
expect that a function that looks like a simple inquiry is going to go
do a bunch of unrelated I/O in the background.  If somebody called
find_multixact_start() with any frequency, you'd be sad.  You're just
doing it this way because of the context *in which you expect
find_multixact_start* to be called, which does not seem very
future-proof.  I prefer Thomas's approach.

If TruncateMultiXact() fails to acquire MultiXactTruncationLock right
away, does it need to wait, or could it ConditionalAcquire and bail
out if the lock isn't obtained?

+     * Make sure to only attempt truncation if there's values to truncate
+     * away. In normal processing values shouldn't go backwards, but there's
+     * some corner cases (due to bugs) where that's possible.

I think this comment should be more detailed.  Is that talking about
the same thing as this comment:

-     * Due to bugs in early releases of PostgreSQL 9.3.X and 9.4.X,
-     * oldestMXact might point to a multixact that does not exist.
-     * Autovacuum will eventually advance it to a value that does exist,
-     * and we want to set a proper offsetStopLimit when that happens,
-     * so call DetermineSafeOldestOffset here even if we're not actually
-     * truncating.

This comment seems to be saying there's a race condition:

+     * XXX: It's also possible that the page that oldestMXact is on has
+     * already been truncated away, and we crashed before updating
+     * oldestMXact.

But why is that an XXX?  I think this is just a case of recovery
needing tolerate redo of an action already redone.

I'm not convinced that it's a good idea to remove
lastCheckpointedOldest and replace it with nothing.  It seems like a
very good idea to have two separate pieces of state in shared memory:

- The oldest offset that we think anyone might need to access to make
a visibility check for a tuple.
- The oldest offset that we still have on disk.

The latter would now need to be called something else, not
lastCheckpointedOldest, but I think it's still good to have it.
Otherwise, I don't see how you protect against the on-disk state
wrapping around before you finish truncating, and then maybe
truncation eats something that was busy getting reused.  We might be
kind of hosed in that situation anyway, because TruncateMultiXact()
and some other places assume that circular comparisons will return
sensible values.  But that could be fixed, and probably should be
fixed eventually.

+                (errmsg("supposedly still alive MultiXact %u not
found, skipping truncation",

Maybe "cannot truncate MultiXact %u because it does not exist on disk,
skipping truncation"?

I think "frozenMulti" is a slightly confusing variable name and
deserves a comment.  AUIU, that's the oldest multiXact we need to
keep.  So it's actually the oldest multi that is NOT guaranteed to be
frozen.  minMulti might be a better variable name, but a comment is
justified either way.

+     * Update in-memory limits before performing the truncation, while inside
+     * the critical section: Have to do it before truncation, to prevent
+     * concurrent lookups of those values. Has to be inside the critical
+     * section asotherwise a future call to this function would error out,
+     * while looking up the oldest member in offsets, if our caller crashes
+     * before updating the limits.

Missing space: asotherwise.

Who else might be concurrently looking up those values?  Nobody else
can truncate while we're truncating, because we hold
MultiXactTruncationLock.  And nobody else should be getting here from
looking up tuples, because if they are, we truncated too soon.

-     * Startup MultiXact.  We need to do this early for two reasons: one is
-     * that we might try to access multixacts when we do tuple freezing, and
-     * the other is we need its state initialized because we attempt
-     * truncation during restartpoints.
+     * Startup MultiXact, we need to do this early, to be able to replay
+     * truncations.

The period after "Startup MultiXact" was more correct than the comma
you've replaced it with.

Phew.  That's all I see on a first read-through, but I've only spent a
couple of hours on this, so I might easily have missed some things.
But let me stop here, hit send, and see what you think of these
comments.

Thanks,

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Rework the way multixact truncations work

From
Andres Freund
Date:
On 2015-07-02 13:58:45 -0400, Robert Haas wrote:
> On Thu, Jul 2, 2015 at 11:52 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> > Will look at 0003 next.
> 
> +        appendStringInfo(buf, "offsets [%u, %u), members [%u, %u)",
> 
> I don't think we typically use this style for notating intervals.

I don't think we really have a very consistent style for xlog messages -
this seems to describe the meaning accurately?

> [several good points]

> +                    (errmsg("performing legacy multixact truncation,
> upgrade master")));
> 
> This message needs work.  I'm not sure exactly what it should say, but
> I'm pretty sure that's not clear enough.
> 
> I seriously, seriously doubt that it is a good idea to perform the
> legacy truncation from MultiXactAdvanceOldest() rather than
> TruncateMultiXact().

But where should TruncateMultiXact() be called from? I mean, we could
move the logic from inside MultiXactAdvanceOldest() to some special case
in the replay routine, but what'd be the advantage?

> The checkpoint hasn't really happened at that point yet; you might
> truncate away stuff, then crash before the checkpoint is complete, and
> then we you restart recovery, you've got trouble.

We're only talking about restartpoints here, right? And I don't see the
problem - we don't read the slru anymore until the end of recovery, and
the end of recovery can't happen before reaching the minimum revovery
location?

> I think you should
> be very, very cautious about rejiggering the order of operations here.
> The current situation is not good, but casually rejiggering it can
> make things much worse.

The current placement - as part of the restartpoint - is utterly broken
and unpredictable. There'll frequently be no restartpoints performed at
all (due to different checkpoint segments, slow writeout, or pending
actions).  Because there's no careful timing of when this happens it's
much harder to understand the exact state in which the truncation
happens - I think moving it to a specific location during replay makes
things considerably easier.

> Generally, I think you've weakened the logic in SetOffsetVacuumLimit()
> appreciably here.  The existing code is careful never to set
> oldestOffsetKnown false when it was previously true.  Your rewrite
> removes that property.  Generally, I see no need for this function to
> be overhauled to the extent that you have, and would suggest reverting
> the changes that aren't absolutely required.

A lot of that has to do that the whole stuff about truncations happening
during checkpoints is gone and that thus the split with
DetermineSafeOldestOffset() doesn't make sense anymore.

That oldestOffsetKnown is unset is wrong - the if (prevOldestOffsetKnown
&& !oldestOffsetKnown) block should be a bit earlier.

> I don't particularly like the fact that find_multixact_start() calls
> SimpleLruFlush().  I think that's really a POLA violation: you don't
> expect that a function that looks like a simple inquiry is going to go
> do a bunch of unrelated I/O in the background.  If somebody called
> find_multixact_start() with any frequency, you'd be sad.  You're just
> doing it this way because of the context *in which you expect
> find_multixact_start* to be called, which does not seem very
> future-proof.  I prefer Thomas's approach.

I don't strongly care, but I do think it has some value to be sure about
the on-disk state for the current callers.  I think it'd be a pretty odd
thing to call find_multixact_start() frequently.

> If TruncateMultiXact() fails to acquire MultiXactTruncationLock right
> away, does it need to wait, or could it ConditionalAcquire and bail
> out if the lock isn't obtained?

That seems like premature optimization to me. And one that's not that
easy to do correctly - what if the current caller actually has a new,
lower, minimum mxid?

> +     * XXX: It's also possible that the page that oldestMXact is on has
> +     * already been truncated away, and we crashed before updating
> +     * oldestMXact.
> 
> But why is that an XXX?  I think this is just a case of recovery
> needing tolerate redo of an action already redone.

Should rather have been NB.

> I'm not convinced that it's a good idea to remove
> lastCheckpointedOldest and replace it with nothing.  It seems like a
> very good idea to have two separate pieces of state in shared memory:

> - The oldest offset that we think anyone might need to access to make
> a visibility check for a tuple.
> - The oldest offset that we still have on disk.
> 
> The latter would now need to be called something else, not
> lastCheckpointedOldest, but I think it's still good to have it.

> Otherwise, I don't see how you protect against the on-disk state
> wrapping around before you finish truncating, and then maybe
> truncation eats something that was busy getting reused.

Unless I miss something the stop limits will prevent that from
happening? SetMultiXactIdLimit() is called only *after* the truncation
has finished?

> +     * Update in-memory limits before performing the truncation, while inside
> +     * the critical section: Have to do it before truncation, to prevent
> +     * concurrent lookups of those values. Has to be inside the critical
> +     * section asotherwise a future call to this function would error out,
> +     * while looking up the oldest member in offsets, if our caller crashes
> +     * before updating the limits.
> 
> Missing space: asotherwise.
> 
> Who else might be concurrently looking up those values?  Nobody else
> can truncate while we're truncating, because we hold
> MultiXactTruncationLock.  And nobody else should be getting here from
> looking up tuples, because if they are, we truncated too soon.

pg_get_multixact_members(), a concurrent call to SetMultiXactIdLimit()
(SetOffsetLimit()->find_multixact_start()) from vac_truncate_clog().

> Phew.  That's all I see on a first read-through, but I've only spent a
> couple of hours on this, so I might easily have missed some things.
> But let me stop here, hit send, and see what you think of these
> comments.

Thanks for the look so far!

Greetings,

Andres Freund



Re: Rework the way multixact truncations work

From
Robert Haas
Date:
On Thu, Jul 2, 2015 at 2:28 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2015-07-02 13:58:45 -0400, Robert Haas wrote:
>> On Thu, Jul 2, 2015 at 11:52 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> > Will look at 0003 next.
>>
>> +        appendStringInfo(buf, "offsets [%u, %u), members [%u, %u)",
>>
>> I don't think we typically use this style for notating intervals.
>
> I don't think we really have a very consistent style for xlog messages -
> this seems to describe the meaning accurately?

Although I realize this is supposed to be interval notation, I'm not
sure everyone will immediately figure that out.  I believe it has
created some confusion in the past.  I'm not going to spend a lot of
time arguing with you about it, but I'd do something else, like
offsets from %u stop before %u, members %u stop before %u.

>> [several good points]
>
>> +                    (errmsg("performing legacy multixact truncation,
>> upgrade master")));
>>
>> This message needs work.  I'm not sure exactly what it should say, but
>> I'm pretty sure that's not clear enough.
>>
>> I seriously, seriously doubt that it is a good idea to perform the
>> legacy truncation from MultiXactAdvanceOldest() rather than
>> TruncateMultiXact().
>
> But where should TruncateMultiXact() be called from? I mean, we could
> move the logic from inside MultiXactAdvanceOldest() to some special case
> in the replay routine, but what'd be the advantage?

I think you should call it from where TruncateMultiXact() is being
called from today.  Doing legacy truncations from a different place
than we're currently doing them just gives us more ways to be wrong.

>> The checkpoint hasn't really happened at that point yet; you might
>> truncate away stuff, then crash before the checkpoint is complete, and
>> then we you restart recovery, you've got trouble.
>
> We're only talking about restartpoints here, right? And I don't see the
> problem - we don't read the slru anymore until the end of recovery, and
> the end of recovery can't happen before reaching the minimum revovery
> location?

You're still going to have to read the SLRU for as long as you are
doing legacy truncations, at least.

>> If TruncateMultiXact() fails to acquire MultiXactTruncationLock right
>> away, does it need to wait, or could it ConditionalAcquire and bail
>> out if the lock isn't obtained?
>
> That seems like premature optimization to me. And one that's not that
> easy to do correctly - what if the current caller actually has a new,
> lower, minimum mxid?

Doesn't the next truncation just catch up?  But sure, I agree this is
inessential (and maybe better left alone for now).

>> I'm not convinced that it's a good idea to remove
>> lastCheckpointedOldest and replace it with nothing.  It seems like a
>> very good idea to have two separate pieces of state in shared memory:
>
>> - The oldest offset that we think anyone might need to access to make
>> a visibility check for a tuple.
>> - The oldest offset that we still have on disk.
>>
>> The latter would now need to be called something else, not
>> lastCheckpointedOldest, but I think it's still good to have it.
>
>> Otherwise, I don't see how you protect against the on-disk state
>> wrapping around before you finish truncating, and then maybe
>> truncation eats something that was busy getting reused.
>
> Unless I miss something the stop limits will prevent that from
> happening? SetMultiXactIdLimit() is called only *after* the truncation
> has finished?

Hmm, that might be, I'd have to reread the patch.  The reason we
originally had it this way was because VACUUM was updating the limit
and then checkpoint was truncating, but now I guess vacuum + truncate
happen so close together that you might only need one value.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Rework the way multixact truncations work

From
Andres Freund
Date:
(quick answer, off now)

On 2015-07-05 14:20:11 -0400, Robert Haas wrote:
> On Thu, Jul 2, 2015 at 2:28 PM, Andres Freund <andres@anarazel.de> wrote:
> > On 2015-07-02 13:58:45 -0400, Robert Haas wrote:
> >> I seriously, seriously doubt that it is a good idea to perform the
> >> legacy truncation from MultiXactAdvanceOldest() rather than
> >> TruncateMultiXact().
> >
> > But where should TruncateMultiXact() be called from? I mean, we could
> > move the logic from inside MultiXactAdvanceOldest() to some special case
> > in the replay routine, but what'd be the advantage?
> 
> I think you should call it from where TruncateMultiXact() is being
> called from today.  Doing legacy truncations from a different place
> than we're currently doing them just gives us more ways to be wrong.

The problem with that is that the current location is just plain
wrong. Restartpoints can be skipped (due different checkpoint segments
settings), may not happen at all (pending incomplete actions), and can
just be slowed down.

That's a currently existing bug that's easy to reproduce.

> >> The checkpoint hasn't really happened at that point yet; you might
> >> truncate away stuff, then crash before the checkpoint is complete, and
> >> then we you restart recovery, you've got trouble.
> >
> > We're only talking about restartpoints here, right? And I don't see the
> > problem - we don't read the slru anymore until the end of recovery, and
> > the end of recovery can't happen before reaching the minimum revovery
> > location?
> 
> You're still going to have to read the SLRU for as long as you are
> doing legacy truncations, at least.

I'm not following. Sure, we read the SLRUs as we do today. But, in
contrast to the current positioning in recovery, with the patch they're
done at pretty much the same point on the standby as on the primary
today?

Greetings,

Andres Freund



Re: Rework the way multixact truncations work

From
Robert Haas
Date:
On Sun, Jul 5, 2015 at 2:28 PM, Andres Freund <andres@anarazel.de> wrote:
> (quick answer, off now)
>
> On 2015-07-05 14:20:11 -0400, Robert Haas wrote:
>> On Thu, Jul 2, 2015 at 2:28 PM, Andres Freund <andres@anarazel.de> wrote:
>> > On 2015-07-02 13:58:45 -0400, Robert Haas wrote:
>> >> I seriously, seriously doubt that it is a good idea to perform the
>> >> legacy truncation from MultiXactAdvanceOldest() rather than
>> >> TruncateMultiXact().
>> >
>> > But where should TruncateMultiXact() be called from? I mean, we could
>> > move the logic from inside MultiXactAdvanceOldest() to some special case
>> > in the replay routine, but what'd be the advantage?
>>
>> I think you should call it from where TruncateMultiXact() is being
>> called from today.  Doing legacy truncations from a different place
>> than we're currently doing them just gives us more ways to be wrong.
>
> The problem with that is that the current location is just plain
> wrong. Restartpoints can be skipped (due different checkpoint segments
> settings), may not happen at all (pending incomplete actions), and can
> just be slowed down.
>
> That's a currently existing bug that's easy to reproduce.

You might be right; I haven't tested that.

On the other hand, in the common case, by the time we perform a
restartpoint, we're consistent: I think the main exception to that is
if we do a base backup that spans multiple checkpoints.  I think that
in the new location, the chances that the legacy truncation is trying
to read inconsistent data is probably higher.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Rework the way multixact truncations work

From
Andres Freund
Date:
On July 5, 2015 8:50:57 PM GMT+02:00, Robert Haas <robertmhaas@gmail.com> wrote:
>On Sun, Jul 5, 2015 at 2:28 PM, Andres Freund <andres@anarazel.de>
>wrote:
>> (quick answer, off now)
>>
>> On 2015-07-05 14:20:11 -0400, Robert Haas wrote:
>>> On Thu, Jul 2, 2015 at 2:28 PM, Andres Freund <andres@anarazel.de>
>wrote:
>>> > On 2015-07-02 13:58:45 -0400, Robert Haas wrote:
>>> >> I seriously, seriously doubt that it is a good idea to perform
>the
>>> >> legacy truncation from MultiXactAdvanceOldest() rather than
>>> >> TruncateMultiXact().
>>> >
>>> > But where should TruncateMultiXact() be called from? I mean, we
>could
>>> > move the logic from inside MultiXactAdvanceOldest() to some
>special case
>>> > in the replay routine, but what'd be the advantage?
>>>
>>> I think you should call it from where TruncateMultiXact() is being
>>> called from today.  Doing legacy truncations from a different place
>>> than we're currently doing them just gives us more ways to be wrong.
>>
>> The problem with that is that the current location is just plain
>> wrong. Restartpoints can be skipped (due different checkpoint
>segments
>> settings), may not happen at all (pending incomplete actions), and
>can
>> just be slowed down.
>>
>> That's a currently existing bug that's easy to reproduce.
>
>You might be right; I haven't tested that.
>
>On the other hand, in the common case, by the time we perform a
>restartpoint, we're consistent: I think the main exception to that is
>if we do a base backup that spans multiple checkpoints.  I think that
>in the new location, the chances that the legacy truncation is trying
>to read inconsistent data is probably higher.

The primary problem isn't that we truncate too early, it's that we delay truncation on the standby in comparison to the
primaryby a considerable amount. All the while continuing to replay multi creations. 
 

I don't see the difference wrt. consistency right now, but I don't have access to the code right now. I mean we *have*
todo something while inconsistent. A start/stop backup can easily span a day or four. 
 

Andres

--- 
Please excuse brevity and formatting - I am writing this on my mobile phone.



Re: Rework the way multixact truncations work

From
Robert Haas
Date:
On Sun, Jul 5, 2015 at 3:16 PM, Andres Freund <andres@anarazel.de> wrote:
>>On the other hand, in the common case, by the time we perform a
>>restartpoint, we're consistent: I think the main exception to that is
>>if we do a base backup that spans multiple checkpoints.  I think that
>>in the new location, the chances that the legacy truncation is trying
>>to read inconsistent data is probably higher.
>
> The primary problem isn't that we truncate too early, it's that we delay truncation on the standby in comparison to
theprimary by a considerable amount. All the while continuing to replay multi creations.
 
>
> I don't see the difference wrt. consistency right now, but I don't have access to the code right now. I mean we
*have*to do something while inconsistent. A start/stop backup can easily span a day or four.
 

So, where are we with this patch?

In my opinion, we ought to do something about master and 9.5 before
beta, so that we're doing *yet another* major release with unfixed
multixact bugs.  Let's make the relevant truncation changes in master
and 9.5 and bump the WAL page magic, so that a 9.5alpha standby can't
be used with a 9.5beta master.  Then, we don't need any of this legacy
truncation stuff at all, and 9.5 is hopefully in a much better state
than 9.4 and 9.3.

Now, that still potentially leaves 9.4 and 9.3 users hanging out to
dry.  But we don't have a tremendous number of those people clamoring
about this, and if we get 9.5+ correct, then we can go and change the
logic in 9.4 and 9.3 later when, and if, we are confident that's the
right thing to do.  I am still not altogether convinced that it's a
good idea, nor am I altogether convinced that this code is right.
Perhaps it is, and if we consensus on it, fine.  But regardless of
that, we should not send a third major release to beta with the
current broken system unless there is really no viable alternative.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Rework the way multixact truncations work

From
Andres Freund
Date:
On 2015-09-21 10:31:17 -0400, Robert Haas wrote:
> On Sun, Jul 5, 2015 at 3:16 PM, Andres Freund <andres@anarazel.de> wrote:
> >>On the other hand, in the common case, by the time we perform a
> >>restartpoint, we're consistent: I think the main exception to that is
> >>if we do a base backup that spans multiple checkpoints.  I think that
> >>in the new location, the chances that the legacy truncation is trying
> >>to read inconsistent data is probably higher.
> >
> > The primary problem isn't that we truncate too early, it's that we delay truncation on the standby in comparison to
theprimary by a considerable amount. All the while continuing to replay multi creations.
 
> >
> > I don't see the difference wrt. consistency right now, but I don't have access to the code right now. I mean we
*have*to do something while inconsistent. A start/stop backup can easily span a day or four.
 
> 
> So, where are we with this patch?

Uh. I'd basically been waiting on further review and then forgot about
it.


> In my opinion, we ought to do something about master and 9.5 before
> beta, so that we're doing *yet another* major release with unfixed
> multixact bugs.  Let's make the relevant truncation changes in master
> and 9.5 and bump the WAL page magic, so that a 9.5alpha standby can't
> be used with a 9.5beta master.  Then, we don't need any of this legacy
> truncation stuff at all, and 9.5 is hopefully in a much better state
> than 9.4 and 9.3.

Hm.

> Now, that still potentially leaves 9.4 and 9.3 users hanging out to
> dry.  But we don't have a tremendous number of those people clamoring
> about this, and if we get 9.5+ correct, then we can go and change the
> logic in 9.4 and 9.3 later when, and if, we are confident that's the
> right thing to do.  I am still not altogether convinced that it's a
> good idea, nor am I altogether convinced that this code is right.
> Perhaps it is, and if we consensus on it, fine.

To me the current logic is much worse than what's in the patch, so I
don't think that's the best way to go. But I'm not not absolutely gung
ho on that.

> But regardless of that, we should not send a third major release to
> beta with the current broken system unless there is really no viable
> alternative.

Agreed. I'll update the patch.

Greetings,

Andres Freund



Re: Rework the way multixact truncations work

From
Josh Berkus
Date:
On 09/21/2015 07:36 AM, Andres Freund wrote:
> On 2015-09-21 10:31:17 -0400, Robert Haas wrote:
>> So, where are we with this patch?
> 
> Uh. I'd basically been waiting on further review and then forgot about
> it.

Does the current plan to never expire XIDs in 9.6 affect multixact
truncation at all?

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: Rework the way multixact truncations work

From
Andres Freund
Date:
On 2015-09-21 10:30:59 -0700, Josh Berkus wrote:
> On 09/21/2015 07:36 AM, Andres Freund wrote:
> > On 2015-09-21 10:31:17 -0400, Robert Haas wrote:
> >> So, where are we with this patch?
> > 
> > Uh. I'd basically been waiting on further review and then forgot about
> > it.
> 
> Does the current plan to never expire XIDs in 9.6 affect multixact
> truncation at all?

I doubt that it'd in a meaningful manner. Truncations will still need to
happen to contain space usage.

Besides, I'm pretty sceptical of shaping the design of bug fixes to suit
some unwritten feature we only know the highest level design of as of
yet.

Greetings,

Andres Freund



Re: Rework the way multixact truncations work

From
Andres Freund
Date:
On 2015-07-02 11:52:04 -0400, Robert Haas wrote:
> On Mon, Jun 29, 2015 at 3:48 PM, Andres Freund <andres@anarazel.de> wrote:
> > New version attached.
> 
> 0002 looks good, but the commit message should perhaps mention the
> comment fix.  Or commit that separately.

I'm inclined to backpatch the applicable parts to 9.0 - seems pointless
to have differing autovacuum_freeze_max_age values and the current value
sucks for testing and space consumption there as well.

Greetings,

Andres Freund



Re: Rework the way multixact truncations work

From
Andres Freund
Date:
On 2015-09-21 16:36:03 +0200, Andres Freund wrote:
> Agreed. I'll update the patch.

Here's updated patches against master. These include the "legacy"
truncation support. There's no meaningful functional differences in this
version except addressing the review comments that I agreed with, and a
fair amount of additional polishing.

I've not:
* removed legacy truncation support
* removed SimpleLruFlush() from find_multixact_start() - imo it's easier
  to reason about the system when the disk state is in sync with the in
  memory state.
* removed the interval syntax from debug messages and xlogdump - they're
  a fair bit more concise and the target audience of those will be able
  to figure it out.
* unsplit DetermineSafeOldestOffset & SetOffsetVacuumLimit - imo the
  separate functions don't make sense anymore now that limits and
  truncations aren't as separate anymore.

What I've tested is the following:
* continous burning of multis, both triggered via members and offsets
* a standby keeping up when the primary is old
* a standby keeping up when the primary is new
* basebackups made while a new primary is under load
* verified that we properly PANIC when a truncation record is replayed
  in an old standby.

Does anybody have additional tests in mind?

I plan to push 0002 fairly soon, it seemed to be pretty
uncontroversial. I'll then work tomorrow afternoon on producing branch
specific versions of 0003 and on producing 0004 removing all the legacy
stuff for 9.5 & master.

Greetings,

Andres Freund

Attachment

Re: Rework the way multixact truncations work

From
Robert Haas
Date:
On Tue, Sep 22, 2015 at 9:20 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2015-09-21 16:36:03 +0200, Andres Freund wrote:
>> Agreed. I'll update the patch.
>
> Here's updated patches against master. These include the "legacy"
> truncation support. There's no meaningful functional differences in this
> version except addressing the review comments that I agreed with, and a
> fair amount of additional polishing.

0002 looks fine.

Regarding 0003, I'm still very much not convinced that it's a good
idea to apply this to 9.3 and 9.4.  This patch changes the way we do
truncation in those older releases; instead of happening at a
restartpoint, it happens when oldestMultiXid advances.  I admit that I
don't see a specific way that that can go wrong, but there are so many
different old versions with slightly different multixact truncation
behaviors that it seems very hard to be sure that we're not going to
make things worse rather than better by introducing yet another
approach to the problem.  I realize that you disagree and will
probably commit this to those branches anyway. But I want it to be
clear that I don't endorse that.

I wish more people were paying attention to these patches.  These are
critical data-corrupting bugs, the code in question is very tricky,
it's been majorly revised multiple times, and we're revising it again.
And nobody except me and Andres is looking at this, and I'm definitely
not smart enough to get this all right.

Other issues:
- If SlruDeleteSegment fails in unlink(), shouldn't we at the very
least log a message?  If that file is still there when we loop back
around, it's going to cause a failure, I think.

Assorted minor nitpicking:
- "happend" is misspelled in the commit message for 0003
- "in contrast to before" should have a comma after it, also in that
commit message
- "how far the next members wraparound is away" -> "how far away the
next members wraparound is"
- "seing" -> "seeing"
- "Upgrade the primary," -> "Upgrade the primary;"
- "toMultiXact" -> "to MultiXact"

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Rework the way multixact truncations work

From
Andres Freund
Date:
On 2015-09-22 13:38:58 -0400, Robert Haas wrote:
> Regarding 0003, I'm still very much not convinced that it's a good
> idea to apply this to 9.3 and 9.4.  This patch changes the way we do
> truncation in those older releases; instead of happening at a
> restartpoint, it happens when oldestMultiXid advances.

The primary reason for doing that is that doing it at restartpoints is
simply *wrong*. Restartpoints aren't scheduled in sync with replay -
which means that a restartpoint can (will actually) happen long long
after the checkpoint from the primary has replayed.  Which means that by
the time the restartpoint is performed it's actually not unlikely that
we've already filled all slru segments. Which is bad if we then fail
over/start up.

Aside from the more fundamental issue that restartpoints have to be
"asynchronous" with respect to the checkpoint record for performance
reasons, there's a bunch of additional reasons making this even more
likely to occur: Differing checkpoint segments on the standby and
pending actions (which we got rid off in 9.5+, but ...)

> I realize that you disagree and will probably commit this to those
> branches anyway. But I want it to be clear that I don't endorse that.

I don't plan to commit/backpatch this over your objection.

I do think it'd be the better approach, and I personally think that
we're much more likely to introduce bugs if we backpatch this in a
year. Which I think we'll end up having to. The longer people run on
these branches, the more issues we'll see.

> I wish more people were paying attention to these patches.

+many

> Other issues:
> - If SlruDeleteSegment fails in unlink(), shouldn't we at the very
> least log a message?  If that file is still there when we loop back
> around, it's going to cause a failure, I think.

The existing unlink() call doesn't, that's the only reason I didn't add
a message there. I'm fine with adding a (LOG or WARNING?) message.

Greetings,

Andres Freund



Re: Rework the way multixact truncations work

From
Alvaro Herrera
Date:
Robert Haas wrote:

> Regarding 0003, I'm still very much not convinced that it's a good
> idea to apply this to 9.3 and 9.4.  This patch changes the way we do
> truncation in those older releases; instead of happening at a
> restartpoint, it happens when oldestMultiXid advances.  I admit that I
> don't see a specific way that that can go wrong, but there are so many
> different old versions with slightly different multixact truncation
> behaviors that it seems very hard to be sure that we're not going to
> make things worse rather than better by introducing yet another
> approach to the problem.  I realize that you disagree and will
> probably commit this to those branches anyway. But I want it to be
> clear that I don't endorse that.

Noted.  I am not sure about changing things so invasively either TBH.
The interactions of this stuff with other parts of the system are very
complicated and it's easy to make a mistake that goes unnoticed until
some weird scenario is run elsewhere.  (Who would have thought that
things would fail when a basebackup takes 12 hours to take and you have
a custom preemptive tuple freeze script in crontab).

> I wish more people were paying attention to these patches.  These are
> critical data-corrupting bugs, the code in question is very tricky,
> it's been majorly revised multiple times, and we're revising it again.
> And nobody except me and Andres is looking at this, and I'm definitely
> not smart enough to get this all right.

I'm also looking, and yes it's tricky.

> Other issues:

It would be good to pgindent the code before producing back-branch
patches.  I think some comments will get changed.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Rework the way multixact truncations work

From
Robert Haas
Date:
On Tue, Sep 22, 2015 at 1:57 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2015-09-22 13:38:58 -0400, Robert Haas wrote:
>> Regarding 0003, I'm still very much not convinced that it's a good
>> idea to apply this to 9.3 and 9.4.  This patch changes the way we do
>> truncation in those older releases; instead of happening at a
>> restartpoint, it happens when oldestMultiXid advances.
>
> The primary reason for doing that is that doing it at restartpoints is
> simply *wrong*. Restartpoints aren't scheduled in sync with replay -
> which means that a restartpoint can (will actually) happen long long
> after the checkpoint from the primary has replayed.  Which means that by
> the time the restartpoint is performed it's actually not unlikely that
> we've already filled all slru segments. Which is bad if we then fail
> over/start up.

1. It would be possible to write a patch that included ONLY the
changes needed to make that happen, and did nothing else.  It would be
largely a subset of this.  If we want to change 9.3 and 9.4, I
recommend we do that first, and then come back to the rest of this.

2. I agree that what we're doing right now is wrong.  And I agree that
this fixes a real problem. But it seems to me to be quite possible,
even likely, that it will create other problems.

For example, suppose that there are files in the data directory that
precede oldestMultiXact. In the current approach, we'll remove those
because they're not in the range we expect to be used.  But in this
approach we no longer remove everything we think shouldn't be there.
We remove exactly the stuff we think should go away.  As a general
principle, that's clearly superior.  But in the back-branches, it
creates a risk: a leftover old file that doesn't get removed the first
time through - for whatever reason - becomes a time bomb that will
explode on the next wraparound.  I don't know that that will happen.
But I sure as heck don't know that won't happen with any combination
of the variously broken 9.3.X releases we've put out there.  Even if
you can prove that particular risk never materializes to your
satisfaction and mine, I will bet you a beer that there are other
possible hazards neither of us is foreseeing right now.

>> I realize that you disagree and will probably commit this to those
>> branches anyway. But I want it to be clear that I don't endorse that.
>
> I don't plan to commit/backpatch this over your objection.

I'm not in a position to demand that you take my advice, but I'm
telling you what I think as honestly as I know how.

To be clear, I am fully in favor of making these changes (without the
legacy truncation stuff) in 9.5 and master, bumping WAL page magic so
that we invalidate any 9.5 alpha standys.  I think it's a far more
solid approach than what we've got right now, and it clearly
eliminates a host of dangers.  In fact, I think it would be a pretty
stupid idea not to make these changes in those branches.  It would be
doubling down on a design we know can never be made robust.

But I do not have confidence that we can change 9.4 and especially 9.3
without knock-on consequences.  You may have that confidence.  I most
definitely do not.  My previous two rounds in the boxing ring with
this problem convinced me that (1) it's incredibly easy to break
things with well-intentioned changes in this area, (2) it's
practically impossible to foresee everything that might go wrong with
some screwy combination of versions, and (3) early 9.3.X releases are
in much worse shape than early 9.4.X releases, to the point where
guessing what any given variable is going to contain on 9.3.X is
essentially throwing darts at the wall.  That's an awfully challenging
environment in which to write a bullet-proof patch.

>> - If SlruDeleteSegment fails in unlink(), shouldn't we at the very
>> least log a message?  If that file is still there when we loop back
>> around, it's going to cause a failure, I think.
>
> The existing unlink() call doesn't, that's the only reason I didn't add
> a message there. I'm fine with adding a (LOG or WARNING?) message.

Cool.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Rework the way multixact truncations work

From
Andres Freund
Date:
On 2015-09-22 14:52:49 -0400, Robert Haas wrote:
> 1. It would be possible to write a patch that included ONLY the
> changes needed to make that happen, and did nothing else.  It would be
> largely a subset of this.  If we want to change 9.3 and 9.4, I
> recommend we do that first, and then come back to the rest of this.

I think that patch would be pretty much what I wrote.

To be correct you basically have to:
1) Never skip a truncation on the standby. Otherwise there might have  already have been wraparound and you read the
completelywrong  offset.
 
2) Always perform truncations on the standby exactly the same moment (in  the replay stream) as on the primary.
Otherwisethere also can be a  wraparound.
 
3) Never read anything from an SLRU from the data directory while  inconsistent. In an inconsistent state we can read
completelywrong  data. A standby can be inconsistent in many situations, including  crashes, restarts and fresh base
backups.

To me these three together leave only the option to never read an SLRUs
contents on a standby.  That only leaves minor changes in the patch that
could be removed afaics. I mean we could leave in
DetermineSafeOldestOffset() but it'd be doing pretty much the same as
SetOffsetVacuumLimit().


I think we put at least three layers on bandaid on this issue since
9.3.2, and each layer made things more complicated. We primarily did so
because of the compatibility and complexity concerns. I think that was a
bad mistake. We should have done it mostly right back then, and we'd be
better of now. If we continue with bandaids on the back branches while
having a fixed 9.5+ with significantly different behaviour we'll have a
hellish time fixing things in the back branches. And introduce more bugs
than this might introduce.


> 2. I agree that what we're doing right now is wrong.  And I agree that
> this fixes a real problem. But it seems to me to be quite possible,
> even likely, that it will create other problems.

Possible. But I think those bugs will be just bugs and not more
fundamental architectural problems.

To be very clear. I'm scared of the idea of backpatching this. I'm more
scared of doing that myself. But even more I am scared of the current
state.


> For example, suppose that there are files in the data directory that
> precede oldestMultiXact. In the current approach, we'll remove those
> because they're not in the range we expect to be used.

Hm. For offsets/ we continue to use SimpleLruTruncate() for truncation,
which scans the directory, so I don't see a problem. For members/ we
won't - but neither do we really today, see
SlruScanDirCbRemoveMembers(). So I don't think there'll be a significant
difference?


> a leftover old file that doesn't get removed the first time through -
> for whatever reason - becomes a time bomb that will explode on the
> next wraparound.  I don't know that that will happen.

We should be able to deal with that, otherwise recovery is pretty
borked. It can be a problem for the 'recovery from wrong oldest multi'
case, but that's the same today.


> I will bet you a beer that there are other possible hazards neither of
> us is foreseeing right now.

Right. I'm not dismissing that. I just think it's much more likely to be
handleable problems than the set we have today. It's incredibly hard to
get an accurate mental model of the combined behaviour & state of
primary and standby today. Even if we three have that today, I'm pretty
sure we won't in half a year. And sure as hell nearly nobody else will
have one.


> >> - If SlruDeleteSegment fails in unlink(), shouldn't we at the very
> >> least log a message?  If that file is still there when we loop back
> >> around, it's going to cause a failure, I think.
> >
> > The existing unlink() call doesn't, that's the only reason I didn't add
> > a message there. I'm fine with adding a (LOG or WARNING?) message.
> 
> Cool.

Hm. When redoing a truncation during [crash] recovery that can cause a
host of spurious warnings if already done before. DEBUG1 to avoid
scaring users?

Greetings,

Andres Freund



Re: Rework the way multixact truncations work

From
Andres Freund
Date:
On 2015-09-23 01:24:31 +0200, Andres Freund wrote:
> I think we put at least three layers on bandaid on this issue since
> 9.3.2, and each layer made things more complicated.

2a9b01928f193f529b885ac577051c4fd00bd427 - Cope with possible failure of the oldest MultiXact to exist.
5bbac7ec1b5754043e073a45454e4c257512ce30 - Advance the stop point for multixact offset creation only at checkpoint.
9a28c3752c89ec01fb8b28bb5904c6d547507fda - Have multixact be truncated by checkpoint, not vacuum
215ac4ad6589e0f6a31cc4cd867aedba3cd42924 - Truncate pg_multixact/'s contents during crash recovery

At least these are closely related to the fact that truncation isn't WAL
logged. There are more that are tangentially related. We (primarily me,
writing the timewise first one) should have gone for a new WAL record
from the start. We've discussed that in at least three of the threads
around the above commits...



Re: Rework the way multixact truncations work

From
Robert Haas
Date:
On Tue, Sep 22, 2015 at 7:45 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2015-09-23 01:24:31 +0200, Andres Freund wrote:
>> I think we put at least three layers on bandaid on this issue since
>> 9.3.2, and each layer made things more complicated.
>
> 2a9b01928f193f529b885ac577051c4fd00bd427 - Cope with possible failure of the oldest MultiXact to exist.
> 5bbac7ec1b5754043e073a45454e4c257512ce30 - Advance the stop point for multixact offset creation only at checkpoint.
> 9a28c3752c89ec01fb8b28bb5904c6d547507fda - Have multixact be truncated by checkpoint, not vacuum
> 215ac4ad6589e0f6a31cc4cd867aedba3cd42924 - Truncate pg_multixact/'s contents during crash recovery
>
> At least these are closely related to the fact that truncation isn't WAL
> logged. There are more that are tangentially related. We (primarily me,
> writing the timewise first one) should have gone for a new WAL record
> from the start. We've discussed that in at least three of the threads
> around the above commits...

I'm not disagreeing with any of that.  I'm just disagreeing with you
on the likelihood that we're going to make things better vs. making
them worse.  But, really, I've said everything I have to say about
this.  You have a commit bit.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Rework the way multixact truncations work

From
Andres Freund
Date:
On 2015-09-22 20:14:11 -0400, Robert Haas wrote:
> I'm not disagreeing with any of that.  I'm just disagreeing with you
> on the likelihood that we're going to make things better vs. making
> them worse.  But, really, I've said everything I have to say about
> this.  You have a commit bit.

I'm not going to push backpatch this to 9.3/9.4 without you being on
board. For that I think you're unfortunately too often right, and this
is too critical. But I'm also not going to develop an alternative
stopgap for those versions, since I have no clue how that'd end up being
better.

The only alternative proposal I have right now is to push this to
9.5/9.6 (squashed with a followup patch removing legacy truncations) and
then push the patch including legacy stuff to 9.3/4 after the next set
of releases.

Greetings,

Andres Freund



Re: Rework the way multixact truncations work

From
Alvaro Herrera
Date:
> @@ -1210,8 +1211,14 @@ restart:;
>      (void) SlruScanDirectory(ctl, SlruScanDirCbDeleteCutoff, &cutoffPage);
>  }
>  
> -void
> -SlruDeleteSegment(SlruCtl ctl, char *filename)
> +/*
> + * Delete an individual SLRU segment, identified by the filename.
> + *
> + * NB: This does not touch the SLRU buffers themselves, callers have to ensure
> + * they either can't yet contain anything, or have already been cleaned out.
> + */
> +static void
> +SlruInternalDeleteSegment(SlruCtl ctl, char *filename)
>  {
>      char        path[MAXPGPATH];
>  
> @@ -1222,6 +1229,64 @@ SlruDeleteSegment(SlruCtl ctl, char *filename)
>  }
>  
>  /*
> + * Delete an individual SLRU segment, identified by the segment number.
> + */
> +void
> +SlruDeleteSegment(SlruCtl ctl, int segno)

Is it okay to change the ABI of SlruDeleteSegment?

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Rework the way multixact truncations work

From
Andres Freund
Date:
On 2015-09-23 10:29:09 -0300, Alvaro Herrera wrote:
> >  /*
> > + * Delete an individual SLRU segment, identified by the segment number.
> > + */
> > +void
> > +SlruDeleteSegment(SlruCtl ctl, int segno)
> 
> Is it okay to change the ABI of SlruDeleteSegment?

I think so. Any previous user of the API is going to be currently broken
anyway due to the missing flushing of buffers.

Andres



Re: Rework the way multixact truncations work

From
Alvaro Herrera
Date:
The comment on top of TrimMultiXact states that "no locks are needed
here", but then goes on to grab a few locks.  I think we should remove
the comment, or rephrase it to state that we still grab them for
consistency or whatever; or perhaps even remove the lock acquisitions.
(I think the comment is still true: by the time TrimMultiXact runs,
we're out of recovery but not yet running, so it's not possible for
anyone to try to do anything multixact-related.)

I wonder if it would be cleaner to move the setting of finishedStartup
down to just before calling SetMultiXactIdLimit, instead of at the top
of the function.

It's a bit odd that SetMultiXactIdLimit has the "finishedStartup" test
so low.  Why bother setting all those local variables only to bail out?
I think it would make more sense to just do it at the top.  The only
thing you lose AFAICS is that elog(DEBUG1) message -- is that worth it?
Also, the fact that finishedStartup itself is read without a lock at
least merits a comment.

In MultiXactAdvanceOldest, the test for sawTruncationinCkptCycle seems
reversed?    if (!MultiXactState->sawTruncationInCkptCycle)
surely we should be doing truncation if it's set?

Honestly, I wonder whether this message        ereport(LOG,                (errmsg("performing legacy multixact
truncation"),                errdetail("Legacy truncations are sometimes performed when replaying WAL from an older
primary."),                errhint("Upgrade the primary, it is susceptible to data corruption.")));
 
shouldn't rather be a PANIC.  (The main reason not to, I think, is that
once you see this, there is no way to put the standby in a working state
without recloning).

I think the prevOldestOffsetKnown test in line 2667 ("if we failed to
get ...") is better expressed as an else-if of the previous "if" block.

I think the two "there are NO MultiXacts" cases in TruncateMultiXact
would benefit in readability from adding braces around the lone
statement (and moving the comment to the line prior).

If the find_multixact_start(oldestMulti) call in TruncateMultiXact
fails, what recourse does the user have?  I wonder if the elog() should
be a FATAL instead of just LOG.  It's not like it would work on a
subsequent run, is it?

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Rework the way multixact truncations work

From
Andres Freund
Date:
On 2015-09-23 15:03:05 -0300, Alvaro Herrera wrote:
> The comment on top of TrimMultiXact states that "no locks are needed
> here", but then goes on to grab a few locks.

Hm. Yea. Although that was the case before.

> It's a bit odd that SetMultiXactIdLimit has the "finishedStartup" test
> so low.  Why bother setting all those local variables only to bail
> out?

Hm. Doesn't seem to matter much to me, but I can change it.

> In MultiXactAdvanceOldest, the test for sawTruncationinCkptCycle seems
> reversed?
>         if (!MultiXactState->sawTruncationInCkptCycle)
> surely we should be doing truncation if it's set?

No, that's correct. If there was a checkpoint cycle where oldestMulti
advanced without seing a truncation record we need to perform a legacy
truncation.

> Honestly, I wonder whether this message
>             ereport(LOG,
>                     (errmsg("performing legacy multixact truncation"),
>                      errdetail("Legacy truncations are sometimes performed when replaying WAL from an older
primary."),
>                      errhint("Upgrade the primary, it is susceptible to data corruption.")));
> shouldn't rather be a PANIC.  (The main reason not to, I think, is that
> once you see this, there is no way to put the standby in a working state
> without recloning).

Huh? The behaviour in that case is still better than what we have in
9.3+ today (not delayed till the restartpoint). Don't see why that
should be a panic. That'd imo make it pretty much impossible to upgrade
a pair of primary/master where you normally upgrade the standby first?

This is all moot given Robert's objection to backpatching this to
9.3/4.

> If the find_multixact_start(oldestMulti) call in TruncateMultiXact
> fails, what recourse does the user have?  I wonder if the elog() should
> be a FATAL instead of just LOG.  It's not like it would work on a
> subsequent run, is it?

It currently only LOGs, I don't want to change that. The cases where we
currently know it's possible to hit this, it should be fixed by the next
set of emergency autovacuums (which we trigger).

Thanks for the look,

Andres



Re: Rework the way multixact truncations work

From
Alvaro Herrera
Date:
Andres Freund wrote:
> On 2015-09-23 15:03:05 -0300, Alvaro Herrera wrote:

> > Honestly, I wonder whether this message
> >             ereport(LOG,
> >                     (errmsg("performing legacy multixact truncation"),
> >                      errdetail("Legacy truncations are sometimes performed when replaying WAL from an older
primary."),
> >                      errhint("Upgrade the primary, it is susceptible to data corruption.")));
> > shouldn't rather be a PANIC.  (The main reason not to, I think, is that
> > once you see this, there is no way to put the standby in a working state
> > without recloning).
> 
> Huh? The behaviour in that case is still better than what we have in
> 9.3+ today (not delayed till the restartpoint). Don't see why that
> should be a panic. That'd imo make it pretty much impossible to upgrade
> a pair of primary/master where you normally upgrade the standby first?
> 
> This is all moot given Robert's objection to backpatching this to
> 9.3/4.

I think we need to make a decision here.  Is this a terribly serious
bug/misdesign that needs addressing?  If so, we need to backpatch.  If
not, then by all means lets leave it alone.  I don't think it is a good
idea to leave it open if we think it's serious, which is what I think is
happening.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Rework the way multixact truncations work

From
Andres Freund
Date:
On 2015-09-23 15:57:02 -0300, Alvaro Herrera wrote:
> I think we need to make a decision here.  Is this a terribly serious
> bug/misdesign that needs addressing?

Imo yes. Not sure about terribly, but definitely serious. It's several
data loss bugs in one package.

> If so, we need to backpatch.  If not, then by all means lets leave it
> alone.  I don't think it is a good idea to leave it open if we think
> it's serious, which is what I think is happening.

Right, but I don't want to backpatch this over an objection, and it
doesn't seem like I have a chance to convince Robert that it'd be a good
idea. So it'll be 9.5+master for now.

Greetings,

Andres Freund



Re: Rework the way multixact truncations work

From
Andres Freund
Date:
Hi,

On 2015-09-23 15:03:05 -0300, Alvaro Herrera wrote:
> I wonder if it would be cleaner to move the setting of finishedStartup
> down to just before calling SetMultiXactIdLimit, instead of at the top
> of the function.

Done. I don't think it makes much of a difference, but there's really no
reason not to change it.

> It's a bit odd that SetMultiXactIdLimit has the "finishedStartup" test
> so low.  Why bother setting all those local variables only to bail
> out?

But we do more than set local variables? We actually set some in-memory
limits (MultiXactState->multiVacLimit et al). What we can't do is to
startup the members wraparound protection because that requires
accessing the SLRUs.

Perhaps we should, independently of this patch really, rename
SetOffsetVacuumLimit() - it may be rather confusing that it actually is
about members/? The current name is correct, but also a bit
confusing. ComputeMembersVacuumLimits()?

> In MultiXactAdvanceOldest, the test for sawTruncationinCkptCycle seems
> reversed?
>         if (!MultiXactState->sawTruncationInCkptCycle)
> surely we should be doing truncation if it's set?

I wanted to add a comment explaining this, but the existing job seems to
do a fair job at that:
        /*
         * If there has been a truncation on the master, detected by seeing a
         * moving oldestMulti, without a corresponding truncation record, we
         * know that the primary is still running an older version of postgres
         * that doesn't yet log multixact truncations. So perform the
         * truncation ourselves.
         */

I've done some additional comment smithing.

Attached is 0002 (prev 0003) including the legacy truncation support,
and 0003 removing that and bumping page magic. I'm slightly inclined to
commit them separately (to 9.5 & master) so that we have something to
backpatch from.

Greetings,

Andres Freund

Attachment

Re: Rework the way multixact truncations work

From
Andres Freund
Date:
Hi,

I pushed this to 9.5 and master, committing the xlog page magic bump
separately. To avoid using a magic value from master in 9.5 I bumped the
numbers by two in both branches.

Should this get a release note entry given that we're not (at least
immediately) backpatching this?

Greetings,

Andres Freund



Re: Rework the way multixact truncations work

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> Should this get a release note entry given that we're not (at least
> immediately) backpatching this?

I'll probably put something in when I update the release notes for beta1
(next week sometime); no real need to deal with it individually.
        regards, tom lane



Re: Rework the way multixact truncations work

From
Jim Nasby
Date:
On 9/23/15 1:48 PM, Andres Freund wrote:
>> Honestly, I wonder whether this message
>> >            ereport(LOG,
>> >                    (errmsg("performing legacy multixact truncation"),
>> >                     errdetail("Legacy truncations are sometimes performed when replaying WAL from an older
primary."),
>> >                     errhint("Upgrade the primary, it is susceptible to data corruption.")));
>> >shouldn't rather be a PANIC.  (The main reason not to, I think, is that
>> >once you see this, there is no way to put the standby in a working state
>> >without recloning).
> Huh? The behaviour in that case is still better than what we have in
> 9.3+ today (not delayed till the restartpoint). Don't see why that
> should be a panic. That'd imo make it pretty much impossible to upgrade
> a pair of primary/master where you normally upgrade the standby first?

IMHO doing just a log of something this serious; it should at least be a 
WARNING.

I think the concern about upgrading a replica before the master is 
valid; is there some way we could over-ride a PANIC when that's exactly 
what someone is trying to do? Check for a special file maybe?

+    bool        sawTruncationInCkptCycle;
What happens if someone downgrades the master, back to a version that no 
longer logs truncation? (I don't think assuming that the replica will 
need to restart if that happens is a safe bet...)

-    if (MultiXactIdPrecedes(oldestMXact, earliest))
+    /* If there's nothing to remove, we can bail out early. */
+    if (MultiXactIdPrecedes(oldestMulti, earliest))     {
-        DetermineSafeOldestOffset(oldestMXact);
+        LWLockRelease(MultiXactTruncationLock);
If/when this is backpatched, would it be safer to just leave this alone?
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: Rework the way multixact truncations work

From
Andres Freund
Date:
On 2015-09-27 14:21:08 -0500, Jim Nasby wrote:
> IMHO doing just a log of something this serious; it should at least be a
> WARNING.

In postgres LOG, somewhat confusingly, is more severe than WARNING.

> I think the concern about upgrading a replica before the master is valid; is
> there some way we could over-ride a PANIC when that's exactly what someone
> is trying to do? Check for a special file maybe?

I don't understand this concern - that's just the situation we have in
all released branches today.

> +    bool        sawTruncationInCkptCycle;
> What happens if someone downgrades the master, back to a version that no
> longer logs truncation? (I don't think assuming that the replica will need
> to restart if that happens is a safe bet...)

It'll just to do legacy truncation again - without a restart on the
standby required.

> -    if (MultiXactIdPrecedes(oldestMXact, earliest))
> +    /* If there's nothing to remove, we can bail out early. */
> +    if (MultiXactIdPrecedes(oldestMulti, earliest))
>      {
> -        DetermineSafeOldestOffset(oldestMXact);
> +        LWLockRelease(MultiXactTruncationLock);
> If/when this is backpatched, would it be safer to just leave this alone?

What do you mean? This can't just isolated be left alone?



Re: Rework the way multixact truncations work

From
Jim Nasby
Date:
On 9/27/15 2:25 PM, Andres Freund wrote:
> On 2015-09-27 14:21:08 -0500, Jim Nasby wrote:
>> IMHO doing just a log of something this serious; it should at least be a
>> WARNING.
>
> In postgres LOG, somewhat confusingly, is more severe than WARNING.

Ahh, right. Which in this case stinks, because WARNING is a lot more 
attention grabbing than LOG. :/

>> I think the concern about upgrading a replica before the master is valid; is
>> there some way we could over-ride a PANIC when that's exactly what someone
>> is trying to do? Check for a special file maybe?
>
> I don't understand this concern - that's just the situation we have in
> all released branches today.

There was discussion about making this a PANIC instead of a LOG, which I 
think is a good idea... but then there'd need to be some way to not 
PANIC if you were doing an upgrade.

>> +    bool        sawTruncationInCkptCycle;
>> What happens if someone downgrades the master, back to a version that no
>> longer logs truncation? (I don't think assuming that the replica will need
>> to restart if that happens is a safe bet...)
>
> It'll just to do legacy truncation again - without a restart on the
> standby required.

Oh, I thought once that was set it would stay set. NM.

>> -    if (MultiXactIdPrecedes(oldestMXact, earliest))
>> +    /* If there's nothing to remove, we can bail out early. */
>> +    if (MultiXactIdPrecedes(oldestMulti, earliest))
>>       {
>> -        DetermineSafeOldestOffset(oldestMXact);
>> +        LWLockRelease(MultiXactTruncationLock);
>> If/when this is backpatched, would it be safer to just leave this alone?
>
> What do you mean? This can't just isolated be left alone?

I thought removing DetermineSafeOldestOffset was just an optimization, 
but I guess I was confused.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: Rework the way multixact truncations work

From
Robert Haas
Date:
On Mon, Sep 28, 2015 at 5:47 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> There was discussion about making this a PANIC instead of a LOG, which I
> think is a good idea... but then there'd need to be some way to not PANIC if
> you were doing an upgrade.

I think you're worrying about a non-problem.  This code has not been
back-patched prior to 9.5, and the legacy truncation code has been
removed in 9.5+.  So it's a complete non-issue right at the moment.
If at some point we back-patch this further, then it potentially
becomes a live issue, but I would like to respectfully inquire what
exactly you think making it a PANIC would accomplish?  There are a lot
of scary things about this patch, but the logic for deciding whether
to perform a legacy truncation is solid as far as I know.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Rework the way multixact truncations work

From
Jim Nasby
Date:
On 9/28/15 8:49 PM, Robert Haas wrote:
> If at some point we back-patch this further, then it potentially
> becomes a live issue, but I would like to respectfully inquire what
> exactly you think making it a PANIC would accomplish?  There are a lot
> of scary things about this patch, but the logic for deciding whether
> to perform a legacy truncation is solid as far as I know.

Maybe I'm confused, but I thought the whole purpose of this was to get 
rid of the risk associated with that calculation in favor of explicit 
truncation boundaries in the WAL log.

Even if that's not the case, ISTM that being big and in your face about 
a potential data corruption bug is a good thing, as long as the DBA has 
a way to "hit the snooze button".

Either way, I'm not going to make a fuss over it.

Just to make sure we're on the same page; Alvaro's original comment was:
> Honestly, I wonder whether this message
>             ereport(LOG,
>                     (errmsg("performing legacy multixact truncation"),
>                      errdetail("Legacy truncations are sometimes performed when replaying WAL from an older
primary."),
>                      errhint("Upgrade the primary, it is susceptible to data corruption.")));
> shouldn't rather be a PANIC.  (The main reason not to, I think, is that
> once you see this, there is no way to put the standby in a working state
> without recloning).
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: Rework the way multixact truncations work

From
Andres Freund
Date:
On 2015-09-28 21:48:00 -0500, Jim Nasby wrote:
> On 9/28/15 8:49 PM, Robert Haas wrote:
> >If at some point we back-patch this further, then it potentially
> >becomes a live issue, but I would like to respectfully inquire what
> >exactly you think making it a PANIC would accomplish?  There are a lot
> >of scary things about this patch, but the logic for deciding whether
> >to perform a legacy truncation is solid as far as I know.
> 
> Maybe I'm confused, but I thought the whole purpose of this was to get rid
> of the risk associated with that calculation in favor of explicit truncation
> boundaries in the WAL log.

> Even if that's not the case, ISTM that being big and in your face about a
> potential data corruption bug is a good thing, as long as the DBA has a way
> to "hit the snooze button".

So we'd end up with a guc that everyone has to set while they
upgrade. That seems like a pointless hassle.

Greetings,

Andres Freund



Re: Rework the way multixact truncations work

From
Robert Haas
Date:
On Mon, Sep 28, 2015 at 10:48 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> Maybe I'm confused, but I thought the whole purpose of this was to get rid
> of the risk associated with that calculation in favor of explicit truncation
> boundaries in the WAL log.

Yes.  But if the master hasn't been updated yet, then we still need to
do something based on a calculation.

> Even if that's not the case, ISTM that being big and in your face about a
> potential data corruption bug is a good thing, as long as the DBA has a way
> to "hit the snooze button".

Panicking the standby because the master hasn't been updated does not
seem like a good thing to me in any way.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Rework the way multixact truncations work

From
Joel Jacobson
Date:
On Tue, Sep 22, 2015 at 3:20 PM, Andres Freund <andres@anarazel.de> wrote:
> What I've tested is the following:
> * continous burning of multis, both triggered via members and offsets
> * a standby keeping up when the primary is old
> * a standby keeping up when the primary is new
> * basebackups made while a new primary is under load
> * verified that we properly PANIC when a truncation record is replayed
>   in an old standby.

Are these test scripts available somewhere?
I understand they might be undocumented and perhaps tricky to set it all up,
but I would be very interested in them anyway,
think you could push them somewhere?

Thanks a lot for working on this!



Re: Rework the way multixact truncations work

From
Alvaro Herrera
Date:
Robert Haas wrote:
> On Mon, Sep 28, 2015 at 10:48 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> > Maybe I'm confused, but I thought the whole purpose of this was to get rid
> > of the risk associated with that calculation in favor of explicit truncation
> > boundaries in the WAL log.
> 
> Yes.  But if the master hasn't been updated yet, then we still need to
> do something based on a calculation.

Right.

> > Even if that's not the case, ISTM that being big and in your face about a
> > potential data corruption bug is a good thing, as long as the DBA has a way
> > to "hit the snooze button".
> 
> Panicking the standby because the master hasn't been updated does not
> seem like a good thing to me in any way.

If we had a way to force the master to upgrade, I think it would be good
because we have a mechanism to get rid of the legacy truncation code;
but as I said several messages ago this doesn't actually work which is
why I dropped the idea of panicking.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Rework the way multixact truncations work

From
Noah Misch
Date:
I'm several days into a review of this change (commits 4f627f8 and aa29c1c).
There's one part of the design I want to understand before commenting on
specific code.  What did you anticipate to be the consequences of failing to
remove SLRU segment files that MultiXactState->oldestMultiXactId implies we
should have removed?  I ask because, on the one hand, I see code making
substantial efforts to ensure that it truncates exactly as planned:
/* * Do truncation, and the WAL logging of the truncation, in a critical * section. That way offsets/members cannot get
outof sync anymore, i.e. * once consistent the newOldestMulti will always exist in members, even * if we crashed in the
wrongmoment. */START_CRIT_SECTION();
 
/* * Prevent checkpoints from being scheduled concurrently. This is critical * because otherwise a truncation record
mightnot be replayed after a * crash/basebackup, even though the state of the data directory would * require it.
*/Assert(!MyPgXact->delayChkpt);MyPgXact->delayChkpt= true;
 
.../* * Update in-memory limits before performing the truncation, while inside * the critical section: Have to do it
beforetruncation, to prevent * concurrent lookups of those values. Has to be inside the critical * section as otherwise
afuture call to this function would error out, * while looking up the oldest member in offsets, if our caller crashes *
beforeupdating the limits. */
 

On the other hand, TruncateMultiXact() callees ignore unlink() return values:

On Tue, Sep 22, 2015 at 07:57:27PM +0200, Andres Freund wrote:
> On 2015-09-22 13:38:58 -0400, Robert Haas wrote:
> > - If SlruDeleteSegment fails in unlink(), shouldn't we at the very
> > least log a message?  If that file is still there when we loop back
> > around, it's going to cause a failure, I think.
> 
> The existing unlink() call doesn't, that's the only reason I didn't add
> a message there. I'm fine with adding a (LOG or WARNING?) message.

Unlinking old pg_clog files is strictly an optimization.  If you were to
comment out every unlink() call in slru.c, the only ill effect on CLOG is the
waste of disk space.  Is the same true of MultiXact?

If there's anyplace where failure to unlink would cause a malfunction, I think
it would be around the use of SlruScanDirCbFindEarliest().  That function's
result becomes undefined if the range of pg_multixact/offsets segment files
present on disk spans more than about INT_MAX/2 MultiXactId.

Thanks,
nm



Re: Rework the way multixact truncations work

From
Andres Freund
Date:
Hi,

On 2015-10-24 22:07:00 -0400, Noah Misch wrote:
> I'm several days into a review of this change (commits 4f627f8 and
> aa29c1c).

Cool!

> There's one part of the design I want to understand before commenting on
> specific code.  What did you anticipate to be the consequences of failing to
> remove SLRU segment files that MultiXactState->oldestMultiXactId implies we
> should have removed?  I ask because, on the one hand, I see code making
> substantial efforts to ensure that it truncates exactly as planned:

> [portion of TruncateMultiXact]

The reason we can't have checkpoints there, is that checkpoints records
multixact values in the checkpoint record. If we crash-restart before
the truncation has finished we can end up in the situation that
->oldestMultiXactId doesn't exist. Which will trigger a round of
emergency vacuum at the next startup, not something that should happen
due to a concurrency problem.

We could instead update the in-memory values first, but that could lead
to other problems.


So the critical section/delaying of checkpoints is more about having the
on-disk agreeing with the status data in the checkpoint/control file.


> On Tue, Sep 22, 2015 at 07:57:27PM +0200, Andres Freund wrote:
> > On 2015-09-22 13:38:58 -0400, Robert Haas wrote:
> > > - If SlruDeleteSegment fails in unlink(), shouldn't we at the very
> > > least log a message?  If that file is still there when we loop back
> > > around, it's going to cause a failure, I think.
> > 
> > The existing unlink() call doesn't, that's the only reason I didn't add
> > a message there. I'm fine with adding a (LOG or WARNING?) message.

Note that I didn't add the warning after all, as it'd be too noisy
during repeated replay, as all the files would already be gone. We could
only emit it when the error is not ENOFILE, if people prefer that.


> Unlinking old pg_clog files is strictly an optimization.  If you were to
> comment out every unlink() call in slru.c, the only ill effect on CLOG is the
> waste of disk space.  Is the same true of MultiXact?

Well, multixacts are a lot larger than the other SLRUs, I think that
makes some sort of difference.


Thanks,

Andres



Re: Rework the way multixact truncations work

From
Josh Berkus
Date:
On 10/27/2015 07:44 AM, Andres Freund wrote:
>> Unlinking old pg_clog files is strictly an optimization.  If you were to
>> > comment out every unlink() call in slru.c, the only ill effect on CLOG is the
>> > waste of disk space.  Is the same true of MultiXact?
> Well, multixacts are a lot larger than the other SLRUs, I think that
> makes some sort of difference.

And by "a lot larger" we're talking like 50X to 100X.  I regularly see
pg_multixact directories larger than 1GB.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: Rework the way multixact truncations work

From
Noah Misch
Date:
On Tue, Oct 27, 2015 at 03:44:10PM +0100, Andres Freund wrote:
> On 2015-10-24 22:07:00 -0400, Noah Misch wrote:
> > On Tue, Sep 22, 2015 at 07:57:27PM +0200, Andres Freund wrote:
> > > On 2015-09-22 13:38:58 -0400, Robert Haas wrote:
> > > > - If SlruDeleteSegment fails in unlink(), shouldn't we at the very
> > > > least log a message?  If that file is still there when we loop back
> > > > around, it's going to cause a failure, I think.
> > > 
> > > The existing unlink() call doesn't, that's the only reason I didn't add
> > > a message there. I'm fine with adding a (LOG or WARNING?) message.
> 
> Note that I didn't add the warning after all, as it'd be too noisy
> during repeated replay, as all the files would already be gone. We could
> only emit it when the error is not ENOFILE, if people prefer that.
> 
> 
> > Unlinking old pg_clog files is strictly an optimization.  If you were to
> > comment out every unlink() call in slru.c, the only ill effect on CLOG is the
> > waste of disk space.  Is the same true of MultiXact?
> 
> Well, multixacts are a lot larger than the other SLRUs, I think that
> makes some sort of difference.

That helps; thanks.  Your design seems good.  I've located only insipid
defects.  I propose to save some time by writing a patch series eliminating
them, which you could hopefully review.  Does that sound good?



Re: Rework the way multixact truncations work

From
Andres Freund
Date:
Hi,

On October 29, 2015 7:59:03 AM GMT+01:00, Noah Misch <noah@leadboat.com> wrote:
>On Tue, Oct 27, 2015 at 03:44:10PM +0100, Andres Freund wrote:
>> On 2015-10-24 22:07:00 -0400, Noah Misch wrote:
>> > On Tue, Sep 22, 2015 at 07:57:27PM +0200, Andres Freund wrote:
>> > > On 2015-09-22 13:38:58 -0400, Robert Haas wrote:
>> > > > - If SlruDeleteSegment fails in unlink(), shouldn't we at the
>very
>> > > > least log a message?  If that file is still there when we loop
>back
>> > > > around, it's going to cause a failure, I think.
>> > > 
>> > > The existing unlink() call doesn't, that's the only reason I
>didn't add
>> > > a message there. I'm fine with adding a (LOG or WARNING?)
>message.
>> 
>> Note that I didn't add the warning after all, as it'd be too noisy
>> during repeated replay, as all the files would already be gone. We
>could
>> only emit it when the error is not ENOFILE, if people prefer that.
>> 
>> 
>> > Unlinking old pg_clog files is strictly an optimization.  If you
>were to
>> > comment out every unlink() call in slru.c, the only ill effect on
>CLOG is the
>> > waste of disk space.  Is the same true of MultiXact?
>> 
>> Well, multixacts are a lot larger than the other SLRUs, I think that
>> makes some sort of difference.
>
>That helps; thanks.  Your design seems good.  I've located only insipid
>defects.

Great!

> I propose to save some time by writing a patch series
>eliminating
>them, which you could hopefully review.  Does that sound good?

Yes, it does.

Andres

--- 
Please excuse brevity and formatting - I am writing this on my mobile phone.



Re: Rework the way multixact truncations work

From
Noah Misch
Date:
On Thu, Oct 29, 2015 at 08:46:52AM +0100, Andres Freund wrote:
> On October 29, 2015 7:59:03 AM GMT+01:00, Noah Misch <noah@leadboat.com> wrote:
> >That helps; thanks.  Your design seems good.  I've located only insipid
> >defects.
>
> Great!
>
> > I propose to save some time by writing a patch series
> >eliminating
> >them, which you could hopefully review.  Does that sound good?
>
> Yes, it does.

I have pushed a stack of branches to https://github.com/nmisch/postgresql.git:

mxt0-revert - reverts commits 4f627f8 and aa29c1c
mxt1-disk-independent - see below
mxt2-cosmetic - update already-wrong comments and formatting
mxt3-main - replaces commit 4f627f8
mxt4-rm-legacy - replaces commit aa29c1c

The plan is to squash each branch into one PostgreSQL commit.  In addition to
examining overall "git diff mxt2-cosmetic mxt3-main", I recommend reviewing
itemized changes and commit log entries in "git log -p --reverse --no-merges
mxt2-cosmetic..mxt3-main".  In particular, when a change involves something
you discussed upthread or was otherwise not obvious, I put a statement of
rationale in the commit log.

> +     * Make sure to only attempt truncation if there's values to truncate
> +     * away. In normal processing values shouldn't go backwards, but there's
> +     * some corner cases (due to bugs) where that's possible.

Which bugs are those?  I would like to include more detail if available.


If anything here requires careful study, it's the small mxt1-disk-independent
change, which I have also attached in patch form.  That patch removes the
SlruScanDirCbFindEarliest() test from TruncateMultiXact(), which in turn makes
multixact control decisions independent of whether TruncateMultiXact() is
successful at unlinking segments.  Today, one undeletable segment file can
cause TruncateMultiXact() to give up on truncation completely for a span of
hundreds of millions of MultiXactId.  Patched multixact.c will, like CLOG,
make its decisions strictly based on the content of files it expects to exist.
It will no longer depend on the absence of files it hopes will not exist.

To aid in explaining the change's effects, I will define some terms.  A
"logical wrap" occurs when no range of 2^31 integers covers the set of
MultiXactId stored in xmax fields.  A "file-level wrap" occurs when there
exists a pair of pg_multixact/offsets segment files such that:

  | segno_a * SLRU_PAGES_PER_SEGMENT * MULTIXACT_OFFSETS_PER_PAGE -
    segno_b * SLRU_PAGES_PER_SEGMENT * MULTIXACT_OFFSETS_PER_PAGE | > 2^31

A logical wrap implies either a file-level wrap or missing visibility
metadata, but a file-level wrap does not imply other consequences.  The
SlruScanDirCbFindEarliest() test is usually redundant with
find_multixact_start(), because MultiXactIdPrecedes(oldestMXact, earliest)
almost implies that find_multixact_start() will fail.  The exception arises
when pg_multixact/offsets files compose a file-level wrap, which can happen
when TruncateMultiXact() fails to unlink segments as planned.  When it does
happen, the result of SlruScanDirCbFindEarliest(), and therefore the computed
"earliest" value, is undefined.  (This outcome is connected to our requirement
to use only half the pg_clog or pg_multixact/offsets address space at any one
time.  The PagePrecedes callbacks for these SLRUs cease to be transitive if
more than half their address space is in use.)

The SlruScanDirCbFindEarliest() test can be helpful when a file-level wrap
coexists with incorrect oldestMultiXactId (extant xmax values define "correct
oldestMultiXactId").  If we're lucky with readdir() order, the test will block
truncation so we don't delete still-needed segments.  I am content to lose
that, because (a) the code isn't reliable for (or even directed toward) that
purpose and (b) sites running on today's latest point releases no longer have
incorrect oldestMultiXactId.

Attachment

Re: Rework the way multixact truncations work

From
Andres Freund
Date:
On November 8, 2015 12:54:07 AM PST, Noah Misch <noah@leadboat.com> wrote:

>I have pushed a stack of branches to
>https://github.com/nmisch/postgresql.git:
>
>mxt0-revert - reverts commits 4f627f8 and aa29c1c
>mxt1-disk-independent - see below
>mxt2-cosmetic - update already-wrong comments and formatting
>mxt3-main - replaces commit 4f627f8
>mxt4-rm-legacy - replaces commit aa29c1c
>
>The plan is to squash each branch into one PostgreSQL commit.  In
>addition to
>examining overall "git diff mxt2-cosmetic mxt3-main", I recommend
>reviewing
>itemized changes and commit log entries in "git log -p --reverse
>--no-merges
>mxt2-cosmetic..mxt3-main".  In particular, when a change involves
>something
>you discussed upthread or was otherwise not obvious, I put a statement
>of
>rationale in the commit log.

I'm not following along right now - in order to make cleanups the plan is to revert a couple commits and then redo them
prettyfied?

Andres
Hi
--- 
Please excuse brevity and formatting - I am writing this on my mobile phone.



Re: Rework the way multixact truncations work

From
Noah Misch
Date:
On Sun, Nov 08, 2015 at 11:11:42AM -0800, Andres Freund wrote:
> On November 8, 2015 12:54:07 AM PST, Noah Misch <noah@leadboat.com> wrote:
> 
> >I have pushed a stack of branches to
> >https://github.com/nmisch/postgresql.git:
> >
> >mxt0-revert - reverts commits 4f627f8 and aa29c1c
> >mxt1-disk-independent - see below
> >mxt2-cosmetic - update already-wrong comments and formatting
> >mxt3-main - replaces commit 4f627f8
> >mxt4-rm-legacy - replaces commit aa29c1c
> >
> >The plan is to squash each branch into one PostgreSQL commit.  In
> >addition to
> >examining overall "git diff mxt2-cosmetic mxt3-main", I recommend
> >reviewing
> >itemized changes and commit log entries in "git log -p --reverse
> >--no-merges
> >mxt2-cosmetic..mxt3-main".  In particular, when a change involves
> >something
> >you discussed upthread or was otherwise not obvious, I put a statement
> >of
> >rationale in the commit log.
> 
> I'm not following along right now - in order to make cleanups the plan is to revert a couple commits and then redo
themprettyfied?
 

Yes, essentially.  Given the volume of updates, this seemed neater than
framing those updates as in-tree incremental development.



Re: Rework the way multixact truncations work

From
Andres Freund
Date:
On November 8, 2015 11:52:05 AM PST, Noah Misch <noah@leadboat.com> wrote:
>On Sun, Nov 08, 2015 at 11:11:42AM -0800, Andres Freund wrote:
>> On November 8, 2015 12:54:07 AM PST, Noah Misch <noah@leadboat.com>
>wrote:
>> 
>> >I have pushed a stack of branches to
>> >https://github.com/nmisch/postgresql.git:
>> >
>> >mxt0-revert - reverts commits 4f627f8 and aa29c1c
>> >mxt1-disk-independent - see below
>> >mxt2-cosmetic - update already-wrong comments and formatting
>> >mxt3-main - replaces commit 4f627f8
>> >mxt4-rm-legacy - replaces commit aa29c1c
>> >
>> >The plan is to squash each branch into one PostgreSQL commit.  In
>> >addition to
>> >examining overall "git diff mxt2-cosmetic mxt3-main", I recommend
>> >reviewing
>> >itemized changes and commit log entries in "git log -p --reverse
>> >--no-merges
>> >mxt2-cosmetic..mxt3-main".  In particular, when a change involves
>> >something
>> >you discussed upthread or was otherwise not obvious, I put a
>statement
>> >of
>> >rationale in the commit log.
>> 
>> I'm not following along right now - in order to make cleanups the
>plan is to revert a couple commits and then redo them prettyfied?
>
>Yes, essentially.  Given the volume of updates, this seemed neater than
>framing those updates as in-tree incremental development.

I don't like that plan. I don't have a problem doing that in some development branch somewhere, but I fail to see any
benefitdoing that in 9.5/master. It'll just make the history more convoluted for no benefit.
 

I'll obviously still review the changes.

Even for review it's nor particularly convenient, because now the entirety of the large changes essentially needs to be
reviewedanew, given they're not the same. 
 

--- 
Please excuse brevity and formatting - I am writing this on my mobile phone.



Re: Rework the way multixact truncations work

From
Noah Misch
Date:
On Sun, Nov 08, 2015 at 11:59:33AM -0800, Andres Freund wrote:
> On November 8, 2015 11:52:05 AM PST, Noah Misch <noah@leadboat.com> wrote:
> >On Sun, Nov 08, 2015 at 11:11:42AM -0800, Andres Freund wrote:
> >> On November 8, 2015 12:54:07 AM PST, Noah Misch <noah@leadboat.com> wrote:
> >> 
> >> >I have pushed a stack of branches to
> >> >https://github.com/nmisch/postgresql.git:
> >> >
> >> >mxt0-revert - reverts commits 4f627f8 and aa29c1c
> >> >mxt1-disk-independent - see below
> >> >mxt2-cosmetic - update already-wrong comments and formatting
> >> >mxt3-main - replaces commit 4f627f8
> >> >mxt4-rm-legacy - replaces commit aa29c1c
> >> >
> >> >The plan is to squash each branch into one PostgreSQL commit.  In
> >> >addition to
> >> >examining overall "git diff mxt2-cosmetic mxt3-main", I recommend
> >> >reviewing
> >> >itemized changes and commit log entries in "git log -p --reverse
> >> >--no-merges
> >> >mxt2-cosmetic..mxt3-main".  In particular, when a change involves
> >> >something
> >> >you discussed upthread or was otherwise not obvious, I put a
> >statement
> >> >of
> >> >rationale in the commit log.
> >> 
> >> I'm not following along right now - in order to make cleanups the
> >plan is to revert a couple commits and then redo them prettyfied?
> >
> >Yes, essentially.  Given the volume of updates, this seemed neater than
> >framing those updates as in-tree incremental development.
> 
> I don't like that plan. I don't have a problem doing that in some development branch somewhere, but I fail to see any
benefitdoing that in 9.5/master. It'll just make the history more convoluted for no benefit.
 
> 
> I'll obviously still review the changes.

Cleanliness of history is precisely why I did it this way.  If I had framed
the changes as in-tree incremental development, no one "git diff" command
would show the truncation rework or a coherent subset.  To review the whole,
students of this code might resort to a cherry-pick of the repair commits onto
aa29c1c.  That, too, proves dissatisfying; the history would nowhere carry a
finished version of legacy truncation support.  A hacker opting to back-patch
in the future, as commit 4f627f8 contemplated, would need to dig through this
thread for the bits added in mxt3-main and removed in mxt4-rm-legacy.

The benefits may become clearer as you continue to review the branches.

> Even for review it's nor particularly convenient, because now the entirety of the large changes essentially needs to
bereviewed anew, given they're not the same. 
 

Agreed; I optimized for future readers, and I don't doubt this is less
convenient for you and for others already familiar with commits 4f627f8 and
aa29c1c.  I published branches, not squashed patches, mostly because I think
the individual branch commits will facilitate your study of the changes.  I
admit the cost to you remains high.



Re: Rework the way multixact truncations work

From
Peter Geoghegan
Date:
On Sun, Nov 8, 2015 at 11:52 AM, Noah Misch <noah@leadboat.com> wrote:
>> I'm not following along right now - in order to make cleanups the plan is to revert a couple commits and then redo
themprettyfied?
 
>
> Yes, essentially.  Given the volume of updates, this seemed neater than
> framing those updates as in-tree incremental development.

I think that's an odd way of representing this work. I tend to
remember roughly when major things were committed even years later. An
outright revert should represent a total back out of the original
commit IMV. Otherwise, a git blame can be quite misleading. I can
imagine questioning my recollection, even when it is accurate, if only
because I don't expect this.

-- 
Peter Geoghegan



Re: Rework the way multixact truncations work

From
Noah Misch
Date:
On Mon, Nov 23, 2015 at 11:44:45AM -0800, Peter Geoghegan wrote:
> On Sun, Nov 8, 2015 at 11:52 AM, Noah Misch <noah@leadboat.com> wrote:
> >> I'm not following along right now - in order to make cleanups the plan is to revert a couple commits and then redo
themprettyfied?
 
> >
> > Yes, essentially.  Given the volume of updates, this seemed neater than
> > framing those updates as in-tree incremental development.
> 
> I think that's an odd way of representing this work. I tend to
> remember roughly when major things were committed even years later. An
> outright revert should represent a total back out of the original
> commit IMV. Otherwise, a git blame can be quite misleading.

I think you're saying that "clearer git blame" is a more-important reason than
"volume of updates" for preferring an outright revert over in-tree incremental
development.  Fair preference.  If that's a correct reading of your message,
then we do agree on the bottom line.



Re: Rework the way multixact truncations work

From
Robert Haas
Date:
On Fri, Nov 27, 2015 at 5:16 PM, Noah Misch <noah@leadboat.com> wrote:
> On Mon, Nov 23, 2015 at 11:44:45AM -0800, Peter Geoghegan wrote:
>> On Sun, Nov 8, 2015 at 11:52 AM, Noah Misch <noah@leadboat.com> wrote:
>> >> I'm not following along right now - in order to make cleanups the plan is to revert a couple commits and then
redothem prettyfied?
 
>> >
>> > Yes, essentially.  Given the volume of updates, this seemed neater than
>> > framing those updates as in-tree incremental development.
>>
>> I think that's an odd way of representing this work. I tend to
>> remember roughly when major things were committed even years later. An
>> outright revert should represent a total back out of the original
>> commit IMV. Otherwise, a git blame can be quite misleading.
>
> I think you're saying that "clearer git blame" is a more-important reason than
> "volume of updates" for preferring an outright revert over in-tree incremental
> development.  Fair preference.  If that's a correct reading of your message,
> then we do agree on the bottom line.

Hmm.  I read Peter's message as agreeing with Andres rather than with
you.  And I have to say I agree with Andres as well.  I think it's
weird to back a commit out only to put a bunch of very similar stuff
back in.  Even figuring out what you've actually changed here seems
rather hard.  I couldn't get github to give me a diff showing your
changes vs. master.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Rework the way multixact truncations work

From
Peter Geoghegan
Date:
On Tue, Dec 1, 2015 at 2:07 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Hmm.  I read Peter's message as agreeing with Andres rather than with
> you.  And I have to say I agree with Andres as well.  I think it's
> weird to back a commit out only to put a bunch of very similar stuff
> back in.

Your interpretation was correct. I think it's surprising to structure
things this way, especially since we haven't done things this way in
the past.

-- 
Peter Geoghegan



Re: Rework the way multixact truncations work

From
Noah Misch
Date:
On Tue, Dec 01, 2015 at 05:07:15PM -0500, Robert Haas wrote:
> I think it's weird to back a commit out only to put a bunch of very similar
> stuff back in.

I agree with that.  If the original patches and their replacements shared 95%
of diff lines in common, we wouldn't be having this conversation.  These
replacements redo closer to 50% of the lines, so the patches are not very
similar by verbatim line comparison.  Even so, let's stipulate that my
proposal is weird.  I'd rather be weird than lose one of the benefits I
articulated upthread, let alone all of them.

My post-commit review of RLS woke me up to the problems of gradually finishing
work in-tree.  By the time I started that review in 2015-06, the base RLS
feature already spanned twenty-two commits.  (That count has since more than
doubled.)  Reviewing each commit threatened to be wasteful, because I would
presumably find matters already fixed later.  I tried to cherry-pick the
twenty-two commits onto a branch, hoping to review the overall diff as "git
diff master...squash-rls", but that yielded complex merge conflicts.  Where
the conflicts were too much, I reviewed entire files instead.  (Granted, no
matter how this thread ends, I do not expect an outcome that opaque.)  Hackers
have been too reticent to revert and redo defect-laden commits.  If doing that
is weird today, let it be normal.

> Even figuring out what you've actually changed here seems
> rather hard.  I couldn't get github to give me a diff showing your
> changes vs. master.

If you, an expert in the 2015-09-26 commits, want to study the key changes I
made, I recommend perusing these outputs:

git remote add nmisch_github https://github.com/nmisch/postgresql.git
git fetch nmisch_github
git diff nmisch_github/mxt0-revert nmisch_github/mxt1-disk-independent
git log -p --reverse --no-merges nmisch_github/mxt2-cosmetic..nmisch_github/mxt3-main

For the overall diff vs. master that you sought:

git remote add community git://git.postgresql.org/git/postgresql.git
git remote add nmisch_github https://github.com/nmisch/postgresql.git
git fetch --multiple community nmisch_github
git diff community/master...nmisch_github/mxt4-rm-legacy

If anyone not an author or reviewer of the 2015-09-26 commits wishes to review
the work, don't read the above diffs; the intermediate states are
uninteresting.  Read these four:

git remote add nmisch_github https://github.com/nmisch/postgresql.git
git fetch nmisch_github
git diff nmisch_github/mxt0-revert nmisch_github/mxt1-disk-independent
git diff nmisch_github/mxt1-disk-independent nmisch_github/mxt2-cosmetic
git diff nmisch_github/mxt2-cosmetic nmisch_github/mxt3-main
git diff nmisch_github/mxt3-main nmisch_github/mxt4-rm-legacy

Thanks,
nm



Re: Rework the way multixact truncations work

From
Andres Freund
Date:
On 2015-12-02 09:57:19 -0500, Noah Misch wrote:
> On Tue, Dec 01, 2015 at 05:07:15PM -0500, Robert Haas wrote:
> > I think it's weird to back a commit out only to put a bunch of very similar
> > stuff back in.
>
> I agree with that.  If the original patches and their replacements shared 95%
> of diff lines in common, we wouldn't be having this conversation. These
> replacements redo closer to 50% of the lines, so the patches are not very
> similar by verbatim line comparison.

Which is a huge problem, because it makes it very hard to see what your
changes actually do. And a significant portion of the changes relative
to master aren't particularly critical. Which is easy to see if if a
commit only changes comments, but harder if you see one commit reverting
things, and another redoing most of the same things.

> Hackers have been too reticent to revert and redo defect-laden
> commits.  If doing that is weird today, let it be normal.

Why? Especially if reverting and redoing includes conflicts that mainly
increases the chance of accidental bugs.

> git remote add community git://git.postgresql.org/git/postgresql.git
> git remote add nmisch_github https://github.com/nmisch/postgresql.git
> git fetch --multiple community nmisch_github
> git diff community/master...nmisch_github/mxt4-rm-legacy

That's a nearly useless diff, because it includes a lot of other changes
(218 files changed, 2828 insertions(+), 8742 deletions(-)) made since
you published the changes. What kinda works is
git diff $(git merge-base community/master nmisch_github/mxt4-rm-legacy)..nmisch_github/mxt4-rm-legacy
which shows the diff to the version of master you start off from.

Review of the above diff:
> @@ -2013,7 +2017,7 @@ TrimMultiXact(void)
>  {
>      MultiXactId nextMXact;
>      MultiXactOffset offset;
> -    MultiXactId oldestMXact;
> +    MultiXactId    oldestMXact;

That's a bit weird, given that nextMXact isn't indented...


> @@ -2190,7 +2194,8 @@ MultiXactSetNextMXact(MultiXactId nextMulti,
> -    /*
> -     * Computing the actual limits is only possible once the data directory is
> -     * in a consistent state. There's no need to compute the limits while
> -     * still replaying WAL - no decisions about new multis are made even
> -     * though multixact creations might be replayed. So we'll only do further
> -     * checks after TrimMultiXact() has been called.
> -     */
> +    /* Before the TrimMultiXact() call at end of recovery, skip the rest. */
>      if (!MultiXactState->finishedStartup)
>          return;
> -
>      Assert(!InRecovery);
> 
> -    /* Set limits for offset vacuum. */
> +    /*
> +     * Setting MultiXactState->oldestOffset entails a find_multixact_start()
> +     * call, which is only possible once the data directory is in a consistent
> +     * state. There's no need for an offset limit while still replaying WAL;
> +     * no decisions about new multis are made even though multixact creations
> +     * might be replayed.
> +     */
>      needs_offset_vacuum = SetOffsetVacuumLimit();

I don't really see the benefit of this change. The previous location of
the comment is where we return early, so it seems appropriate to
document the reason there?

>      /*
> @@ -2354,6 +2356,12 @@ MultiXactAdvanceNextMXact(MultiXactId minMulti,
>          debug_elog3(DEBUG2, "MultiXact: setting next multi to %u", minMulti);
>          MultiXactState->nextMXact = minMulti;
>      }
> +
> +    /*
> +     * MultiXactOffsetPrecedes() gives the wrong answer if nextOffset would
> +     * advance more than 2^31 between calls.  Since we get a call for each
> +     * XLOG_MULTIXACT_CREATE_ID, that should never happen.
> +     */

Independent comment improvement. Good idea though.


>  /*
> - * Update our oldestMultiXactId value, but only if it's more recent than what
> - * we had.
> - *
> - * This may only be called during WAL replay.
> + * Update our oldestMultiXactId value, but only if it's more recent than
> + * what we had.  This may only be called during WAL replay.
>   */

Whatever?

>  void
>  MultiXactAdvanceOldest(MultiXactId oldestMulti, Oid oldestMultiDB)
> @@ -2544,14 +2550,13 @@ GetOldestMultiXactId(void)
>  static bool
>  SetOffsetVacuumLimit(void)
>  {
> -    MultiXactId oldestMultiXactId;
> +    MultiXactId    oldestMultiXactId;
>      MultiXactId nextMXact;
> -    MultiXactOffset oldestOffset = 0;    /* placate compiler */
> -    MultiXactOffset prevOldestOffset;
> +    MultiXactOffset oldestOffset = 0;        /* placate compiler */
>      MultiXactOffset nextOffset;
>      bool        oldestOffsetKnown = false;
> +    MultiXactOffset prevOldestOffset;
>      bool        prevOldestOffsetKnown;
> -    MultiXactOffset offsetStopLimit = 0;

I don't see the benefit of the order changes here.

> @@ -2588,40 +2590,50 @@ SetOffsetVacuumLimit(void)
>      else
>      {
>          /*
> -         * Figure out where the oldest existing multixact's offsets are
> -         * stored. Due to bugs in early release of PostgreSQL 9.3.X and 9.4.X,
> -         * the supposedly-earliest multixact might not really exist.  We are
> +         * Figure out where the oldest existing multixact's offsets are stored.
> +         * Due to bugs in early release of PostgreSQL 9.3.X and 9.4.X, the
> +         * supposedly-earliest multixact might not really exist.  We are
>           * careful not to fail in that case.
>           */
>          oldestOffsetKnown =
>              find_multixact_start(oldestMultiXactId, &oldestOffset);
> -
> -        if (oldestOffsetKnown)
> -            ereport(DEBUG1,
> -                    (errmsg("oldest MultiXactId member is at offset %u",
> -                            oldestOffset)));

That's imo a rather useful debug message.

> -        else
> +        if (!oldestOffsetKnown)
> +        {
> +            /* XXX This message is incorrect if prevOldestOffsetKnown. */
>              ereport(LOG,
>                      (errmsg("MultiXact member wraparound protections are disabled because oldest checkpointed
MultiXact%u does not exist on disk",
 
>                              oldestMultiXactId)));
> +        }
>      }

Hm, the XXX is a "problem" in all 9.3+ - should we just fix it everywhere?

>      LWLockRelease(MultiXactTruncationLock);
> 
>      /*
> -     * If we can, compute limits (and install them MultiXactState) to prevent
> -     * overrun of old data in the members SLRU area. We can only do so if the
> -     * oldest offset is known though.
> +     * There's no need to update anything if we don't know the oldest offset
> +     * or if it hasn't changed.
>       */

Is that really a worthwhile optimization?

> -typedef struct mxtruncinfo
> -{
> -    int            earliestExistingPage;
> -} mxtruncinfo;
> -
> -/*
> - * SlruScanDirectory callback
> - *        This callback determines the earliest existing page number.
> - */
> -static bool
> -SlruScanDirCbFindEarliest(SlruCtl ctl, char *filename, int segpage, void *data)
> -{
> -    mxtruncinfo *trunc = (mxtruncinfo *) data;
> -
> -    if (trunc->earliestExistingPage == -1 ||
> -        ctl->PagePrecedes(segpage, trunc->earliestExistingPage))
> -    {
> -        trunc->earliestExistingPage = segpage;
> -    }
> -
> -    return false;                /* keep going */
> -}
> -

That really seems like an independent change, deserving its own commit +
explanation. Just referring to "See mailing list submission notes for
rationale." makes understanding the change later imo much harder than
all the incremental commits you try to avoid.

> /*
> - * Decide which of two MultiXactMember page numbers is "older" for truncation
> - * purposes.  There is no "invalid offset number" so use the numbers verbatim.
> + * Dummy notion of which of two MultiXactMember page numbers is "older".
> + *
> + * Due to the MultiXactOffsetPrecedes() specification, this function's result
> + * is meaningless unless the system is preserving less than 2^31 members.  It
> + * is adequate for SlruSelectLRUPage() guessing the cheapest slot to reclaim.
> + * Do not pass MultiXactMemberCtl to any of the functions that use the
> + * PagePrecedes callback in other ways.
>   */
>  static bool
>  MultiXactMemberPagePrecedes(int page1, int page2)
> @@ -3157,6 +3134,10 @@ MultiXactIdPrecedesOrEquals(MultiXactId multi1, MultiXactId multi2)
> 
>  /*
>   * Decide which of two offsets is earlier.
> + *
> + * Avoid calling this function.  pg_multixact/members can preserve almost 2^32
> + * members at any given time, but this function is transitive only when the
> + * system is preserving less than 2^31 members.
>   */
>  static bool
>  MultiXactOffsetPrecedes(MultiXactOffset offset1, MultiXactOffset offset2)

As mentioned before, these really seem unrelated.

> @@ -1237,24 +1235,25 @@ SlruDeleteSegment(SlruCtl ctl, int segno)
>      SlruShared    shared = ctl->shared;
>      int            slotno;
>      char        path[MAXPGPATH];
> -    bool        did_write;
> 
>      /* Clean out any possibly existing references to the segment. */
>      LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);
>  restart:
> -    did_write = false;
>      for (slotno = 0; slotno < shared->num_slots; slotno++)
>      {
> -        int            pagesegno = shared->page_number[slotno] / SLRU_PAGES_PER_SEGMENT;
> +        int            pagesegno;
> 
>          if (shared->page_status[slotno] == SLRU_PAGE_EMPTY)
>              continue;
> 
>          /* not the segment we're looking for */
> +        pagesegno = shared->page_number[slotno] / SLRU_PAGES_PER_SEGMENT;
>          if (pagesegno != segno)
>              continue;
> 
> -        /* If page is clean, just change state to EMPTY (expected case). */
> +        /*
> +         * If page is clean, just change state to EMPTY (expected case).
> +         */
>          if (shared->page_status[slotno] == SLRU_PAGE_VALID &&
>              !shared->page_dirty[slotno])
>          {
> @@ -1267,18 +1266,10 @@ restart:
>              SlruInternalWritePage(ctl, slotno, NULL);
>          else
>              SimpleLruWaitIO(ctl, slotno);
> -
> -        did_write = true;
> -    }
> -
> -    /*
> -     * Be extra careful and re-check. The IO functions release the control
> -     * lock, so new pages could have been read in.
> -     */
> -    if (did_write)
>          goto restart;
> +    }

I don't think that's really a good idea - this way we restart after
every single page write, whereas currently we only restart after passing
through all pages once. In nearly all cases we'll only ever have to
retry once in total, be because such old pages aren't usually going to
be reread/redirtied.

> @@ -9216,10 +9212,8 @@ xlog_redo(XLogReaderState *record)
>          LWLockRelease(OidGenLock);
>          MultiXactSetNextMXact(checkPoint.nextMulti,
>                                checkPoint.nextMultiOffset);
> -
> -        MultiXactAdvanceOldest(checkPoint.oldestMulti,
> -                               checkPoint.oldestMultiDB);
>          SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
> +        SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB);
> 
>          /*
>           * If we see a shutdown checkpoint while waiting for an end-of-backup
> @@ -9309,17 +9303,13 @@ xlog_redo(XLogReaderState *record)
>          LWLockRelease(OidGenLock);
>          MultiXactAdvanceNextMXact(checkPoint.nextMulti,
>                                    checkPoint.nextMultiOffset);
> -
> -        /*
> -         * NB: This may perform multixact truncation when replaying WAL
> -         * generated by an older primary.
> -         */
> -        MultiXactAdvanceOldest(checkPoint.oldestMulti,
> -                               checkPoint.oldestMultiDB);
>          if (TransactionIdPrecedes(ShmemVariableCache->oldestXid,
>                                    checkPoint.oldestXid))
>              SetTransactionIdLimit(checkPoint.oldestXid,
>                                    checkPoint.oldestXidDB);
> +        MultiXactAdvanceOldest(checkPoint.oldestMulti,
> +                               checkPoint.oldestMultiDB);
> +
>          /* ControlFile->checkPointCopy always tracks the latest ckpt XID */
>          ControlFile->checkPointCopy.nextXidEpoch = checkPoint.nextXidEpoch;
>          ControlFile->checkPointCopy.nextXid =
>          checkPoint.nextXid;

Why?

> diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
> index 7c4ef58..e2b4f4c 100644
> --- a/src/backend/commands/vacuum.c
> +++ b/src/backend/commands/vacuum.c
> @@ -1136,9 +1136,6 @@ vac_truncate_clog(TransactionId frozenXID,
>      if (bogus)
>          return;
> 
> -    /*
> -     * Truncate CLOG, multixact and CommitTs to the oldest computed value.
> -     */
>      TruncateCLOG(frozenXID);
>      TruncateCommitTs(frozenXID);
>      TruncateMultiXact(minMulti, minmulti_datoid);

Why? Sure, it's not a super important comment, but ...?

> diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
> index 0dc4117..41e51cf 100644
> --- a/src/backend/storage/ipc/procarray.c
> +++ b/src/backend/storage/ipc/procarray.c
> @@ -2192,10 +2192,8 @@ GetOldestSafeDecodingTransactionId(void)
> 
>  /*
>   * GetVirtualXIDsDelayingChkpt -- Get the VXIDs of transactions that are
> - * delaying checkpoint because they have critical actions in progress.
> - *
> - * Constructs an array of VXIDs of transactions that are currently in commit
> - * critical sections, as shown by having delayChkpt set in their PGXACT.
> + * delaying checkpoint because they have critical actions in progress, as
> + * shown by having delayChkpt set in their PGXACT.
> 
>   * Returns a palloc'd array that should be freed by the caller.
>   * *nvxids is the number of valid entries.
> @@ -2204,8 +2202,8 @@ GetOldestSafeDecodingTransactionId(void)
>   * the result is somewhat indeterminate, but we don't really care.  Even in
>   * a multiprocessor with delayed writes to shared memory, it should be certain
>   * that setting of delayChkpt will propagate to shared memory when the backend
> - * takes a lock, so we cannot fail to see a virtual xact as delayChkpt if
> - * it's already inserted its commit record.  Whether it takes a little while
> + * takes a lock, so we cannot fail to see a virtual xact as delayChkpt if it's
> + * already inserted its critical xlog record.  Whether it takes a little while
>   * for clearing of delayChkpt to propagate is unimportant for correctness.
>   */

Seems unrelated, given that this is already used in
MarkBufferDirtyHint(). Don't get me wrong, I think the changes are a
good idea, but it's not really tied to the truncation changes.


- Andres



Re: Rework the way multixact truncations work

From
Noah Misch
Date:
On Wed, Dec 02, 2015 at 04:46:26PM +0100, Andres Freund wrote:
> On 2015-12-02 09:57:19 -0500, Noah Misch wrote:
> > Hackers have been too reticent to revert and redo defect-laden
> > commits.  If doing that is weird today, let it be normal.
> 
> Why?

See my paragraph ending with the two sentences you quoted.

> Especially if reverting and redoing includes conflicts that mainly
> increases the chance of accidental bugs.

True.  (That doesn't apply to these patches.)

> > git remote add community git://git.postgresql.org/git/postgresql.git
> > git remote add nmisch_github https://github.com/nmisch/postgresql.git
> > git fetch --multiple community nmisch_github
> > git diff community/master...nmisch_github/mxt4-rm-legacy
> 
> That's a nearly useless diff, because it includes a lot of other changes
> (218 files changed, 2828 insertions(+), 8742 deletions(-)) made since
> you published the changes.

Perhaps you used "git diff a..b", not "git diff a...b".  If not, please send
the outputs of "git rev-parse community/master nmisch_github/mxt4-rm-legacy"
and "git --version".

> > @@ -2190,7 +2194,8 @@ MultiXactSetNextMXact(MultiXactId nextMulti,
> > -    /*
> > -     * Computing the actual limits is only possible once the data directory is
> > -     * in a consistent state. There's no need to compute the limits while
> > -     * still replaying WAL - no decisions about new multis are made even
> > -     * though multixact creations might be replayed. So we'll only do further
> > -     * checks after TrimMultiXact() has been called.
> > -     */
> > +    /* Before the TrimMultiXact() call at end of recovery, skip the rest. */
> >      if (!MultiXactState->finishedStartup)
> >          return;
> > -
> >      Assert(!InRecovery);
> > 
> > -    /* Set limits for offset vacuum. */
> > +    /*
> > +     * Setting MultiXactState->oldestOffset entails a find_multixact_start()
> > +     * call, which is only possible once the data directory is in a consistent
> > +     * state. There's no need for an offset limit while still replaying WAL;
> > +     * no decisions about new multis are made even though multixact creations
> > +     * might be replayed.
> > +     */
> >      needs_offset_vacuum = SetOffsetVacuumLimit();
> 
> I don't really see the benefit of this change. The previous location of
> the comment is where we return early, so it seems appropriate to
> document the reason there?

I made that low-importance change for two reasons.  First, returning at that
point skips more than just the setting a limit; it also skips autovacuum
signalling and wraparound warnings.  Second, the function has just computed
mxid "actual limits", so branch mxt2-cosmetic made the comment specify that we
defer an offset limit, not any and all limits.

> >  static bool
> >  SetOffsetVacuumLimit(void)
> >  {
> > -    MultiXactId oldestMultiXactId;
> > +    MultiXactId    oldestMultiXactId;
> >      MultiXactId nextMXact;
> > -    MultiXactOffset oldestOffset = 0;    /* placate compiler */
> > -    MultiXactOffset prevOldestOffset;
> > +    MultiXactOffset oldestOffset = 0;        /* placate compiler */
> >      MultiXactOffset nextOffset;
> >      bool        oldestOffsetKnown = false;
> > +    MultiXactOffset prevOldestOffset;
> >      bool        prevOldestOffsetKnown;
> > -    MultiXactOffset offsetStopLimit = 0;
> 
> I don't see the benefit of the order changes here.

I reacted the same way.  Commit 4f627f8 reordered some declarations, which I
reverted when I refinished that commit as branch mxt3-main.

> > -        if (oldestOffsetKnown)
> > -            ereport(DEBUG1,
> > -                    (errmsg("oldest MultiXactId member is at offset %u",
> > -                            oldestOffset)));
> 
> That's imo a rather useful debug message.

The branches emit that message at the same times 4f627f8^ and earlier emit it.

> > -        else
> > +        if (!oldestOffsetKnown)
> > +        {
> > +            /* XXX This message is incorrect if prevOldestOffsetKnown. */
> >              ereport(LOG,
> >                      (errmsg("MultiXact member wraparound protections are disabled because oldest checkpointed
MultiXact%u does not exist on disk",
 
> >                              oldestMultiXactId)));
> > +        }
> >      }
> 
> Hm, the XXX is a "problem" in all 9.3+ - should we just fix it everywhere?

I welcome a project to fix it.

> >      LWLockRelease(MultiXactTruncationLock);
> > 
> >      /*
> > -     * If we can, compute limits (and install them MultiXactState) to prevent
> > -     * overrun of old data in the members SLRU area. We can only do so if the
> > -     * oldest offset is known though.
> > +     * There's no need to update anything if we don't know the oldest offset
> > +     * or if it hasn't changed.
> >       */
> 
> Is that really a worthwhile optimization?

I would neither remove that longstanding optimization nor add it from scratch
today.  Branch commit 06c9979 restored it as part of a larger restoration to
the pre-4f627f8 structure of SetOffsetVacuumLimit().

> > -typedef struct mxtruncinfo
> > -{
> > -    int            earliestExistingPage;
> > -} mxtruncinfo;
> > -
> > -/*
> > - * SlruScanDirectory callback
> > - *        This callback determines the earliest existing page number.
> > - */
> > -static bool
> > -SlruScanDirCbFindEarliest(SlruCtl ctl, char *filename, int segpage, void *data)
> > -{
> > -    mxtruncinfo *trunc = (mxtruncinfo *) data;
> > -
> > -    if (trunc->earliestExistingPage == -1 ||
> > -        ctl->PagePrecedes(segpage, trunc->earliestExistingPage))
> > -    {
> > -        trunc->earliestExistingPage = segpage;
> > -    }
> > -
> > -    return false;                /* keep going */
> > -}
> > -
> 
> That really seems like an independent change, deserving its own commit +
> explanation.

Indeed.  I explained that change at length in
http://www.postgresql.org/message-id/20151108085407.GA1097830@tornado.leadboat.com,
including that it's alone on a branch (mxt1-disk-independent), to become its
own PostgreSQL commit.

> > [branch commit 89a7232]
> 
> I don't think that's really a good idea - this way we restart after
> every single page write, whereas currently we only restart after passing
> through all pages once. In nearly all cases we'll only ever have to
> retry once in total, be because such old pages aren't usually going to
> be reread/redirtied.

Your improvement sounds fine, then.  Would both SimpleLruTruncate() and
SlruDeleteSegment() benefit from it?

> > @@ -9216,10 +9212,8 @@ xlog_redo(XLogReaderState *record)
> >          LWLockRelease(OidGenLock);
> >          MultiXactSetNextMXact(checkPoint.nextMulti,
> >                                checkPoint.nextMultiOffset);
> > -
> > -        MultiXactAdvanceOldest(checkPoint.oldestMulti,
> > -                               checkPoint.oldestMultiDB);
> >          SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
> > +        SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB);
> > 
> >          /*
> >           * If we see a shutdown checkpoint while waiting for an end-of-backup
> > @@ -9309,17 +9303,13 @@ xlog_redo(XLogReaderState *record)
> >          LWLockRelease(OidGenLock);
> >          MultiXactAdvanceNextMXact(checkPoint.nextMulti,
> >                                    checkPoint.nextMultiOffset);
> > -
> > -        /*
> > -         * NB: This may perform multixact truncation when replaying WAL
> > -         * generated by an older primary.
> > -         */
> > -        MultiXactAdvanceOldest(checkPoint.oldestMulti,
> > -                               checkPoint.oldestMultiDB);
> >          if (TransactionIdPrecedes(ShmemVariableCache->oldestXid,
> >                                    checkPoint.oldestXid))
> >              SetTransactionIdLimit(checkPoint.oldestXid,
> >                                    checkPoint.oldestXidDB);
> > +        MultiXactAdvanceOldest(checkPoint.oldestMulti,
> > +                               checkPoint.oldestMultiDB);
> > +
> >          /* ControlFile->checkPointCopy always tracks the latest ckpt XID */
> >          ControlFile->checkPointCopy.nextXidEpoch = checkPoint.nextXidEpoch;
> >          ControlFile->checkPointCopy.nextXid =
> >          checkPoint.nextXid;
> 
> Why?

master does not and will not have legacy truncation, so the deleted comment
does not belong in master.  Regarding the SetMultiXactIdLimit() call:

commit 611a2ec
Author:     Noah Misch <noah@leadboat.com>
AuthorDate: Sat Nov 7 15:06:28 2015 -0500
Commit:     Noah Misch <noah@leadboat.com>
CommitDate: Sat Nov 7 15:06:28 2015 -0500
   In xlog_redo(), believe a SHUTDOWN checkPoint.oldestMulti exactly.      It was so before this branch.  This restores
consistencywith the   handling of nextXid, nextMulti and oldestMulti: we treat them as exact   for
XLOG_CHECKPOINT_SHUTDOWNand as minima for XLOG_CHECKPOINT_ONLINE.   I do not know of a case where this definitely
mattersfor any of these   counters.  It might matter if a bug causes oldestXid to move forward   wrongly, causing it to
thenmove backward later.  (I don't know if   VACUUM does ever move oldestXid backward, but it's a plausible thing to
doif on-disk state fully agrees with an older value.)  That example has   no counterpart for oldestMultiXactId, because
anyupdate first arrives   in an XLOG_MULTIXACT_TRUNCATE_ID record.  Therefore, this commit is   probably cosmetic.
 

> > -    /*
> > -     * Truncate CLOG, multixact and CommitTs to the oldest computed value.
> > -     */
> >      TruncateCLOG(frozenXID);
> >      TruncateCommitTs(frozenXID);
> >      TruncateMultiXact(minMulti, minmulti_datoid);
> 
> Why? Sure, it's not a super important comment, but ...?

Yeah, it scarcely matters either way.  Commit 4f627f8 reduced this comment to
merely restating the code, so I removed it instead.

> > @@ -2204,8 +2202,8 @@ GetOldestSafeDecodingTransactionId(void)
> >   * the result is somewhat indeterminate, but we don't really care.  Even in
> >   * a multiprocessor with delayed writes to shared memory, it should be certain
> >   * that setting of delayChkpt will propagate to shared memory when the backend
> > - * takes a lock, so we cannot fail to see a virtual xact as delayChkpt if
> > - * it's already inserted its commit record.  Whether it takes a little while
> > + * takes a lock, so we cannot fail to see a virtual xact as delayChkpt if it's
> > + * already inserted its critical xlog record.  Whether it takes a little while
> >   * for clearing of delayChkpt to propagate is unimportant for correctness.
> >   */
> 
> Seems unrelated, given that this is already used in
> MarkBufferDirtyHint(). Don't get me wrong, I think the changes are a
> good idea, but it's not really tied to the truncation changes.

Quite so; its branch (one branch = one proposed PostgreSQL commit),
mxt2-cosmetic, contains no truncation changes.  Likewise for the other
independent comment improvements you noted.

nm



Re: Rework the way multixact truncations work

From
Andres Freund
Date:
On 2015-12-03 04:38:45 -0500, Noah Misch wrote:
> On Wed, Dec 02, 2015 at 04:46:26PM +0100, Andres Freund wrote:
> > Especially if reverting and redoing includes conflicts that mainly
> > increases the chance of accidental bugs.
> 
> True.  (That doesn't apply to these patches.)

Uh, it does. You had conflicts in your process, and it's hard to verify
that the re-applied patch is actually functionally identical given the
volume of changes. It's much easier to see what actually changes by
looking at iterative commits forward from the current state.


Sorry, but I really just want to see these changes as iterative patches
ontop of 9.5/HEAD instead of this process. I won't revert the reversion
if you push it anyway, but I think it's a rather bad idea.


> > > git remote add community git://git.postgresql.org/git/postgresql.git
> > > git remote add nmisch_github https://github.com/nmisch/postgresql.git
> > > git fetch --multiple community nmisch_github
> > > git diff community/master...nmisch_github/mxt4-rm-legacy
> > 
> > That's a nearly useless diff, because it includes a lot of other changes
> > (218 files changed, 2828 insertions(+), 8742 deletions(-)) made since
> > you published the changes.
> 
> Perhaps you used "git diff a..b", not "git diff a...b".

Ah yes. Neat, didn't know that one.

> > > @@ -2190,7 +2194,8 @@ MultiXactSetNextMXact(MultiXactId nextMulti,
> > > -    /*
> > > -     * Computing the actual limits is only possible once the data directory is
> > > -     * in a consistent state. There's no need to compute the limits while
> > > -     * still replaying WAL - no decisions about new multis are made even
> > > -     * though multixact creations might be replayed. So we'll only do further
> > > -     * checks after TrimMultiXact() has been called.
> > > -     */
> > > +    /* Before the TrimMultiXact() call at end of recovery, skip the rest. */
> > >      if (!MultiXactState->finishedStartup)
> > >          return;
> > > -
> > >      Assert(!InRecovery);
> > > 
> > > -    /* Set limits for offset vacuum. */
> > > +    /*
> > > +     * Setting MultiXactState->oldestOffset entails a find_multixact_start()
> > > +     * call, which is only possible once the data directory is in a consistent
> > > +     * state. There's no need for an offset limit while still replaying WAL;
> > > +     * no decisions about new multis are made even though multixact creations
> > > +     * might be replayed.
> > > +     */
> > >      needs_offset_vacuum = SetOffsetVacuumLimit();
> > 
> > I don't really see the benefit of this change. The previous location of
> > the comment is where we return early, so it seems appropriate to
> > document the reason there?
> 
> I made that low-importance change for two reasons.  First, returning at that
> point skips more than just the setting a limit; it also skips autovacuum
> signalling and wraparound warnings.  Second, the function has just computed
> mxid "actual limits", so branch mxt2-cosmetic made the comment specify that we
> defer an offset limit, not any and all limits.

My question was more about the comment being after the "early return"
than about the content change, should have made that clearer. Can we
just move your comment up?

> > >  static bool
> > >  SetOffsetVacuumLimit(void)
> > >  {
> > > -    MultiXactId oldestMultiXactId;
> > > +    MultiXactId    oldestMultiXactId;
> > >      MultiXactId nextMXact;
> > > -    MultiXactOffset oldestOffset = 0;    /* placate compiler */
> > > -    MultiXactOffset prevOldestOffset;
> > > +    MultiXactOffset oldestOffset = 0;        /* placate compiler */
> > >      MultiXactOffset nextOffset;
> > >      bool        oldestOffsetKnown = false;
> > > +    MultiXactOffset prevOldestOffset;
> > >      bool        prevOldestOffsetKnown;
> > > -    MultiXactOffset offsetStopLimit = 0;
> > 
> > I don't see the benefit of the order changes here.
> 
> I reacted the same way.  Commit 4f627f8 reordered some declarations, which I
> reverted when I refinished that commit as branch mxt3-main.

But the other changes are there, and in the history anyway. As the new
order isn't more meaningful than the current one...

> > > -        if (oldestOffsetKnown)
> > > -            ereport(DEBUG1,
> > > -                    (errmsg("oldest MultiXactId member is at offset %u",
> > > -                            oldestOffset)));
> > 
> > That's imo a rather useful debug message.
> 
> The branches emit that message at the same times 4f627f8^ and earlier emit it.

During testing I found it rather helpful if it was emitted regularly.

> > >      LWLockRelease(MultiXactTruncationLock);
> > > 
> > >      /*
> > > -     * If we can, compute limits (and install them MultiXactState) to prevent
> > > -     * overrun of old data in the members SLRU area. We can only do so if the
> > > -     * oldest offset is known though.
> > > +     * There's no need to update anything if we don't know the oldest offset
> > > +     * or if it hasn't changed.
> > >       */
> > 
> > Is that really a worthwhile optimization?
> 
> I would neither remove that longstanding optimization nor add it from scratch
> today.  Branch commit 06c9979 restored it as part of a larger restoration to
> the pre-4f627f8 structure of SetOffsetVacuumLimit().

There DetermineSafeOldestOffset() did it unconditionally.

> > > -SlruScanDirCbFindEarliest(SlruCtl ctl, char *filename, int segpage, void *data)
> > 
> > That really seems like an independent change, deserving its own commit +
> > explanation.
> 
> Indeed.  I explained that change at length in
> http://www.postgresql.org/message-id/20151108085407.GA1097830@tornado.leadboat.com,
> including that it's alone on a branch (mxt1-disk-independent), to become its
> own PostgreSQL commit.

The comment there doesn't include the explanation...

> > > [branch commit 89a7232]
> > 
> > I don't think that's really a good idea - this way we restart after
> > every single page write, whereas currently we only restart after passing
> > through all pages once. In nearly all cases we'll only ever have to
> > retry once in total, be because such old pages aren't usually going to
> > be reread/redirtied.
> 
> Your improvement sounds fine, then.  Would both SimpleLruTruncate() and
> SlruDeleteSegment() benefit from it?

It probably makes sense to do it in SimpleLruTruncate too - but it does
additional checks as part of the restarts which aren't applicable for
DeleteSegment(), which is IIRC why I didn't also change it.

Greetings,

Andres Freund



Re: Rework the way multixact truncations work

From
Noah Misch
Date:
On Thu, Dec 03, 2015 at 07:03:21PM +0100, Andres Freund wrote:
> On 2015-12-03 04:38:45 -0500, Noah Misch wrote:
> > On Wed, Dec 02, 2015 at 04:46:26PM +0100, Andres Freund wrote:
> > > Especially if reverting and redoing includes conflicts that mainly
> > > increases the chance of accidental bugs.
> > 
> > True.  (That doesn't apply to these patches.)
> 
> Uh, it does. You had conflicts in your process, and it's hard to verify
> that the re-applied patch is actually functionally identical given the
> volume of changes. It's much easier to see what actually changes by
> looking at iterative commits forward from the current state.

Ah, we were talking about different topics after all.  I was talking about
_merge_ conflicts in a reversion commit.

> Sorry, but I really just want to see these changes as iterative patches
> ontop of 9.5/HEAD instead of this process. I won't revert the reversion
> if you push it anyway, but I think it's a rather bad idea.

I hear you.  I evaluated your request and judged that the benefits you cited
did not make up for the losses I cited.  Should you wish to change my mind,
your best bet is to find defects in the commits I proposed.  If I introduced
juicy defects, that discovery would lend much weight to your conjectures.

> My question was more about the comment being after the "early return"
> than about the content change, should have made that clearer. Can we
> just move your comment up?

Sure, I will.

> > > >  static bool
> > > >  SetOffsetVacuumLimit(void)
> > > >  {
> > > > -    MultiXactId oldestMultiXactId;
> > > > +    MultiXactId    oldestMultiXactId;
> > > >      MultiXactId nextMXact;
> > > > -    MultiXactOffset oldestOffset = 0;    /* placate compiler */
> > > > -    MultiXactOffset prevOldestOffset;
> > > > +    MultiXactOffset oldestOffset = 0;        /* placate compiler */
> > > >      MultiXactOffset nextOffset;
> > > >      bool        oldestOffsetKnown = false;
> > > > +    MultiXactOffset prevOldestOffset;
> > > >      bool        prevOldestOffsetKnown;
> > > > -    MultiXactOffset offsetStopLimit = 0;
> > > 
> > > I don't see the benefit of the order changes here.
> > 
> > I reacted the same way.  Commit 4f627f8 reordered some declarations, which I
> > reverted when I refinished that commit as branch mxt3-main.
> 
> But the other changes are there, and in the history anyway. As the new
> order isn't more meaningful than the current one...

Right.  A revert+redo patch series can and should purge formatting changes
that did not belong in its predecessor commits.  Alternate change delivery
strategies wouldn't do that.

> > > > -        if (oldestOffsetKnown)
> > > > -            ereport(DEBUG1,
> > > > -                    (errmsg("oldest MultiXactId member is at offset %u",
> > > > -                            oldestOffset)));
> > > 
> > > That's imo a rather useful debug message.
> > 
> > The branches emit that message at the same times 4f627f8^ and earlier emit it.
> 
> During testing I found it rather helpful if it was emitted regularly.

I wouldn't oppose a patch making it happen more often.

> > > >      LWLockRelease(MultiXactTruncationLock);
> > > > 
> > > >      /*
> > > > -     * If we can, compute limits (and install them MultiXactState) to prevent
> > > > -     * overrun of old data in the members SLRU area. We can only do so if the
> > > > -     * oldest offset is known though.
> > > > +     * There's no need to update anything if we don't know the oldest offset
> > > > +     * or if it hasn't changed.
> > > >       */
> > > 
> > > Is that really a worthwhile optimization?
> > 
> > I would neither remove that longstanding optimization nor add it from scratch
> > today.  Branch commit 06c9979 restored it as part of a larger restoration to
> > the pre-4f627f8 structure of SetOffsetVacuumLimit().
> 
> There DetermineSafeOldestOffset() did it unconditionally.

That is true; one won't be consistent with both.  06c9979 materially shortened
the final patch and eliminated some user-visible message emission changes.
Moreover, this is clearly a case of SetOffsetVacuumLimit() absorbing
DetermineSafeOldestOffset(), not vice versa.

> > > > -SlruScanDirCbFindEarliest(SlruCtl ctl, char *filename, int segpage, void *data)
> > > 
> > > That really seems like an independent change, deserving its own commit +
> > > explanation.
> > 
> > Indeed.  I explained that change at length in
> > http://www.postgresql.org/message-id/20151108085407.GA1097830@tornado.leadboat.com,
> > including that it's alone on a branch (mxt1-disk-independent), to become its
> > own PostgreSQL commit.
> 
> The comment there doesn't include the explanation...

If you visit that URL, everything from "If anything here requires careful
study, it's the small mxt1-disk-independent change, which ..." to the end of
the message is my explanation of this change.  What else would you like to
know about it?

> > > > [branch commit 89a7232]
> > > 
> > > I don't think that's really a good idea - this way we restart after
> > > every single page write, whereas currently we only restart after passing
> > > through all pages once. In nearly all cases we'll only ever have to
> > > retry once in total, be because such old pages aren't usually going to
> > > be reread/redirtied.
> > 
> > Your improvement sounds fine, then.  Would both SimpleLruTruncate() and
> > SlruDeleteSegment() benefit from it?
> 
> It probably makes sense to do it in SimpleLruTruncate too - but it does
> additional checks as part of the restarts which aren't applicable for
> DeleteSegment(), which is IIRC why I didn't also change it.

Understood.  There's no rule that these two functions must look as similar as
possible, so I will undo 89a7232.

Thanks,
nm



Re: Rework the way multixact truncations work

From
Andres Freund
Date:
On 2015-12-04 21:55:29 -0500, Noah Misch wrote:
> On Thu, Dec 03, 2015 at 07:03:21PM +0100, Andres Freund wrote:
> > Sorry, but I really just want to see these changes as iterative patches
> > ontop of 9.5/HEAD instead of this process. I won't revert the reversion
> > if you push it anyway, but I think it's a rather bad idea.
> 
> I hear you.

Not just me.

> I evaluated your request and judged that the benefits you cited
> did not make up for the losses I cited.  Should you wish to change my mind,
> your best bet is to find defects in the commits I proposed.  If I introduced
> juicy defects, that discovery would lend much weight to your conjectures.

I've absolutely no interest in "proving you wrong". And my desire to
review patches that are in a, in my opinion, barely reviewable format is
pretty low as well.

Andres



Re: Rework the way multixact truncations work

From
Robert Haas
Date:
On Tue, Dec 8, 2015 at 6:43 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2015-12-04 21:55:29 -0500, Noah Misch wrote:
>> On Thu, Dec 03, 2015 at 07:03:21PM +0100, Andres Freund wrote:
>> > Sorry, but I really just want to see these changes as iterative patches
>> > ontop of 9.5/HEAD instead of this process. I won't revert the reversion
>> > if you push it anyway, but I think it's a rather bad idea.
>>
>> I hear you.
>
> Not just me.
>
>> I evaluated your request and judged that the benefits you cited
>> did not make up for the losses I cited.  Should you wish to change my mind,
>> your best bet is to find defects in the commits I proposed.  If I introduced
>> juicy defects, that discovery would lend much weight to your conjectures.
>
> I've absolutely no interest in "proving you wrong". And my desire to
> review patches that are in a, in my opinion, barely reviewable format is
> pretty low as well.

I agree.  Noah, it seems to me that you are offering a novel theory of
how patches should be submitted, reviewed, and committed, but you've
got three people, two of them committers, telling you that we don't
like that approach.  I seriously doubt you're going to find anyone who
does.  When stuff gets committed to the tree, people want to to be
able to answer the question "what has just now changed?" and it is
indisputable that what you want to do here will make that harder.
That's not a one-time problem for Andres during the course of review;
that is a problem for every single person who looks at the commit
history from now until the end of time.  I don't think you have the
right to force your proposed approach through in the face of concerted
opposition.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Rework the way multixact truncations work

From
Noah Misch
Date:
On Tue, Dec 08, 2015 at 01:05:03PM -0500, Robert Haas wrote:
> On Tue, Dec 8, 2015 at 6:43 AM, Andres Freund <andres@anarazel.de> wrote:
> > On 2015-12-04 21:55:29 -0500, Noah Misch wrote:
> >> On Thu, Dec 03, 2015 at 07:03:21PM +0100, Andres Freund wrote:
> >> > Sorry, but I really just want to see these changes as iterative patches
> >> > ontop of 9.5/HEAD instead of this process. I won't revert the reversion
> >> > if you push it anyway, but I think it's a rather bad idea.
> >>
> >> I hear you.
> >
> > Not just me.
> >
> >> I evaluated your request and judged that the benefits you cited
> >> did not make up for the losses I cited.  Should you wish to change my mind,
> >> your best bet is to find defects in the commits I proposed.  If I introduced
> >> juicy defects, that discovery would lend much weight to your conjectures.
> >
> > I've absolutely no interest in "proving you wrong". And my desire to
> > review patches that are in a, in my opinion, barely reviewable format is
> > pretty low as well.
> 
> I agree.  Noah, it seems to me that you are offering a novel theory of
> how patches should be submitted, reviewed, and committed, but you've
> got three people, two of them committers, telling you that we don't
> like that approach.  I seriously doubt you're going to find anyone who
> does.

Andres writing the patch that became commit 4f627f8 and you reviewing it were
gifts to Alvaro and to the community.  Aware of that, I have avoided[1] saying
that I was shocked to see that commit's defects.  Despite a committer-author
and _two_ committer reviewers, the patch was rife with wrong new comments,
omitted updates to comments it caused to become wrong, and unsolicited
whitespace churn.  (Anyone could have missed the data loss bug, but these
collectively leap off the page.)  This in beleaguered code critical to data
integrity.  You call this thread's latest code a patch submission, but I call
it bandaging the tree after a recent commit that never should have reached the
tree.  Hey, if you'd like me to post the traditional patch files, that's easy.
It would have been easier for me.  I posted branches because it gives more
metadata to guide review.  As for the choice to revert and redo ...

> When stuff gets committed to the tree, people want to to be
> able to answer the question "what has just now changed?" and it is
> indisputable that what you want to do here will make that harder.

I hope those who have not already read commit 4f627f8 will not waste time
reading it.  They should instead ignore multixact changes from commit 4f627f8
through its revert.  The 2015-09-26 commits have not appeared in a supported
release, and no other work has built upon them.  They have no tenure.  (I am
glad you talked the author out of back-patching; otherwise, 9.4.5 and 9.3.10
would have introduced a data loss bug.)  Nobody reported a single defect
before my review overturned half the patch.  A revert will indeed impose on
those who invested time to understand commit 4f627f8, but the silence about
its defects suggests the people understanding it number either zero or one.
Even as its author and reviewers, you would do better to set aside what you
thought you knew about this code.

> That's not a one-time problem for Andres during the course of review;
> that is a problem for every single person who looks at the commit
> history from now until the end of time.

It's a service to future readers that no line of "git blame master <...>" will
refer to a 2015-09-26 multixact commit.  Blame reports will instead refer to
replacement commits designed to be meaningful for study in isolation.  If I
instead structured the repairs as you ask, the blame would find one of 4f627f8
or its various repair commits, none of which would be a self-contained unit of
development.  What's to enjoy about discovering that history?

> I don't think you have the
> right to force your proposed approach through in the face of concerted
> opposition.

That's correct; I do not have that right.  Your objection still worries me.

nm

[1] http://www.postgresql.org/message-id/20151029065903.GC770464@tornado.leadboat.com



Re: Rework the way multixact truncations work

From
Andres Freund
Date:
On 2015-12-09 09:43:19 -0500, Noah Misch wrote:
> Aware of that, I have avoided[1] saying that I was shocked to see that
> commit's defects.  Despite a committer-author and _two_ committer
> reviewers, the patch was rife with wrong new comments, omitted updates
> to comments it caused to become wrong,

It's not like that patch wasn't posted for review for months...


> and unsolicited whitespace churn.

Whitespace churn? The commit includes a pgindent run, because Alvaro
asked me to do that, but that just affected a handful of lines. If you
mean the variable ordering: given several variables were renamed anyway,
additionally putting them in a easier to understand order, seems rather
painless. If you mean 'pgindent immune' long lines - multixact.c is far
from the only one with those, and they're prett harmless.


> You call this thread's latest code a patch
> submission, but I call it bandaging the tree after a recent commit
> that never should have reached the tree.

Oh, for christs sake.


> Hey, if you'd like me to
> post the traditional patch files, that's easy.  It would have been
> easier for me.

You've been asked that, repeatedly. At least if you take 'traditional
patch files' to include traditional, iterative, patches ontop of the
current tree.


> I hope those who have not already read commit 4f627f8 will not waste time
> reading it.

We have to, who knows what's hiding in there. Your git log even shows
that you had conflicts in your approach (83cb04 Conflicts:
src/backend/access/transam/multixact.c).


> They should instead ignore multixact changes from commit 4f627f8
> through its revert.  The 2015-09-26 commits have not appeared in a supported
> release, and no other work has built upon them.

> They have no tenure.

Man.


> (I am glad you talked the author out of back-patching; otherwise,
> 9.4.5 and 9.3.10 would have introduced a data loss bug.)

Isn't that a bug in a, as far as we know, impossible scenario? Unless I
miss something there's no known case where it's "expected" that
find_multixact_start() fails after initially succeeding? Sure, it sucks
that the bug survived review and that it was written in the first
place. But it not showing up during testing isn't meaningful, given it's
a should-never-happen scenario.

I'm actually kinda inclined to rip out the whole "previous pass" logic
out alltogether, and replace it with a PANIC. It's a hard to test,
should never happen, scenario. If it happens, things have already
seriously gone sour.

> > That's not a one-time problem for Andres during the course of review;
> > that is a problem for every single person who looks at the commit
> > history from now until the end of time.
> 
> It's a service to future readers that no line of "git blame master <...>" will
> refer to a 2015-09-26 multixact commit.

And a disservice for everyone doing git log, or git blame for
intermediate states of the tree. The benefit for git blame, are almost
nonexistant, not seing a couple newlines changed, or not seing some
intermediate commits isn't really important.


> Blame reports will instead refer to
> replacement commits designed to be meaningful for study in isolation.  If I
> instead structured the repairs as you ask, the blame would find one of 4f627f8
> or its various repair commits, none of which would be a self-contained unit of
> development.

So what? That's how development in general works. And how it actually
happened in this specific case.



Re: Rework the way multixact truncations work

From
Robert Haas
Date:
On Wed, Dec 9, 2015 at 9:43 AM, Noah Misch <noah@leadboat.com> wrote:
> Andres writing the patch that became commit 4f627f8 and you reviewing it were
> gifts to Alvaro and to the community.  Aware of that, I have avoided[1] saying
> that I was shocked to see that commit's defects.  Despite a committer-author
> and _two_ committer reviewers, the patch was rife with wrong new comments,
> omitted updates to comments it caused to become wrong, and unsolicited
> whitespace churn.  (Anyone could have missed the data loss bug, but these
> collectively leap off the page.)  This in beleaguered code critical to data
> integrity.  You call this thread's latest code a patch submission, but I call
> it bandaging the tree after a recent commit that never should have reached the
> tree.  Hey, if you'd like me to post the traditional patch files, that's easy.
> It would have been easier for me.  I posted branches because it gives more
> metadata to guide review.  As for the choice to revert and redo ...

Yes, I'd like patch files, one per topic.

I wasn't very happy with the way that patch it was written; it seemed
to me that it touched too much code and move a lot of things around
unnecessarily, and I said so at the time.  I would have preferred
something more incremental, and I asked for it and didn't get it.
Well, I'm not giving up: I'm asking for the same thing here.  I didn't
think it was a good idea for Andres to rearrange that much code in a
single commit, because it was hard to review, and I don't think it's a
good idea for you to do it, either.  To the extent that you found
bugs, I think that proves the point that large commits are hard to
review and small commits that change things just a little bit at a
time are the way to go.

> I hope those who have not already read commit 4f627f8 will not waste time
> reading it.  They should instead ignore multixact changes from commit 4f627f8
> through its revert.  The 2015-09-26 commits have not appeared in a supported
> release, and no other work has built upon them.  They have no tenure.  (I am
> glad you talked the author out of back-patching; otherwise, 9.4.5 and 9.3.10
> would have introduced a data loss bug.)  Nobody reported a single defect
> before my review overturned half the patch.  A revert will indeed impose on
> those who invested time to understand commit 4f627f8, but the silence about
> its defects suggests the people understanding it number either zero or one.
> Even as its author and reviewers, you would do better to set aside what you
> thought you knew about this code.

I just don't find this a realistic model of how people use the git
log.  Maybe you use it this way; I don't.  I don't *want* git blame to
make it seem as if 4f627f8 is not part of the history.  For better or
worse, it is.  Ripping it out and replacing it monolithically will not
change that; it will only make the detailed history harder to
reconstruct, and I *will* want to reconstruct it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Rework the way multixact truncations work

From
Robert Haas
Date:
On Wed, Dec 9, 2015 at 10:41 AM, Andres Freund <andres@anarazel.de> wrote:
>> (I am glad you talked the author out of back-patching; otherwise,
>> 9.4.5 and 9.3.10 would have introduced a data loss bug.)
>
> Isn't that a bug in a, as far as we know, impossible scenario? Unless I
> miss something there's no known case where it's "expected" that
> find_multixact_start() fails after initially succeeding? Sure, it sucks
> that the bug survived review and that it was written in the first
> place. But it not showing up during testing isn't meaningful, given it's
> a should-never-happen scenario.

If I correctly understand the scenario that you are describing, that
does happen - not for the same MXID, but for different ones.  At least
the last time I checked, and I'm not sure if we've fixed this, it
could happen because the SLRU page that contains the multixact wasn't
flushed out of the SLRU buffers yet.  But apart from that, it could
happen any time there's a gap in the sequence of files, and that sure
doesn't seem like a can't-happen situation.  We know that, on 9.3,
there's definitely a sequence of events that leads to a 0000 file
followed by a gap followed by the series of files that are still live.
Given the number of other bugs we've fixed in this area, I would not
like to bet on that being the only scenario where this crops up.  It
*shouldn't* happen, and as far as we know, if you start and end on a
version newer than 4f627f8 and aa29c1c, it won't.  Older branches,
though, I wouldn't like to bet on.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Rework the way multixact truncations work

From
Andres Freund
Date:
On 2015-12-09 11:18:39 -0500, Robert Haas wrote:
> If I correctly understand the scenario that you are describing, that
> does happen - not for the same MXID, but for different ones.  At least
> the last time I checked, and I'm not sure if we've fixed this, it
> could happen because the SLRU page that contains the multixact wasn't
> flushed out of the SLRU buffers yet.

That should be fixed, with the brute force solution of flushing buffers
before searching for files on disk.

> But apart from that, it could
> happen any time there's a gap in the sequence of files, and that sure
> doesn't seem like a can't-happen situation.  We know that, on 9.3,
> there's definitely a sequence of events that leads to a 0000 file
> followed by a gap followed by the series of files that are still live.
> Given the number of other bugs we've fixed in this area, I would not
> like to bet on that being the only scenario where this crops up.  It
> *shouldn't* happen, and as far as we know, if you start and end on a
> version newer than 4f627f8 and aa29c1c, it won't.  Older branches,
> though, I wouldn't like to bet on.

Ok, fair enough.

andres



Re: Rework the way multixact truncations work

From
Noah Misch
Date:
On Wed, Dec 09, 2015 at 11:08:32AM -0500, Robert Haas wrote:
> On Wed, Dec 9, 2015 at 9:43 AM, Noah Misch <noah@leadboat.com> wrote:
> > I hope those who have not already read commit 4f627f8 will not waste time
> > reading it.  They should instead ignore multixact changes from commit 4f627f8
> > through its revert.  The 2015-09-26 commits have not appeared in a supported
> > release, and no other work has built upon them.  They have no tenure.  (I am
> > glad you talked the author out of back-patching; otherwise, 9.4.5 and 9.3.10
> > would have introduced a data loss bug.)  Nobody reported a single defect
> > before my review overturned half the patch.  A revert will indeed impose on
> > those who invested time to understand commit 4f627f8, but the silence about
> > its defects suggests the people understanding it number either zero or one.
> > Even as its author and reviewers, you would do better to set aside what you
> > thought you knew about this code.
> 
> I just don't find this a realistic model of how people use the git
> log.  Maybe you use it this way; I don't.  I don't *want* git blame to
> make it seem as if 4f627f8 is not part of the history.  For better or
> worse, it is.

I would like to understand how you use git, then.  What's one of your models
of using "git log" and/or "git blame" in which you foresee the revert making
history less clear, not more clear?

By the way, it occurs to me that I should also make pg_upgrade blacklist the
range of catversions that might have data loss.  No sense in putting ourselves
in the position of asking whether data files of a 9.9.3 cluster spent time in
a 9.5beta2 cluster.

> Ripping it out and replacing it monolithically will not
> change that; it will only make the detailed history harder to
> reconstruct, and I *will* want to reconstruct it.

What's something that might happen six months from now and lead you to inspect
master or 9.5 multixact.c between 4f627f8 and its revert?



Re: Rework the way multixact truncations work

From
Andres Freund
Date:
On 2015-12-09 20:23:06 -0500, Noah Misch wrote:
> By the way, it occurs to me that I should also make pg_upgrade blacklist the
> range of catversions that might have data loss.  No sense in putting ourselves
> in the position of asking whether data files of a 9.9.3 cluster spent time in
> a 9.5beta2 cluster.

I can't see any benefit in that. We're talking about a bug that, afaics,
needs another unknown bug to trigger (so find_multixact_start() fails),
and then very likely needs significant amounts of new multixacts
consumed, without a restart and without find_multixact_start()
succeeding later.


What I think would actually help for questions like this, is to add, as
discussed in some other threads, the following:
1) 'creating version' to pg_control
2) 'creating version' to each pg_class entry
3) 'last relation rewrite in version' to each pg_class entry
4) 'last full vacuum in version' to each pg_class entry

I think for this purpose 'version' should be something akin to
$catversion||$numericversion (int64 probably?) - that way development
branches and release branches are handled somewhat usefully.

I think that'd be useful, both from an investigatory perspective, as
from a tooling perspective, because it'd allow reusing things like hint
bits.


> > Ripping it out and replacing it monolithically will not
> > change that; it will only make the detailed history harder to
> > reconstruct, and I *will* want to reconstruct it.
> 
> What's something that might happen six months from now and lead you to inspect
> master or 9.5 multixact.c between 4f627f8 and its revert?

"Hey, what has happened to multixact.c lately? I'm investigating a bug,
and I wonder if it already has been fixed?", "Uh, what was the problem
with that earlier large commit?", "Hey, what has changed between beta2
and the final release?"...



Re: Rework the way multixact truncations work

From
Peter Geoghegan
Date:
On Thu, Dec 10, 2015 at 12:34 AM, Andres Freund <andres@anarazel.de> wrote:
>> > Ripping it out and replacing it monolithically will not
>> > change that; it will only make the detailed history harder to
>> > reconstruct, and I *will* want to reconstruct it.
>>
>> What's something that might happen six months from now and lead you to inspect
>> master or 9.5 multixact.c between 4f627f8 and its revert?
>
> "Hey, what has happened to multixact.c lately? I'm investigating a bug,
> and I wonder if it already has been fixed?", "Uh, what was the problem
> with that earlier large commit?", "Hey, what has changed between beta2
> and the final release?"...

Quite.

I can't believe we're still having this silly discussion. Can we please move on?

-- 
Peter Geoghegan



Re: Rework the way multixact truncations work

From
Bert
Date:
<div dir="ltr">+1<br /></div><div class="gmail_extra"><br /><div class="gmail_quote">On Thu, Dec 10, 2015 at 9:58 AM,
PeterGeoghegan <span dir="ltr"><<a href="mailto:pg@heroku.com" target="_blank">pg@heroku.com</a>></span>
wrote:<br/><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span
class="">OnThu, Dec 10, 2015 at 12:34 AM, Andres Freund <<a
href="mailto:andres@anarazel.de">andres@anarazel.de</a>>wrote:<br /> >> > Ripping it out and replacing it
monolithicallywill not<br /> >> > change that; it will only make the detailed history harder to<br /> >>
>reconstruct, and I *will* want to reconstruct it.<br /> >><br /> >> What's something that might happen
sixmonths from now and lead you to inspect<br /> >> master or 9.5 multixact.c between 4f627f8 and its revert?<br
/>><br /> > "Hey, what has happened to multixact.c lately? I'm investigating a bug,<br /> > and I wonder if it
alreadyhas been fixed?", "Uh, what was the problem<br /> > with that earlier large commit?", "Hey, what has changed
betweenbeta2<br /> > and the final release?"...<br /><br /></span>Quite.<br /><br /> I can't believe we're still
havingthis silly discussion. Can we please move on?<br /><span class="HOEnZb"><font color="#888888"><br /> --<br />
PeterGeoghegan<br /></font></span><div class="HOEnZb"><div class="h5"><br /><br /> --<br /> Sent via pgsql-hackers
mailinglist (<a href="mailto:pgsql-hackers@postgresql.org">pgsql-hackers@postgresql.org</a>)<br /> To make changes to
yoursubscription:<br /><a href="http://www.postgresql.org/mailpref/pgsql-hackers" rel="noreferrer"
target="_blank">http://www.postgresql.org/mailpref/pgsql-hackers</a><br/></div></div></blockquote></div><br /><br
clear="all"/><br />-- <br /><div class="gmail_signature">Bert Desmet<br />0477/305361</div></div> 

Re: Rework the way multixact truncations work

From
Robert Haas
Date:
On Wed, Dec 9, 2015 at 8:23 PM, Noah Misch <noah@leadboat.com> wrote:
> On Wed, Dec 09, 2015 at 11:08:32AM -0500, Robert Haas wrote:
>> On Wed, Dec 9, 2015 at 9:43 AM, Noah Misch <noah@leadboat.com> wrote:
>> > I hope those who have not already read commit 4f627f8 will not waste time
>> > reading it.  They should instead ignore multixact changes from commit 4f627f8
>> > through its revert.  The 2015-09-26 commits have not appeared in a supported
>> > release, and no other work has built upon them.  They have no tenure.  (I am
>> > glad you talked the author out of back-patching; otherwise, 9.4.5 and 9.3.10
>> > would have introduced a data loss bug.)  Nobody reported a single defect
>> > before my review overturned half the patch.  A revert will indeed impose on
>> > those who invested time to understand commit 4f627f8, but the silence about
>> > its defects suggests the people understanding it number either zero or one.
>> > Even as its author and reviewers, you would do better to set aside what you
>> > thought you knew about this code.
>>
>> I just don't find this a realistic model of how people use the git
>> log.  Maybe you use it this way; I don't.  I don't *want* git blame to
>> make it seem as if 4f627f8 is not part of the history.  For better or
>> worse, it is.
>
> I would like to understand how you use git, then.  What's one of your models
> of using "git log" and/or "git blame" in which you foresee the revert making
> history less clear, not more clear?

Well, suppose I wanted to know what bugs were fixed between 9.5beta
and 9.5.0, for example.  I mean, I'm going to run git log
src/backend/access/transam/multixact.c ... and the existing commits
are going to be there.

> By the way, it occurs to me that I should also make pg_upgrade blacklist the
> range of catversions that might have data loss.  No sense in putting ourselves
> in the position of asking whether data files of a 9.9.3 cluster spent time in
> a 9.5beta2 cluster.

Maybe.  But I think we could use a little more vigorous discussion of
that issue, since Andres doesn't seem to be very convinced by your
analysis, and I don't currently understand what you've fixed because I
can't, as mentioned several times, follow your patch stack.

>> Ripping it out and replacing it monolithically will not
>> change that; it will only make the detailed history harder to
>> reconstruct, and I *will* want to reconstruct it.
>
> What's something that might happen six months from now and lead you to inspect
> master or 9.5 multixact.c between 4f627f8 and its revert?

I don't know have anything to add to what others have said in response
to this point, except this: the whole point of using a source code
management system is to tell you what changed and when.  What you are
proposing to do makes it unusable for that purpose.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Rework the way multixact truncations work

From
Andres Freund
Date:
On 2015-12-10 08:55:54 -0500, Robert Haas wrote:
> Maybe.  But I think we could use a little more vigorous discussion of
> that issue, since Andres doesn't seem to be very convinced by your
> analysis, and I don't currently understand what you've fixed because I
> can't, as mentioned several times, follow your patch stack.

The issue at hand is that the following block    oldestOffsetKnown =        find_multixact_start(oldestMultiXactId,
&oldestOffset);

...else if (prevOldestOffsetKnown){    /*     * If we failed to get the oldest offset this time, but we have a     *
valuefrom a previous pass through this function, use the old value     * rather than automatically forcing it.     */
oldestOffset = prevOldestOffset;    oldestOffsetKnown = true;}
 
in SetOffsetVacuumLimit() fails to restore offsetStopLimit, which then
is set in shared memory:/* Install the computed values */LWLockAcquire(MultiXactGenLock,
LW_EXCLUSIVE);MultiXactState->oldestOffset= oldestOffset;MultiXactState->oldestOffsetKnown =
oldestOffsetKnown;MultiXactState->offsetStopLimit= offsetStopLimit;LWLockRelease(MultiXactGenLock);
 

so, if find_multixact_start() failed - a "should never happen" occurance
- we install a wrong stop limit. It does get 'repaired' upon the next
suceeding find_multixact_start() in SetOffsetVacuumLimit() or a restart
though.

Adding a 'prevOffsetStopLimit' and using it seems like a ~5 line patch.



Re: Rework the way multixact truncations work

From
Robert Haas
Date:
On Thu, Dec 10, 2015 at 9:04 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2015-12-10 08:55:54 -0500, Robert Haas wrote:
>> Maybe.  But I think we could use a little more vigorous discussion of
>> that issue, since Andres doesn't seem to be very convinced by your
>> analysis, and I don't currently understand what you've fixed because I
>> can't, as mentioned several times, follow your patch stack.
>
> The issue at hand is that the following block
>                 oldestOffsetKnown =
>                         find_multixact_start(oldestMultiXactId, &oldestOffset);
>
> ...
>         else if (prevOldestOffsetKnown)
>         {
>                 /*
>                  * If we failed to get the oldest offset this time, but we have a
>                  * value from a previous pass through this function, use the old value
>                  * rather than automatically forcing it.
>                  */
>                 oldestOffset = prevOldestOffset;
>                 oldestOffsetKnown = true;
>         }
> in SetOffsetVacuumLimit() fails to restore offsetStopLimit, which then
> is set in shared memory:
>         /* Install the computed values */
>         LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
>         MultiXactState->oldestOffset = oldestOffset;
>         MultiXactState->oldestOffsetKnown = oldestOffsetKnown;
>         MultiXactState->offsetStopLimit = offsetStopLimit;
>         LWLockRelease(MultiXactGenLock);
>
> so, if find_multixact_start() failed - a "should never happen" occurance
> - we install a wrong stop limit. It does get 'repaired' upon the next
> suceeding find_multixact_start() in SetOffsetVacuumLimit() or a restart
> though.
>
> Adding a 'prevOffsetStopLimit' and using it seems like a ~5 line patch.

So let's do that, then.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Rework the way multixact truncations work

From
Robert Haas
Date:
On Thu, Dec 10, 2015 at 9:32 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Dec 10, 2015 at 9:04 AM, Andres Freund <andres@anarazel.de> wrote:
>> On 2015-12-10 08:55:54 -0500, Robert Haas wrote:
>>> Maybe.  But I think we could use a little more vigorous discussion of
>>> that issue, since Andres doesn't seem to be very convinced by your
>>> analysis, and I don't currently understand what you've fixed because I
>>> can't, as mentioned several times, follow your patch stack.
>>
>> The issue at hand is that the following block
>>                 oldestOffsetKnown =
>>                         find_multixact_start(oldestMultiXactId, &oldestOffset);
>>
>> ...
>>         else if (prevOldestOffsetKnown)
>>         {
>>                 /*
>>                  * If we failed to get the oldest offset this time, but we have a
>>                  * value from a previous pass through this function, use the old value
>>                  * rather than automatically forcing it.
>>                  */
>>                 oldestOffset = prevOldestOffset;
>>                 oldestOffsetKnown = true;
>>         }
>> in SetOffsetVacuumLimit() fails to restore offsetStopLimit, which then
>> is set in shared memory:
>>         /* Install the computed values */
>>         LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
>>         MultiXactState->oldestOffset = oldestOffset;
>>         MultiXactState->oldestOffsetKnown = oldestOffsetKnown;
>>         MultiXactState->offsetStopLimit = offsetStopLimit;
>>         LWLockRelease(MultiXactGenLock);
>>
>> so, if find_multixact_start() failed - a "should never happen" occurance
>> - we install a wrong stop limit. It does get 'repaired' upon the next
>> suceeding find_multixact_start() in SetOffsetVacuumLimit() or a restart
>> though.
>>
>> Adding a 'prevOffsetStopLimit' and using it seems like a ~5 line patch.
>
> So let's do that, then.

Who is going to take care of this?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Rework the way multixact truncations work

From
Noah Misch
Date:
On Thu, Dec 10, 2015 at 08:55:54AM -0500, Robert Haas wrote:
> I don't know have anything to add to what others have said in response
> to this point, except this: the whole point of using a source code
> management system is to tell you what changed and when.  What you are
> proposing to do makes it unusable for that purpose.

Based on your comments, I'm calling the patch series returned with feedback.
I built the series around the goal of making history maximally reviewable for
persons not insiders to commit 4f627f8.  Having spent 90% of my 2015
PostgreSQL contribution time finding or fixing committed defects, my judgment
of how best to achieve that is no shout from the peanut gallery.  (Neither is
your judgment.)  In particular, I had in view two works, RLS and pg_audit,
that used the post-commit repair strategy you've advocated.  But you gave me a
fair chance to make the case, and you stayed convinced that my repairs oppose
my goal.  I can now follow your development of that belief, which is enough.



Re: Rework the way multixact truncations work

From
Andres Freund
Date:
Noah, Robert, All

On 2015-12-11 11:20:21 -0500, Robert Haas wrote:
> On Thu, Dec 10, 2015 at 9:32 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> >> Adding a 'prevOffsetStopLimit' and using it seems like a ~5 line patch.
> >
> > So let's do that, then.
>
> Who is going to take care of this?

Here's two patches:

1) The fix to SetOffsetVacuumLimit().

   I've tested this by introducing a probabilistic "return false;" to
   find_multixact_start(), and torturing postgres by burning through
   billions of multixactids of various sizes.  Behaves about as
   bad^Wgood as without the induced failures; before the patch there
   were moments of spurious warnings/errors when ->offsetStopLimit was
   out of whack.

2) A subset of the comment changes from Noah's repository. Some of the
   comment changes didn't make sense without the removal
   SlruScanDirCbFindEarliest(), a small number of other changes I couldn't
   fully agree with.

   Noah, are you ok with pushing that subset of your changes? Is
   "Slightly edited subset of a larger patchset by Noah Misch" an OK
   attribution?


Noah, on a first glance I think e50cca0ae ("Remove the
SlruScanDirCbFindEarliest() test from TruncateMultiXact().") is a good
idea. So I do encourage you to submit that as a separate patch.

Regards,

Andres

Attachment

Re: Rework the way multixact truncations work

From
Robert Haas
Date:
On Sat, Dec 12, 2015 at 12:02 PM, Andres Freund <andres@anarazel.de> wrote:
> Noah, Robert, All
>
> On 2015-12-11 11:20:21 -0500, Robert Haas wrote:
>> On Thu, Dec 10, 2015 at 9:32 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> >> Adding a 'prevOffsetStopLimit' and using it seems like a ~5 line patch.
>> >
>> > So let's do that, then.
>>
>> Who is going to take care of this?
>
> Here's two patches:
>
> 1) The fix to SetOffsetVacuumLimit().
>
>    I've tested this by introducing a probabilistic "return false;" to
>    find_multixact_start(), and torturing postgres by burning through
>    billions of multixactids of various sizes.  Behaves about as
>    bad^Wgood as without the induced failures; before the patch there
>    were moments of spurious warnings/errors when ->offsetStopLimit was
>    out of whack.

I find the commit message you wrote a little difficult to read, and
propose the following version instead, which reads better to me:

Previously, if find_multixact_start() failed, SetOffsetVacuumLimit()
would install 0 into MultiXactState->offsetStopLimit.  Luckily, there
are no known cases where find_multixact_start() will return an error
in 9.5 and above. But if it were to happen, for example due to
filesystem permission issues, it'd be somewhat bad:
GetNewMultiXactId() could continue allocating mxids even if close to a
wraparound, or it could erroneously stop allocating mxids, even if no
wraparound is looming.  The wrong value would be corrected the next
time SetOffsetVacuumLimit() is called, or by a restart.

(I have no comments on the substance of either patch and have reviewed
the first one to a negligible degree - it doesn't look obviously wrong
- and the second one not at all.)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company