Thread: Heads-Up: multixact freezing bug

Heads-Up: multixact freezing bug

From: Andres Freund
Date: 2013-11-28 16:28:53 +0100
Hello,

Investigating corruption in a client's database we unfortunately found
another data corrupting bug that's relevant for 9.3+:

Since 9.3, heap_tuple_needs_freeze() and heap_freeze_tuple() don't
correctly handle the xids contained in a multixact: they check plain
xids and multixact ids separately against their respective cutoffs, but
the xids contained inside a multixact are never checked.
That doesn't have too bad consequences for multixacts that only lock,
but it can lead to errors like:
ERROR:  could not access status of transaction 3883960912
DETAIL:  Could not open file "pg_clog/0E78": No such file or directory.
when accessing a tuple. That's because the update xid contained in the
multixact is lower than the global datfrozenxid we've truncated the clog
to.
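
To make that concrete, the shape of the xmax handling in the freeze path
is roughly the following. This is an illustrative sketch, not the actual
heap_freeze_tuple() source; cutoff_xid, cutoff_multi and freeze_xmax are
just stand-ins for the corresponding variables:

    TransactionId xid = HeapTupleHeaderGetRawXmax(tuple);

    if (tuple->t_infomask & HEAP_XMAX_IS_MULTI)
    {
        /* only the multixact id itself is compared against its cutoff */
        if (MultiXactIdPrecedes(xid, cutoff_multi))
            freeze_xmax = true;

        /*
         * A multi >= cutoff_multi survives untouched, even if one of its
         * member xids (e.g. the update xid) is older than cutoff_xid and
         * its pg_clog segment has already been truncated away.
         */
    }
    else if (TransactionIdIsNormal(xid) &&
             TransactionIdPrecedes(xid, cutoff_xid))
        freeze_xmax = true;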

Unfortunately that scenario isn't too unlikely: we use
vacuum_freeze_min_age as the basis for both the xid and the mxid
freezing cutoffs. Since in many cases multis are generated at a lower
rate than xids, we often will not have frozen away all mxids containing
xids lower than the xid cutoff.
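
To illustrate with made-up numbers (plain arithmetic, nothing
PostgreSQL-specific; the workload figures are invented):

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        /* hypothetical workload: roughly 25 xids consumed per multixact */
        const uint32_t freeze_min_age = 50000000;    /* vacuum_freeze_min_age default */
        const uint32_t next_xid       = 2000000000;
        const uint32_t next_multi     = 80000000;

        /* both cutoffs sit freeze_min_age "units" back, but the units differ */
        uint32_t xid_cutoff   = next_xid - freeze_min_age;     /* 1950000000 */
        uint32_t multi_cutoff = next_multi - freeze_min_age;   /*   30000000 */

        /*
         * A multi created when the xid counter was around 1 billion has an
         * id of roughly 40000000: it is above multi_cutoff, so freezing
         * leaves it alone, yet its member xids sit far below xid_cutoff -
         * and their clog may already have been truncated away.
         */
        printf("xid cutoff = %u, multi cutoff = %u\n", xid_cutoff, multi_cutoff);
        return 0;
    }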

To recover the data, we're lucky that HeapTupleSatisfiesVacuum() sets
the XMAX_INVALID hint bit if an updating multixact isn't running
anymore. So assuming that a contained update xid outside
ShmemVariableCache's [oldestXid, nextXid) range committed will often not
cause rows to spuriously disappear.
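
For a recovery tool that heuristic boils down to roughly this decision
(a sketch only, not code you'd put into the server; update_xid, oldestXid
and nextXid stand for the values described above):

    if (tup->t_infomask & HEAP_XMAX_INVALID)
    {
        /* the updater is already known not to have committed: keep the row */
    }
    else if (TransactionIdPrecedes(update_xid, oldestXid) ||
             !TransactionIdPrecedes(update_xid, nextXid))
    {
        /* update xid outside [oldestXid, nextXid): assume it committed */
    }
    else
    {
        /* otherwise resolve it via pg_clog as usual */
    }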

I am working on a fix for the issue, but it's noticeably less simple
than I initially thought. With the current WAL format the freezing logic
needs to work the same during normal processing and recovery.

My current thoughts are that we need to check whether any member of a
multixact needs freezing. If we find one, we do MultiXactIdIsRunning() &&
MultiXactIdWait() if !InRecovery. That's pretty unlikely to be necessary,
but afaics we cannot guarantee it is not.
During recovery we do *not* need to do so, since the primary will have
performed all necessary waits.
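
In rough pseudo-C - argument lists from memory and abbreviated, so don't
take the signatures literally - the control flow I have in mind is:

    if (tuple->t_infomask & HEAP_XMAX_IS_MULTI)
    {
        MultiXactId      multi = HeapTupleHeaderGetRawXmax(tuple);
        MultiXactMember *members;
        int              nmembers;
        int              i;

        nmembers = GetMultiXactIdMembers(multi, &members, false);

        for (i = 0; i < nmembers; i++)
        {
            if (TransactionIdPrecedes(members[i].xid, cutoff_xid))
            {
                /*
                 * Some member needs freezing. Outside recovery we might
                 * have to wait for it; during recovery the primary has
                 * already performed any necessary waits.
                 */
                if (!InRecovery && MultiXactIdIsRunning(multi))
                    MultiXactIdWait(multi /* , ... */);

                /* ... then actually freeze/rewrite the xmax ... */
                break;
            }
        }
    }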

The big problem with that solution is that we need to do a
GetMultiXactIdMembers() during crash recovery, which is pretty damn
ugly. But I *think* - and that's where I'd really like some input - that
given the way multixact WAL logging works, it should be safe.

I am not really sure what to do about this. It's quite likely to cause
corruption, but the next point release is coming up way too fast for a
nontrivial fix.

Thoughts? Better ideas?

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Heads-Up: multixact freezing bug

From: Andres Freund
On 2013-11-28 16:28:53 +0100, Andres Freund wrote:
> My current thoughts are that we need to check whether any member of a
> multixact needs freezing. If we find one, we do MultiXactIdIsRunning() &&
> MultiXactIdWait() if !InRecovery. That's pretty unlikely to be necessary,
> but afaics we cannot guarantee it is not.

> The big problem with that solution is that we need to do a
> GetMultiXactIdMembers() during crash recovery, which is pretty damn
> ugly. But I *think* - and that's where I'd really like some input - that
> given the way multixact WAL logging works, it should be safe.

Bashing my head against the wall some more, I think there's a much better fix:

Instead of calculating the multixact cutoff by using the global
minimum of OldestMemberMXactId[] and OldestVisibleMXactId[] and then
subtracting vacuum_freeze_min_age, compute it solely as the minimum of
OldestMemberMXactId[]. If we do that computation *after* doing the
GetOldestXmin() in vacuum_set_xid_limits(), we can be sure no mxact above
the new mxact cutoff will contain an xid below the xid cutoff; any such
xid would otherwise have been reported as running by GetOldestXmin().
With that change we can leave heap_tuple_needs_freeze() and
heap_freeze_tuple() unchanged, since using the mxact cutoff is
sufficient.
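
Very roughly, and with GetOldestMemberMXactId() being a made-up name for
"minimum over all backends of OldestMemberMXactId[]", the ordering would
be:

    /* sketch only - GetOldestMemberMXactId() does not exist under that name */
    TransactionId oldestXmin;
    TransactionId freezeLimit;
    MultiXactId   mxactCutoff;

    oldestXmin  = GetOldestXmin(/* ... */);             /* xid horizon first */
    freezeLimit = oldestXmin - vacuum_freeze_min_age;   /* xid cutoff as before */

    /*
     * Multi cutoff second, and without any vacuum_freeze_min_age slack: per
     * the argument above, a multi above mxactCutoff cannot contain an xid
     * below freezeLimit, because such an xid would have been reported as
     * running by GetOldestXmin().
     */
    mxactCutoff = GetOldestMemberMXactId();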

Now, one could argue that this is a pessimization since we'll freeze
more; I'd argue that the reduced overhead from not having multis in
those tuples more than offsets that cost.

There are some details to be worked out around when to call
AtEOXact_MultiXact() and similar, but I think it's otherwise a sound
principle.

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Heads-Up: multixact freezing bug

From: Alvaro Herrera
Date: 2013-11-28 14:10:43 -0300
Andres Freund wrote:

> Instead of calculating the multixact cutoff by using the global
> minimum of OldestMemberMXactId[] and OldestVisibleMXactId[] and then
> subtracting vacuum_freeze_min_age, compute it solely as the minimum of
> OldestMemberMXactId[]. If we do that computation *after* doing the
> GetOldestXmin() in vacuum_set_xid_limits(), we can be sure no mxact above
> the new mxact cutoff will contain an xid below the xid cutoff; any such
> xid would otherwise have been reported as running by GetOldestXmin().
> With that change we can leave heap_tuple_needs_freeze() and
> heap_freeze_tuple() unchanged, since using the mxact cutoff is
> sufficient.

Some thoughts here:

1. Using vacuum_freeze_min_age was clearly a poor choice.  Normally
(XIDs are typically consumed much faster than multis), it's far too
large.  In your reported case (per IM discussion), the customer is
approaching 4 billion Xids but is still at 15 million multixacts; so the
relminmxid is still 1, because the default freeze_min_age is 50 million
... so at their current rate, they will wrap around the Xid counter 3-4
times before seeing this minmxid value advance at all.

2. Freezing too much has the disadvantage that you lose info possibly
useful for forensics.  And I believe that freezing a multi just after it
has gone below the immediate visibility horizon makes it live far too
short a time.  Now, the performance guys are always saying they would
like tuples to start life frozen, never mind waiting any number of
transactions before freezing them; but to help those who investigate
and fix corrupted databases, we need a higher freeze horizon.  Heck,
maybe even 100k multis would keep enough evidence to track bugs down.
I propose we keep at least a million.  This argument is even more
important currently, given how buggy the current multixact code has
proven to be.

2a. Freezing less also means less thrashing ...

3. I'm not sure I understand how the proposal above fixes things during
recovery.  If we keep the multi values above the freeze horizon you
propose above, are we certain no old Xid values will remain?

4. Maybe it would be useful to emit a more verbose freezing record in
HEAD, even if we introduce some dirty ugly hack in 9.3 to avoid having
to change WAL format.

4a. Maybe we can introduce a new WAL record in 9.3 anyway and tell
people to always upgrade the replicas before the masters.  (I think we
did this in the past once.)

3 and 4 in combination: maybe we can change 9.3 to not have any
breathing room for freezing, to fix the current swarm of bugs without
having to change WAL format, and do something more invasive in HEAD to
keep more multis around for forensics.

5. The new multixact stuff seems way too buggy.  Should we rip it all
out and return to the old tuple locking scheme?  We spent a huge amount
of time writing it, reviewing it, and now maintaining it, but I haven't
seen a *single* performance report saying how awesome 9.3 is compared to
older releases due to this change; the 9.3 request for testing, at the
start of the beta period, didn't even mention trying it out *at all*.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Heads-Up: multixact freezing bug

From: Andres Freund
On 2013-11-28 14:10:43 -0300, Alvaro Herrera wrote:
> Andres Freund wrote:
> 
> > Instead of calculating the multixact cutoff by using the global
> > minimum of OldestMemberMXactId[] and OldestVisibleMXactId[] and then
> > subtracting vacuum_freeze_min_age, compute it solely as the minimum of
> > OldestMemberMXactId[]. If we do that computation *after* doing the
> > GetOldestXmin() in vacuum_set_xid_limits(), we can be sure no mxact above
> > the new mxact cutoff will contain an xid below the xid cutoff; any such
> > xid would otherwise have been reported as running by GetOldestXmin().
> > With that change we can leave heap_tuple_needs_freeze() and
> > heap_freeze_tuple() unchanged, since using the mxact cutoff is
> > sufficient.

> 2. Freezing too much has the disadvantage that you lose info possibly
> useful for forensics.  And I believe that freezing a multi just after it
> has gone below the immediate visibility horizon makes it live far too
> short a time.  Now, the performance guys are always saying they would
> like tuples to start life frozen, never mind waiting any number of
> transactions before freezing them; but to help those who investigate
> and fix corrupted databases, we need a higher freeze horizon.  Heck,
> maybe even 100k multis would keep enough evidence to track bugs down.
> I propose we keep at least a million.  This argument is even more
> important currently, given how buggy the current multixact code has
> proven to be.

The above proposal wouldn't trigger a full-table vacuum or anything
similar. So it's not like we'd eagerly remove multi xmaxs - and it
wouldn't influence the freezing of anything but the multis.

> 3. I'm not sure I understand how the proposal above fixes things during
> recovery.  If we keep the multi values above the freeze horizon you
> propose above, are we certain no old Xid values will remain?

It would fix things because we could simply set the multi cutoff value
in the xl_heap_freeze record, which would make it trivial to re-perform
freezing on the standby.
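
I.e. something like this - the existing fields are from memory and the
layout is only illustrative, with cutoff_multi being the proposed
addition:

    typedef struct xl_heap_freeze
    {
        RelFileNode   node;
        BlockNumber   block;
        TransactionId cutoff_xid;
        MultiXactId   cutoff_multi;    /* proposed addition */
        /* TUPLE OFFSET NUMBERS FOLLOW AT THE END */
    } xl_heap_freeze;

Redo could then make the same freeze-or-not decision from the two
cutoffs in the record alone, without having to look at multixact members
during recovery.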

The big - and possibly fatal - problem is the amount of conflicts it
creates on standbys.

> 5. The new multixact stuff seems way too buggy.  Should we rip it all
> out and return to the old tuple locking scheme?  We spent a huge amount
> of time writing it, reviewing it, and now maintaining it, but I haven't
> seen a *single* performance report saying how awesome 9.3 is compared to
> older releases due to this change;

I think ripping it out at this point would create as many bugs as it
would fix :(. Especially as we'd still need to keep a large part of the
code around to handle existing multis.

> the 9.3 request for testing, at the
> start of the beta period, didn't even mention trying it out *at all*.

That's pretty sad, given it was a) one of the more awesome and b) one of
the more complicated features.

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services