Thread: Heads-Up: multixact freezing bug
Hello,

Investigating corruption in a client's database we unfortunately found another data-corrupting bug that's relevant for 9.3+:

Since 9.3, heap_tuple_needs_freeze() and heap_freeze_tuple() don't correctly handle the xids contained in a multixact. They check xids and multixact ids separately against their respective cutoffs, but the xids contained *inside* a multixact are never checked. That doesn't have too bad consequences for locking-only multixacts, but it can lead to errors like:

    ERROR: could not access status of transaction 3883960912
    DETAIL: Could not open file "pg_clog/0E78": No such file or directory.

when accessing a tuple. That's because the update xid contained in the multixact is lower than the global datfrozenxid we've truncated the clog to.

Unfortunately that scenario isn't too unlikely: we use vacuum_freeze_min_age as the basis for both the xid and the mxid freezing cutoffs. Since in many cases multis will be generated at a lower rate than xids, we often will not have frozen away all mxids containing xids lower than the xid cutoff.

To recover the data there's the lucky behaviour that HeapTupleSatisfiesVacuum() sets the XMAX_INVALID hint bit if an updating multixact isn't running anymore. So, assuming that a contained update xid outside ShmemVariableCache's [oldestXid, nextXid) range committed, rows will often not spuriously disappear.

I am working on a fix for the issue, but it's noticeably less simple than I initially thought. With the current WAL format the freezing logic needs to work the same during normal processing and recovery. My current thoughts are that we need to check whether any member of a multixact needs freezing. If we find one we do MultiXactIdIsRunning() && MultiXactIdWait() if !InRecovery. That's pretty unlikely to be necessary, but afaics we cannot guarantee it is not. During recovery we do *not* need to do so since the primary will have performed all necessary waits.

The big problem with that solution is that we need to do a GetMultiXactIdMembers() during crash recovery, which is pretty damn ugly. But I *think*, and that's where I really would like some input, given the way multixact WAL logging works that should be safe.

I am not really sure what to do about this. It's quite likely to cause corruption, but the next point release is coming up way too fast for a nontrivial fix.

Thoughts? Better ideas?

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
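To make the missing check concrete, here is a minimal sketch of the kind of per-member test being described, assuming the 9.3-era GetMultiXactIdMembers() signature (MultiXactMember, TransactionIdPrecedes() and pfree() are real PostgreSQL internals; the helper name itself is hypothetical, and no such loop exists in 9.3's heap_tuple_needs_freeze(), which is the bug):

    /*
     * Hypothetical helper, not actual PostgreSQL code: returns true if
     * any member xid of the multixact precedes the xid freeze cutoff.
     * This is the per-member check that heap_tuple_needs_freeze() and
     * heap_freeze_tuple() omit in 9.3.
     */
    static bool
    MultiXactHasMemberBelowCutoff(MultiXactId multi, TransactionId cutoff_xid)
    {
        MultiXactMember *members;
        int              nmembers;
        int              i;
        bool             result = false;

        /* 9.3 signature: returns member count, palloc's *members;
         * the third argument is 9.3's allow_old flag */
        nmembers = GetMultiXactIdMembers(multi, &members, false);
        for (i = 0; i < nmembers; i++)
        {
            if (TransactionIdPrecedes(members[i].xid, cutoff_xid))
            {
                result = true;
                break;
            }
        }
        if (nmembers > 0)
            pfree(members);
        return result;
    }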
On 2013-11-28 16:28:53 +0100, Andres Freund wrote:
> My current thoughts are that we need to check whether any member of a
> multixact needs freezing. If we find one we do MultiXactIdIsRunning() &&
> MultiXactIdWait() if !InRecovery. That's pretty unlikely to be
> necessary, but afaics we cannot guarantee it is not.

> The big problem with that solution is that we need to do a
> GetMultiXactIdMembers() during crash recovery, which is pretty damn
> ugly. But I *think*, and that's where I really would like some input,
> given the way multixact WAL logging works that should be safe.

Bashing my head against the wall some more, I think there's a much better fix:

Instead of calculating the multixact cutoff by using the global minimum of OldestMemberMXactId[] and OldestVisibleMXactId[] and then subtracting vacuum_freeze_min_age, compute it solely as the minimum of OldestMemberMXactId[]. If we do that computation *after* doing the GetOldestXmin() in vacuum_set_xid_limits() we can be sure no mxact above the new mxact cutoff will contain an xid below the xid cutoff. This is so since it would otherwise have been reported as running by GetOldestXmin().

With that change we can leave heap_tuple_needs_freeze() and heap_freeze_tuple() unchanged, since using the mxact cutoff is sufficient.

Now, one could argue that this is a pessimization since we'll freeze more: I argue that the overhead saved by not carrying multis in those tuples more than offsets that cost.

There are some details to work out around when to call AtEOXact_MultiXact() and the like, but I think the principle is otherwise sound.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
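A sketch of that cutoff computation, written as if it lived inside multixact.c since OldestMemberMXactId[], MultiXactState, MultiXactGenLock and MaxOldestSlot are file-local internals there; the wrapper itself is illustrative, and the caller is assumed to invoke it only *after* GetOldestXmin(), which is the crucial ordering:

    /*
     * Illustrative only: compute the multixact freeze cutoff as the
     * minimum of OldestMemberMXactId[] across all backend slots,
     * falling back to nextMXact when no backend is a member of any
     * multi.  Run after GetOldestXmin(), so any multi containing an
     * xid below the xid cutoff is guaranteed to be below this cutoff.
     */
    static MultiXactId
    ComputeMultiXactFreezeCutoff(void)
    {
        MultiXactId cutoff;
        int         i;

        LWLockAcquire(MultiXactGenLock, LW_SHARED);

        /* Start from the next multixact to be assigned... */
        cutoff = MultiXactState->nextMXact;
        if (cutoff < FirstMultiXactId)
            cutoff = FirstMultiXactId;

        /* ...and clamp to the oldest multi any backend is a member of. */
        for (i = 1; i <= MaxOldestSlot; i++)
        {
            MultiXactId thisoldest = OldestMemberMXactId[i];

            if (MultiXactIdIsValid(thisoldest) &&
                MultiXactIdPrecedes(thisoldest, cutoff))
                cutoff = thisoldest;
        }

        LWLockRelease(MultiXactGenLock);

        return cutoff;
    }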
Andres Freund wrote:
> Instead of calculating the multixact cutoff by using the global
> minimum of OldestMemberMXactId[] and OldestVisibleMXactId[] and then
> subtracting vacuum_freeze_min_age, compute it solely as the minimum of
> OldestMemberMXactId[]. If we do that computation *after* doing the
> GetOldestXmin() in vacuum_set_xid_limits() we can be sure no mxact
> above the new mxact cutoff will contain an xid below the xid cutoff.
> This is so since it would otherwise have been reported as running by
> GetOldestXmin().

> With that change we can leave heap_tuple_needs_freeze() and
> heap_freeze_tuple() unchanged, since using the mxact cutoff is
> sufficient.

Some thoughts here:

1. Using vacuum_freeze_min_age was clearly a poor choice. Normally (XIDs are normally consumed much faster than multis), it's far too large. In your reported case (per IM discussion), the customer is approaching 4 billion Xids but is still at 15 million multixids; so the relminmxid is still 1, because the default freeze_min_age is 50 million ... so at their current rate, they will wrap around the Xid counter 3-4 times before seeing this minmxid value advance at all.

2. Freezing too much has the disadvantage that you lose information that may be useful for forensics. And I believe that freezing a multi just after it has gone below the immediate visibility horizon will make multis live far too short a time. Now the performance guys are always saying they would like tuples to start life frozen, never mind delaying them any number of transactions before freezing; but to help the case for those who investigate and fix corrupted databases, we need a higher freeze horizon. Heck, maybe even 100k multis would be enough to keep sufficient evidence to track bugs down; I propose we keep at least a million. This argument is even more important right now, given how buggy the current multixact code has proven to be.

2a. Freezing less also means less thrashing ...

3. I'm not sure I understand how the proposal above fixes things during recovery. If we keep the multi values above the freeze horizon you propose, are we certain no old Xid values will remain?

4. Maybe it would be useful to emit a more verbose freezing record in HEAD, even if we introduce some dirty ugly hack in 9.3 to avoid having to change the WAL format.

4a. Maybe we can introduce a new WAL record in 9.3 anyway and tell people to always upgrade the replicas before the masters. (I think we did this in the past once.)

3 and 4 in combination: maybe we can change 9.3 to not leave any breathing room for freezing, to fix the current swarm of bugs without having to change the WAL format, and do something more invasive in HEAD to keep more multis around for forensics.

5. The new multixact stuff seems way too buggy. Should we rip it all out and return to the old tuple locking scheme? We spent a huge amount of time writing it, reviewing it, and now maintaining it, but I haven't seen a *single* performance report saying how awesome 9.3 is compared to older releases due to this change; the 9.3 request for testing, at the start of the beta period, didn't even mention trying it out *at all*.

--
Álvaro Herrera                    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
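The back-of-the-envelope arithmetic behind point 1, using the round numbers from the report above (roughly 4 billion xids consumed against 15 million multis, with the default vacuum_freeze_min_age of 50 million applied as the multi cutoff):

    xids per multi:                  4e9 / 15e6        ~ 267
    multis until relminmxid moves:   50e6 (vacuum_freeze_min_age)
    xids consumed meanwhile:         50e6 * 267        ~ 1.3e10
    xid wraparounds in that time:    1.3e10 / 2^32     ~ 3.1

which is where the "wrap around the Xid counter 3-4 times" figure comes from.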
On 2013-11-28 14:10:43 -0300, Alvaro Herrera wrote:
> Andres Freund wrote:
> > Instead of calculating the multixact cutoff by using the global
> > minimum of OldestMemberMXactId[] and OldestVisibleMXactId[] and then
> > subtracting vacuum_freeze_min_age, compute it solely as the minimum of
> > OldestMemberMXactId[]. If we do that computation *after* doing the
> > GetOldestXmin() in vacuum_set_xid_limits() we can be sure no mxact
> > above the new mxact cutoff will contain an xid below the xid cutoff.
> > This is so since it would otherwise have been reported as running by
> > GetOldestXmin().
> > With that change we can leave heap_tuple_needs_freeze() and
> > heap_freeze_tuple() unchanged, since using the mxact cutoff is
> > sufficient.

> 2. Freezing too much has the disadvantage that you lose information
> that may be useful for forensics. And I believe that freezing a multi
> just after it has gone below the immediate visibility horizon will make
> multis live far too short a time. Now the performance guys are always
> saying they would like tuples to start life frozen, never mind delaying
> them any number of transactions before freezing; but to help the case
> for those who investigate and fix corrupted databases, we need a higher
> freeze horizon. Heck, maybe even 100k multis would be enough to keep
> sufficient evidence to track bugs down; I propose we keep at least a
> million. This argument is even more important right now, given how
> buggy the current multixact code has proven to be.

The above proposal wouldn't trigger a full-table vacuum or anything similar, so it's not as if we'd eagerly remove multi xmaxs - and it wouldn't influence the freezing of anything but the multis themselves.

> 3. I'm not sure I understand how the proposal above fixes things during
> recovery. If we keep the multi values above the freeze horizon you
> propose, are we certain no old Xid values will remain?

It would fix things because we could simply set the multi cutoff value in the xl_heap_freeze record, which would make it trivial to re-perform the freezing on the standby. The big - and possibly fatal - problem is the amount of conflicts it creates on standbys.

> 5. The new multixact stuff seems way too buggy. Should we rip it all
> out and return to the old tuple locking scheme? We spent a huge amount
> of time writing it, reviewing it, and now maintaining it, but I haven't
> seen a *single* performance report saying how awesome 9.3 is compared
> to older releases due to this change;

I think ripping it out at this point would create as many bugs as it would fix :(. Especially as we'd still need a large part of the code to handle existing multis.

> the 9.3 request for testing, at the
> start of the beta period, didn't even mention trying it out *at all*.

That's pretty sad, given it was a) one of the more awesome and b) one of the more complicated features.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
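To illustrate the xl_heap_freeze idea: a sketch of a freeze record that carries the multixact cutoff alongside the xid cutoff, so a standby can redo the same freezing decisions without consulting multixact state during recovery. The real record is declared in access/heapam_xlog.h; treat this layout as illustrative of the proposal, not as the exact shipped definition, since whether and how cutoff_multi gets populated is precisely what is under discussion:

    /*
     * Illustrative sketch of a freeze WAL record carrying both cutoffs.
     * An array of offset numbers identifying the tuples to freeze on
     * the page would follow the header, as in heapam_xlog.h.
     */
    typedef struct xl_heap_freeze
    {
        RelFileNode   node;          /* relation being frozen */
        BlockNumber   block;         /* heap page */
        TransactionId cutoff_xid;    /* xids before this get frozen */
        MultiXactId   cutoff_multi;  /* multis before this get removed */
    } xl_heap_freeze;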