Thread: Rework the way multixact truncations work
Hi, As discussed on list, over IM and in person at pgcon I want to make multixact truncations be WAL logged to address various bugs. Since that's a comparatively large and invasive change I thought it'd be a good idea to start a new thread instead of burying it in a already long thread. Here's the commit message which hopefully explains what's being changed and why: Rework the way multixact truncations work. The fact that multixact truncations are not WAL logged has caused a fair share of problems. Amongst others it requires to do computations during recovery while the database is not in a consistent state, delaying truncations till checkpoints, and handling members being truncated, but offset not. We tried to put bandaids on lots of these issues over the last years, but it seems time to change course. Thus this patch introduces WAL logging for truncation, even in the back branches. This allows: 1) to perform the truncation directly during VACUUM, instead of delaying it to the checkpoint. 2) to avoid looking at the offsets SLRU for truncation during recovery, we can just use the master's values. 3) simplify a fair amount of logic to keep in memory limits straight, this has gotten much easier During the course of fixing this a bunch of bugs had to be fixed: 1) Data was not purged from memory the member's slru before deleting segments. This happend to be hard or impossible to hit due to the interlock between checkpoints and truncation. 2) find_multixact_start() relied on SimpleLruDoesPhysicalPageExist - but that doesn't work for offsets that haven't yet been flushed to disk. Flush out before running to fix. Not pretty, but it feels slightly safer to only make decisions based on on-disk state. To handle the case of an updated standby replaying WAL from a not-yet upgraded primary we have to recognize that situation and use "old style" truncation (i.e. looking at the SLRUs) during WAL replay. In contrast to before this now happens in the startup process, when replaying a checkpoint record, instead of the checkpointer. Doing this in the restartpoint was incorrect, they can happen much later than the original checkpoint, thereby leading to wraparound. It's also more in line to how the WAL logging now works. To avoid "multixact_redo: unknown op code 48" errors standbys should be upgraded before primaries. This needs to be expressed clearly in the release notes. Backpatch to 9.3, where the use of multixacts was expanded. Arguably this could be backpatched further, but there doesn't seem to be sufficient benefit to outweigh the risk of applying a significantly different patch there. I've tested this a bunch, including using a newer standby against a older master and such. What I have yet to test is that the concurrency protections against multiple backends truncating at the same time are correct. It'd be very welcome to see some wider testing and review on this. I've attached three commits: 0001: Add functions to burn through multixacts - that should get its own file. 0002: Lower the lower bound limits for *_freeze_max_age - I think we should just do that. There really is no reason for the current limits and they make testing hard and force space wastage. 0003: The actual truncation patch. Greetings, Andres Freund
Attachment
Andres Freund wrote: > Rework the way multixact truncations work. I spent some time this morning reviewing this patch and had some comments that I relayed over IM to Andres. The vast majority is cosmetic, but there are two larger things: 1. I think this part of PerformMembersTruncation() is very confusing: /* verify whether the current segment is to be deleted */ if (segment != startsegment && segment != endsegment) SlruDeleteSegment(MultiXactMemberCtl, segment); I think this works correctly in that it preserves both endpoint files, but the files in between are removed ... which is a confusing interface, IMO. I think this merits a longer explanation. 2. We set PGXACT->delayChkpt while the truncation is executed. This seems reasonable, and there's a good reason for it, but all the other users of this facility only do small operations with this thing grabbed, while the multixact truncation could take a long time because a large number of files might be deleted. Maybe it's not a problem to have checkpoints be delayed by several seconds, or who knows maybe even a minute in a busy system. (We will have checkpointer sleeping in 10ms intervals until the truncation is complete). Maybe this is fine, not sure. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2015-06-26 14:48:35 -0300, Alvaro Herrera wrote: > Andres Freund wrote: > > > Rework the way multixact truncations work. > > I spent some time this morning reviewing this patch and had some > comments that I relayed over IM to Andres. Thanks for that! > 2. We set PGXACT->delayChkpt while the truncation is executed. This > seems reasonable, and there's a good reason for it, but all the other > users of this facility only do small operations with this thing grabbed, > while the multixact truncation could take a long time because a large > number of files might be deleted. Maybe it's not a problem to have > checkpoints be delayed by several seconds, or who knows maybe even a > minute in a busy system. (We will have checkpointer sleeping in 10ms > intervals until the truncation is complete). I don't think this is a problem. Consider that we're doing all this in the checkpointer today, blocking much more than just the actual xlog insertion. That's a bigger problem, as we'll not do the paced writing during that and such. The worst thatthis can cause is a bunch of sleeps, that seems fairly harmless. Greetings, Andres Freund
On Mon, Jun 22, 2015 at 7:24 AM, Andres Freund <andres@anarazel.de> wrote: > It'd be very welcome to see some wider testing and review on this. I started looking at this and doing some testing. Here is some initial feedback: Perhaps vac_truncate_clog needs a new name now that it does more, maybe vac_truncate_transaction_logs? MultiXactState->sawTruncationCkptCyle: is 'Cyle' supposed to be 'Cycle'? In the struct xl_multixact_truncate, and also the function WriteMTruncateXlogRec and other places, I think you have confused the typedefs MultiXactOffset and MultiXactId. If I'm not mistaken, startMemb and endMemb have the correct type, but startOff and endOff should be of type MultiXactId despite their names (the *values* stored inside pg_multixact/offsets are indeed offsets (into pg_multixact/members), but their *location* is what a multixact ID represents). I was trying to understand if there could be any problem caused by setting latest_page_number to the pageno that holds (or will hold) xlrec.endOff in multixact_redo. The only thing that jumps out at me is that the next call to SlruSelectLRUPage will no longer be prevented from evicting the *real* latest page (the most recently created page). In SlruDeleteSegment, is it OK to call unlink() while holding the SLRU control lock? In find_multixact_start you call SimpleLruFlush before calling SimpleLruDoesPhysicalPageExist. Should we do something like this instead? https://gist.github.com/macdice/8e5b2f0fe3827fdf3d5a I think saw some extra autovacuum activity that I didn't expect in a simple scenario, but I'm not sure and will continue looking tomorrow. -- Thomas Munro http://www.enterprisedb.com
On 2015-06-29 23:54:40 +1200, Thomas Munro wrote: > On Mon, Jun 22, 2015 at 7:24 AM, Andres Freund <andres@anarazel.de> wrote: > > It'd be very welcome to see some wider testing and review on this. > > I started looking at this and doing some testing. Here is some > initial feedback: > > Perhaps vac_truncate_clog needs a new name now that it does more, > maybe vac_truncate_transaction_logs? It has done more before, so I don't really see a connection to this patch... > MultiXactState->sawTruncationCkptCyle: is 'Cyle' supposed to be 'Cycle'? Oops. > In the struct xl_multixact_truncate, and also the function > WriteMTruncateXlogRec and other places, I think you have confused the > typedefs MultiXactOffset and MultiXactId. If I'm not mistaken, > startMemb and endMemb have the correct type, but startOff and endOff > should be of type MultiXactId despite their names (the *values* stored > inside pg_multixact/offsets are indeed offsets (into > pg_multixact/members), but their *location* is what a multixact ID > represents). IIRC I did it that way to make clear this is just 'byte' type offsets, without other meaning. Wasn't a good idea. > I was trying to understand if there could be any problem caused by > setting latest_page_number to the pageno that holds (or will hold) > xlrec.endOff in multixact_redo. The only thing that jumps out at me > is that the next call to SlruSelectLRUPage will no longer be prevented > from evicting the *real* latest page (the most recently created page). That hasn't changed unless I miss something? > In SlruDeleteSegment, is it OK to call unlink() while holding the SLRU > control lock? I think it's safer than not doing it, but don't particularly care. > In find_multixact_start you call SimpleLruFlush before calling > SimpleLruDoesPhysicalPageExist. Should we do something like this > instead? https://gist.github.com/macdice/8e5b2f0fe3827fdf3d5a I'm currently slightly inclined to do it "my way". They way these functions are used it doesn't seem like a bad property to ensure things are on disk. > I think saw some extra autovacuum activity that I didn't expect in a > simple scenario, but I'm not sure and will continue looking tomorrow. Cool, thanks! Greetings, Andres Freund
Hi, I'm still working on this patch. I found a bunch of issues. Non of them super critical and some pre-existing; but nonetheless I don't feel like it's ready to push yet. So we'll have the first alpha without those fixes :( The biggest issues so far are: * There's, both in the posted patch and as-is in all the branches, a lot of places that aren't actually safe against a concurrent truncation. A bunch of places grab oldestMultiXactId with an lwlock held, release it, and then make decisions based on that. A bunch of places (including the find_multixact_start callsites) is actually vulnerable to that, for a bunch of others it's less likely to be a problem. All callers of GetMultiXactIdMembers() are vulnerable, but with the exception of pg_get_multixact_members() they'll never pass in a value that's older than the new oldest member value. That's a problem for the current branches. Afaics that can lead to a useless round of emergency autovacs via SetMultiXactIdLimit()->SetOffsetVacuumLimit(). SetOffsetVacuumLimit() can protect easily agains that by taking the new MultiXactTruncationLock lock. We could do the same for pg_get_multixact_members() - afaics the only caller that'll look up too old values otherwise - but I don't think it matters, you'll get a slightly obscure error if you access a too old xact that's just being truncated away and that is that. * There was no update of the in-memory oldest* values when replaying a truncation. That's problematic if a standby is promoted after replaying a truncation record, but before a following checkpoint record. This would be fixed by an emergency autovacuum, but that's obviously not nice. Trivial to fix. * The in-memory oldest values were updated *after* the truncation happened. It's unlikely to matter in reality, but it's safer to update them before, so a concurrent GetMultiXactIdMembers() of stuff from before the truncation will get the proper error. * PerformMembersTruncation() probably confused Alvaro because it wasn't actually correct - there's no need to have the segment containing old oldestOffset (in contrast to oldestOffsetAlive) survive. Except leaking a segment that's harmless, but obviously not desirable. Additionally I'm changing some stuff, some requested by review: * xl_multixact_truncate's members are now called (start|end)Trunc(Off|Memb) * (start|end)TruncOff have the appropriate type now * typo fixes * comment improvements * pgindent New version attached. Greetings, Andres Freund
Attachment
On Mon, Jun 29, 2015 at 3:48 PM, Andres Freund <andres@anarazel.de> wrote: > New version attached. 0002 looks good, but the commit message should perhaps mention the comment fix. Or commit that separately. Will look at 0003 next. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Jul 2, 2015 at 11:52 AM, Robert Haas <robertmhaas@gmail.com> wrote: > Will look at 0003 next. + appendStringInfo(buf, "offsets [%u, %u), members [%u, %u)", I don't think we typically use this style for notating intervals. case XLOG_MULTIXACT_CREATE_ID: id = "CREATE_ID"; break; + case XLOG_MULTIXACT_TRUNCATE_ID: + id = "TRUNCATE"; + break; If XLOG_MULTIXACT_CREATE_ID -> "CREATE_ID", then why not XLOG_MULTIXACT_TRUNCATE_ID -> "TRUNCATE_ID"? + * too old to general truncation records. s/general/generate/ + MultiXactId oldestMXactDB; Data type should be OID. + * Recompute limits as we are now fully started, we now can correctly + * compute how far a members wraparound is away. s/,/:/ or something. This isn't particularly good English as written. + * Computing the actual limits is only possible once the data directory is + * in a consistent state. There's no need to compute the limits while + * still replaying WAL as no new multis can be created anyway. So we'll + * only do further checks after TrimMultiXact() has been called. Multis can be and are created during replay. What this should really say is that we have no choice about whether to create them or not: we just have to replay whatever's there. + (errmsg("performing legacy multixact truncation, upgrade master"))); This message needs work. I'm not sure exactly what it should say, but I'm pretty sure that's not clear enough. I seriously, seriously doubt that it is a good idea to perform the legacy truncation from MultiXactAdvanceOldest() rather than TruncateMultiXact(). The checkpoint hasn't really happened at that point yet; you might truncate away stuff, then crash before the checkpoint is complete, and then we you restart recovery, you've got trouble. I think you should be very, very cautious about rejiggering the order of operations here. The current situation is not good, but casually rejiggering it can make things much worse. - * If no multixacts exist, then oldestMultiXactId will be the next - * multixact that will be created, rather than an existing multixact. + * Determine the offset of the oldest multixact. Normally, we can read + * the offset from the multixact itself, but there's an important special + * case: if there are no multixacts in existence at all, oldestMXact + * obviously can't point to one. It will instead point to the multixact + * ID that will be assigned the next time one is needed. There's no need to change this; it means the same thing either way. Generally, I think you've weakened the logic in SetOffsetVacuumLimit() appreciably here. The existing code is careful never to set oldestOffsetKnown false when it was previously true. Your rewrite removes that property. Generally, I see no need for this function to be overhauled to the extent that you have, and would suggest reverting the changes that aren't absolutely required. I don't particularly like the fact that find_multixact_start() calls SimpleLruFlush(). I think that's really a POLA violation: you don't expect that a function that looks like a simple inquiry is going to go do a bunch of unrelated I/O in the background. If somebody called find_multixact_start() with any frequency, you'd be sad. You're just doing it this way because of the context *in which you expect find_multixact_start* to be called, which does not seem very future-proof. I prefer Thomas's approach. If TruncateMultiXact() fails to acquire MultiXactTruncationLock right away, does it need to wait, or could it ConditionalAcquire and bail out if the lock isn't obtained? + * Make sure to only attempt truncation if there's values to truncate + * away. In normal processing values shouldn't go backwards, but there's + * some corner cases (due to bugs) where that's possible. I think this comment should be more detailed. Is that talking about the same thing as this comment: - * Due to bugs in early releases of PostgreSQL 9.3.X and 9.4.X, - * oldestMXact might point to a multixact that does not exist. - * Autovacuum will eventually advance it to a value that does exist, - * and we want to set a proper offsetStopLimit when that happens, - * so call DetermineSafeOldestOffset here even if we're not actually - * truncating. This comment seems to be saying there's a race condition: + * XXX: It's also possible that the page that oldestMXact is on has + * already been truncated away, and we crashed before updating + * oldestMXact. But why is that an XXX? I think this is just a case of recovery needing tolerate redo of an action already redone. I'm not convinced that it's a good idea to remove lastCheckpointedOldest and replace it with nothing. It seems like a very good idea to have two separate pieces of state in shared memory: - The oldest offset that we think anyone might need to access to make a visibility check for a tuple. - The oldest offset that we still have on disk. The latter would now need to be called something else, not lastCheckpointedOldest, but I think it's still good to have it. Otherwise, I don't see how you protect against the on-disk state wrapping around before you finish truncating, and then maybe truncation eats something that was busy getting reused. We might be kind of hosed in that situation anyway, because TruncateMultiXact() and some other places assume that circular comparisons will return sensible values. But that could be fixed, and probably should be fixed eventually. + (errmsg("supposedly still alive MultiXact %u not found, skipping truncation", Maybe "cannot truncate MultiXact %u because it does not exist on disk, skipping truncation"? I think "frozenMulti" is a slightly confusing variable name and deserves a comment. AUIU, that's the oldest multiXact we need to keep. So it's actually the oldest multi that is NOT guaranteed to be frozen. minMulti might be a better variable name, but a comment is justified either way. + * Update in-memory limits before performing the truncation, while inside + * the critical section: Have to do it before truncation, to prevent + * concurrent lookups of those values. Has to be inside the critical + * section asotherwise a future call to this function would error out, + * while looking up the oldest member in offsets, if our caller crashes + * before updating the limits. Missing space: asotherwise. Who else might be concurrently looking up those values? Nobody else can truncate while we're truncating, because we hold MultiXactTruncationLock. And nobody else should be getting here from looking up tuples, because if they are, we truncated too soon. - * Startup MultiXact. We need to do this early for two reasons: one is - * that we might try to access multixacts when we do tuple freezing, and - * the other is we need its state initialized because we attempt - * truncation during restartpoints. + * Startup MultiXact, we need to do this early, to be able to replay + * truncations. The period after "Startup MultiXact" was more correct than the comma you've replaced it with. Phew. That's all I see on a first read-through, but I've only spent a couple of hours on this, so I might easily have missed some things. But let me stop here, hit send, and see what you think of these comments. Thanks, -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2015-07-02 13:58:45 -0400, Robert Haas wrote: > On Thu, Jul 2, 2015 at 11:52 AM, Robert Haas <robertmhaas@gmail.com> wrote: > > Will look at 0003 next. > > + appendStringInfo(buf, "offsets [%u, %u), members [%u, %u)", > > I don't think we typically use this style for notating intervals. I don't think we really have a very consistent style for xlog messages - this seems to describe the meaning accurately? > [several good points] > + (errmsg("performing legacy multixact truncation, > upgrade master"))); > > This message needs work. I'm not sure exactly what it should say, but > I'm pretty sure that's not clear enough. > > I seriously, seriously doubt that it is a good idea to perform the > legacy truncation from MultiXactAdvanceOldest() rather than > TruncateMultiXact(). But where should TruncateMultiXact() be called from? I mean, we could move the logic from inside MultiXactAdvanceOldest() to some special case in the replay routine, but what'd be the advantage? > The checkpoint hasn't really happened at that point yet; you might > truncate away stuff, then crash before the checkpoint is complete, and > then we you restart recovery, you've got trouble. We're only talking about restartpoints here, right? And I don't see the problem - we don't read the slru anymore until the end of recovery, and the end of recovery can't happen before reaching the minimum revovery location? > I think you should > be very, very cautious about rejiggering the order of operations here. > The current situation is not good, but casually rejiggering it can > make things much worse. The current placement - as part of the restartpoint - is utterly broken and unpredictable. There'll frequently be no restartpoints performed at all (due to different checkpoint segments, slow writeout, or pending actions). Because there's no careful timing of when this happens it's much harder to understand the exact state in which the truncation happens - I think moving it to a specific location during replay makes things considerably easier. > Generally, I think you've weakened the logic in SetOffsetVacuumLimit() > appreciably here. The existing code is careful never to set > oldestOffsetKnown false when it was previously true. Your rewrite > removes that property. Generally, I see no need for this function to > be overhauled to the extent that you have, and would suggest reverting > the changes that aren't absolutely required. A lot of that has to do that the whole stuff about truncations happening during checkpoints is gone and that thus the split with DetermineSafeOldestOffset() doesn't make sense anymore. That oldestOffsetKnown is unset is wrong - the if (prevOldestOffsetKnown && !oldestOffsetKnown) block should be a bit earlier. > I don't particularly like the fact that find_multixact_start() calls > SimpleLruFlush(). I think that's really a POLA violation: you don't > expect that a function that looks like a simple inquiry is going to go > do a bunch of unrelated I/O in the background. If somebody called > find_multixact_start() with any frequency, you'd be sad. You're just > doing it this way because of the context *in which you expect > find_multixact_start* to be called, which does not seem very > future-proof. I prefer Thomas's approach. I don't strongly care, but I do think it has some value to be sure about the on-disk state for the current callers. I think it'd be a pretty odd thing to call find_multixact_start() frequently. > If TruncateMultiXact() fails to acquire MultiXactTruncationLock right > away, does it need to wait, or could it ConditionalAcquire and bail > out if the lock isn't obtained? That seems like premature optimization to me. And one that's not that easy to do correctly - what if the current caller actually has a new, lower, minimum mxid? > + * XXX: It's also possible that the page that oldestMXact is on has > + * already been truncated away, and we crashed before updating > + * oldestMXact. > > But why is that an XXX? I think this is just a case of recovery > needing tolerate redo of an action already redone. Should rather have been NB. > I'm not convinced that it's a good idea to remove > lastCheckpointedOldest and replace it with nothing. It seems like a > very good idea to have two separate pieces of state in shared memory: > - The oldest offset that we think anyone might need to access to make > a visibility check for a tuple. > - The oldest offset that we still have on disk. > > The latter would now need to be called something else, not > lastCheckpointedOldest, but I think it's still good to have it. > Otherwise, I don't see how you protect against the on-disk state > wrapping around before you finish truncating, and then maybe > truncation eats something that was busy getting reused. Unless I miss something the stop limits will prevent that from happening? SetMultiXactIdLimit() is called only *after* the truncation has finished? > + * Update in-memory limits before performing the truncation, while inside > + * the critical section: Have to do it before truncation, to prevent > + * concurrent lookups of those values. Has to be inside the critical > + * section asotherwise a future call to this function would error out, > + * while looking up the oldest member in offsets, if our caller crashes > + * before updating the limits. > > Missing space: asotherwise. > > Who else might be concurrently looking up those values? Nobody else > can truncate while we're truncating, because we hold > MultiXactTruncationLock. And nobody else should be getting here from > looking up tuples, because if they are, we truncated too soon. pg_get_multixact_members(), a concurrent call to SetMultiXactIdLimit() (SetOffsetLimit()->find_multixact_start()) from vac_truncate_clog(). > Phew. That's all I see on a first read-through, but I've only spent a > couple of hours on this, so I might easily have missed some things. > But let me stop here, hit send, and see what you think of these > comments. Thanks for the look so far! Greetings, Andres Freund
On Thu, Jul 2, 2015 at 2:28 PM, Andres Freund <andres@anarazel.de> wrote: > On 2015-07-02 13:58:45 -0400, Robert Haas wrote: >> On Thu, Jul 2, 2015 at 11:52 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> > Will look at 0003 next. >> >> + appendStringInfo(buf, "offsets [%u, %u), members [%u, %u)", >> >> I don't think we typically use this style for notating intervals. > > I don't think we really have a very consistent style for xlog messages - > this seems to describe the meaning accurately? Although I realize this is supposed to be interval notation, I'm not sure everyone will immediately figure that out. I believe it has created some confusion in the past. I'm not going to spend a lot of time arguing with you about it, but I'd do something else, like offsets from %u stop before %u, members %u stop before %u. >> [several good points] > >> + (errmsg("performing legacy multixact truncation, >> upgrade master"))); >> >> This message needs work. I'm not sure exactly what it should say, but >> I'm pretty sure that's not clear enough. >> >> I seriously, seriously doubt that it is a good idea to perform the >> legacy truncation from MultiXactAdvanceOldest() rather than >> TruncateMultiXact(). > > But where should TruncateMultiXact() be called from? I mean, we could > move the logic from inside MultiXactAdvanceOldest() to some special case > in the replay routine, but what'd be the advantage? I think you should call it from where TruncateMultiXact() is being called from today. Doing legacy truncations from a different place than we're currently doing them just gives us more ways to be wrong. >> The checkpoint hasn't really happened at that point yet; you might >> truncate away stuff, then crash before the checkpoint is complete, and >> then we you restart recovery, you've got trouble. > > We're only talking about restartpoints here, right? And I don't see the > problem - we don't read the slru anymore until the end of recovery, and > the end of recovery can't happen before reaching the minimum revovery > location? You're still going to have to read the SLRU for as long as you are doing legacy truncations, at least. >> If TruncateMultiXact() fails to acquire MultiXactTruncationLock right >> away, does it need to wait, or could it ConditionalAcquire and bail >> out if the lock isn't obtained? > > That seems like premature optimization to me. And one that's not that > easy to do correctly - what if the current caller actually has a new, > lower, minimum mxid? Doesn't the next truncation just catch up? But sure, I agree this is inessential (and maybe better left alone for now). >> I'm not convinced that it's a good idea to remove >> lastCheckpointedOldest and replace it with nothing. It seems like a >> very good idea to have two separate pieces of state in shared memory: > >> - The oldest offset that we think anyone might need to access to make >> a visibility check for a tuple. >> - The oldest offset that we still have on disk. >> >> The latter would now need to be called something else, not >> lastCheckpointedOldest, but I think it's still good to have it. > >> Otherwise, I don't see how you protect against the on-disk state >> wrapping around before you finish truncating, and then maybe >> truncation eats something that was busy getting reused. > > Unless I miss something the stop limits will prevent that from > happening? SetMultiXactIdLimit() is called only *after* the truncation > has finished? Hmm, that might be, I'd have to reread the patch. The reason we originally had it this way was because VACUUM was updating the limit and then checkpoint was truncating, but now I guess vacuum + truncate happen so close together that you might only need one value. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
(quick answer, off now) On 2015-07-05 14:20:11 -0400, Robert Haas wrote: > On Thu, Jul 2, 2015 at 2:28 PM, Andres Freund <andres@anarazel.de> wrote: > > On 2015-07-02 13:58:45 -0400, Robert Haas wrote: > >> I seriously, seriously doubt that it is a good idea to perform the > >> legacy truncation from MultiXactAdvanceOldest() rather than > >> TruncateMultiXact(). > > > > But where should TruncateMultiXact() be called from? I mean, we could > > move the logic from inside MultiXactAdvanceOldest() to some special case > > in the replay routine, but what'd be the advantage? > > I think you should call it from where TruncateMultiXact() is being > called from today. Doing legacy truncations from a different place > than we're currently doing them just gives us more ways to be wrong. The problem with that is that the current location is just plain wrong. Restartpoints can be skipped (due different checkpoint segments settings), may not happen at all (pending incomplete actions), and can just be slowed down. That's a currently existing bug that's easy to reproduce. > >> The checkpoint hasn't really happened at that point yet; you might > >> truncate away stuff, then crash before the checkpoint is complete, and > >> then we you restart recovery, you've got trouble. > > > > We're only talking about restartpoints here, right? And I don't see the > > problem - we don't read the slru anymore until the end of recovery, and > > the end of recovery can't happen before reaching the minimum revovery > > location? > > You're still going to have to read the SLRU for as long as you are > doing legacy truncations, at least. I'm not following. Sure, we read the SLRUs as we do today. But, in contrast to the current positioning in recovery, with the patch they're done at pretty much the same point on the standby as on the primary today? Greetings, Andres Freund
On Sun, Jul 5, 2015 at 2:28 PM, Andres Freund <andres@anarazel.de> wrote: > (quick answer, off now) > > On 2015-07-05 14:20:11 -0400, Robert Haas wrote: >> On Thu, Jul 2, 2015 at 2:28 PM, Andres Freund <andres@anarazel.de> wrote: >> > On 2015-07-02 13:58:45 -0400, Robert Haas wrote: >> >> I seriously, seriously doubt that it is a good idea to perform the >> >> legacy truncation from MultiXactAdvanceOldest() rather than >> >> TruncateMultiXact(). >> > >> > But where should TruncateMultiXact() be called from? I mean, we could >> > move the logic from inside MultiXactAdvanceOldest() to some special case >> > in the replay routine, but what'd be the advantage? >> >> I think you should call it from where TruncateMultiXact() is being >> called from today. Doing legacy truncations from a different place >> than we're currently doing them just gives us more ways to be wrong. > > The problem with that is that the current location is just plain > wrong. Restartpoints can be skipped (due different checkpoint segments > settings), may not happen at all (pending incomplete actions), and can > just be slowed down. > > That's a currently existing bug that's easy to reproduce. You might be right; I haven't tested that. On the other hand, in the common case, by the time we perform a restartpoint, we're consistent: I think the main exception to that is if we do a base backup that spans multiple checkpoints. I think that in the new location, the chances that the legacy truncation is trying to read inconsistent data is probably higher. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On July 5, 2015 8:50:57 PM GMT+02:00, Robert Haas <robertmhaas@gmail.com> wrote: >On Sun, Jul 5, 2015 at 2:28 PM, Andres Freund <andres@anarazel.de> >wrote: >> (quick answer, off now) >> >> On 2015-07-05 14:20:11 -0400, Robert Haas wrote: >>> On Thu, Jul 2, 2015 at 2:28 PM, Andres Freund <andres@anarazel.de> >wrote: >>> > On 2015-07-02 13:58:45 -0400, Robert Haas wrote: >>> >> I seriously, seriously doubt that it is a good idea to perform >the >>> >> legacy truncation from MultiXactAdvanceOldest() rather than >>> >> TruncateMultiXact(). >>> > >>> > But where should TruncateMultiXact() be called from? I mean, we >could >>> > move the logic from inside MultiXactAdvanceOldest() to some >special case >>> > in the replay routine, but what'd be the advantage? >>> >>> I think you should call it from where TruncateMultiXact() is being >>> called from today. Doing legacy truncations from a different place >>> than we're currently doing them just gives us more ways to be wrong. >> >> The problem with that is that the current location is just plain >> wrong. Restartpoints can be skipped (due different checkpoint >segments >> settings), may not happen at all (pending incomplete actions), and >can >> just be slowed down. >> >> That's a currently existing bug that's easy to reproduce. > >You might be right; I haven't tested that. > >On the other hand, in the common case, by the time we perform a >restartpoint, we're consistent: I think the main exception to that is >if we do a base backup that spans multiple checkpoints. I think that >in the new location, the chances that the legacy truncation is trying >to read inconsistent data is probably higher. The primary problem isn't that we truncate too early, it's that we delay truncation on the standby in comparison to the primaryby a considerable amount. All the while continuing to replay multi creations. I don't see the difference wrt. consistency right now, but I don't have access to the code right now. I mean we *have* todo something while inconsistent. A start/stop backup can easily span a day or four. Andres --- Please excuse brevity and formatting - I am writing this on my mobile phone.
On Sun, Jul 5, 2015 at 3:16 PM, Andres Freund <andres@anarazel.de> wrote: >>On the other hand, in the common case, by the time we perform a >>restartpoint, we're consistent: I think the main exception to that is >>if we do a base backup that spans multiple checkpoints. I think that >>in the new location, the chances that the legacy truncation is trying >>to read inconsistent data is probably higher. > > The primary problem isn't that we truncate too early, it's that we delay truncation on the standby in comparison to theprimary by a considerable amount. All the while continuing to replay multi creations. > > I don't see the difference wrt. consistency right now, but I don't have access to the code right now. I mean we *have*to do something while inconsistent. A start/stop backup can easily span a day or four. So, where are we with this patch? In my opinion, we ought to do something about master and 9.5 before beta, so that we're doing *yet another* major release with unfixed multixact bugs. Let's make the relevant truncation changes in master and 9.5 and bump the WAL page magic, so that a 9.5alpha standby can't be used with a 9.5beta master. Then, we don't need any of this legacy truncation stuff at all, and 9.5 is hopefully in a much better state than 9.4 and 9.3. Now, that still potentially leaves 9.4 and 9.3 users hanging out to dry. But we don't have a tremendous number of those people clamoring about this, and if we get 9.5+ correct, then we can go and change the logic in 9.4 and 9.3 later when, and if, we are confident that's the right thing to do. I am still not altogether convinced that it's a good idea, nor am I altogether convinced that this code is right. Perhaps it is, and if we consensus on it, fine. But regardless of that, we should not send a third major release to beta with the current broken system unless there is really no viable alternative. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2015-09-21 10:31:17 -0400, Robert Haas wrote: > On Sun, Jul 5, 2015 at 3:16 PM, Andres Freund <andres@anarazel.de> wrote: > >>On the other hand, in the common case, by the time we perform a > >>restartpoint, we're consistent: I think the main exception to that is > >>if we do a base backup that spans multiple checkpoints. I think that > >>in the new location, the chances that the legacy truncation is trying > >>to read inconsistent data is probably higher. > > > > The primary problem isn't that we truncate too early, it's that we delay truncation on the standby in comparison to theprimary by a considerable amount. All the while continuing to replay multi creations. > > > > I don't see the difference wrt. consistency right now, but I don't have access to the code right now. I mean we *have*to do something while inconsistent. A start/stop backup can easily span a day or four. > > So, where are we with this patch? Uh. I'd basically been waiting on further review and then forgot about it. > In my opinion, we ought to do something about master and 9.5 before > beta, so that we're doing *yet another* major release with unfixed > multixact bugs. Let's make the relevant truncation changes in master > and 9.5 and bump the WAL page magic, so that a 9.5alpha standby can't > be used with a 9.5beta master. Then, we don't need any of this legacy > truncation stuff at all, and 9.5 is hopefully in a much better state > than 9.4 and 9.3. Hm. > Now, that still potentially leaves 9.4 and 9.3 users hanging out to > dry. But we don't have a tremendous number of those people clamoring > about this, and if we get 9.5+ correct, then we can go and change the > logic in 9.4 and 9.3 later when, and if, we are confident that's the > right thing to do. I am still not altogether convinced that it's a > good idea, nor am I altogether convinced that this code is right. > Perhaps it is, and if we consensus on it, fine. To me the current logic is much worse than what's in the patch, so I don't think that's the best way to go. But I'm not not absolutely gung ho on that. > But regardless of that, we should not send a third major release to > beta with the current broken system unless there is really no viable > alternative. Agreed. I'll update the patch. Greetings, Andres Freund
On 09/21/2015 07:36 AM, Andres Freund wrote: > On 2015-09-21 10:31:17 -0400, Robert Haas wrote: >> So, where are we with this patch? > > Uh. I'd basically been waiting on further review and then forgot about > it. Does the current plan to never expire XIDs in 9.6 affect multixact truncation at all? -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On 2015-09-21 10:30:59 -0700, Josh Berkus wrote: > On 09/21/2015 07:36 AM, Andres Freund wrote: > > On 2015-09-21 10:31:17 -0400, Robert Haas wrote: > >> So, where are we with this patch? > > > > Uh. I'd basically been waiting on further review and then forgot about > > it. > > Does the current plan to never expire XIDs in 9.6 affect multixact > truncation at all? I doubt that it'd in a meaningful manner. Truncations will still need to happen to contain space usage. Besides, I'm pretty sceptical of shaping the design of bug fixes to suit some unwritten feature we only know the highest level design of as of yet. Greetings, Andres Freund
On 2015-07-02 11:52:04 -0400, Robert Haas wrote: > On Mon, Jun 29, 2015 at 3:48 PM, Andres Freund <andres@anarazel.de> wrote: > > New version attached. > > 0002 looks good, but the commit message should perhaps mention the > comment fix. Or commit that separately. I'm inclined to backpatch the applicable parts to 9.0 - seems pointless to have differing autovacuum_freeze_max_age values and the current value sucks for testing and space consumption there as well. Greetings, Andres Freund
On 2015-09-21 16:36:03 +0200, Andres Freund wrote: > Agreed. I'll update the patch. Here's updated patches against master. These include the "legacy" truncation support. There's no meaningful functional differences in this version except addressing the review comments that I agreed with, and a fair amount of additional polishing. I've not: * removed legacy truncation support * removed SimpleLruFlush() from find_multixact_start() - imo it's easier to reason about the system when the disk state is in sync with the in memory state. * removed the interval syntax from debug messages and xlogdump - they're a fair bit more concise and the target audience of those will be able to figure it out. * unsplit DetermineSafeOldestOffset & SetOffsetVacuumLimit - imo the separate functions don't make sense anymore now that limits and truncations aren't as separate anymore. What I've tested is the following: * continous burning of multis, both triggered via members and offsets * a standby keeping up when the primary is old * a standby keeping up when the primary is new * basebackups made while a new primary is under load * verified that we properly PANIC when a truncation record is replayed in an old standby. Does anybody have additional tests in mind? I plan to push 0002 fairly soon, it seemed to be pretty uncontroversial. I'll then work tomorrow afternoon on producing branch specific versions of 0003 and on producing 0004 removing all the legacy stuff for 9.5 & master. Greetings, Andres Freund
Attachment
On Tue, Sep 22, 2015 at 9:20 AM, Andres Freund <andres@anarazel.de> wrote: > On 2015-09-21 16:36:03 +0200, Andres Freund wrote: >> Agreed. I'll update the patch. > > Here's updated patches against master. These include the "legacy" > truncation support. There's no meaningful functional differences in this > version except addressing the review comments that I agreed with, and a > fair amount of additional polishing. 0002 looks fine. Regarding 0003, I'm still very much not convinced that it's a good idea to apply this to 9.3 and 9.4. This patch changes the way we do truncation in those older releases; instead of happening at a restartpoint, it happens when oldestMultiXid advances. I admit that I don't see a specific way that that can go wrong, but there are so many different old versions with slightly different multixact truncation behaviors that it seems very hard to be sure that we're not going to make things worse rather than better by introducing yet another approach to the problem. I realize that you disagree and will probably commit this to those branches anyway. But I want it to be clear that I don't endorse that. I wish more people were paying attention to these patches. These are critical data-corrupting bugs, the code in question is very tricky, it's been majorly revised multiple times, and we're revising it again. And nobody except me and Andres is looking at this, and I'm definitely not smart enough to get this all right. Other issues: - If SlruDeleteSegment fails in unlink(), shouldn't we at the very least log a message? If that file is still there when we loop back around, it's going to cause a failure, I think. Assorted minor nitpicking: - "happend" is misspelled in the commit message for 0003 - "in contrast to before" should have a comma after it, also in that commit message - "how far the next members wraparound is away" -> "how far away the next members wraparound is" - "seing" -> "seeing" - "Upgrade the primary," -> "Upgrade the primary;" - "toMultiXact" -> "to MultiXact" -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2015-09-22 13:38:58 -0400, Robert Haas wrote: > Regarding 0003, I'm still very much not convinced that it's a good > idea to apply this to 9.3 and 9.4. This patch changes the way we do > truncation in those older releases; instead of happening at a > restartpoint, it happens when oldestMultiXid advances. The primary reason for doing that is that doing it at restartpoints is simply *wrong*. Restartpoints aren't scheduled in sync with replay - which means that a restartpoint can (will actually) happen long long after the checkpoint from the primary has replayed. Which means that by the time the restartpoint is performed it's actually not unlikely that we've already filled all slru segments. Which is bad if we then fail over/start up. Aside from the more fundamental issue that restartpoints have to be "asynchronous" with respect to the checkpoint record for performance reasons, there's a bunch of additional reasons making this even more likely to occur: Differing checkpoint segments on the standby and pending actions (which we got rid off in 9.5+, but ...) > I realize that you disagree and will probably commit this to those > branches anyway. But I want it to be clear that I don't endorse that. I don't plan to commit/backpatch this over your objection. I do think it'd be the better approach, and I personally think that we're much more likely to introduce bugs if we backpatch this in a year. Which I think we'll end up having to. The longer people run on these branches, the more issues we'll see. > I wish more people were paying attention to these patches. +many > Other issues: > - If SlruDeleteSegment fails in unlink(), shouldn't we at the very > least log a message? If that file is still there when we loop back > around, it's going to cause a failure, I think. The existing unlink() call doesn't, that's the only reason I didn't add a message there. I'm fine with adding a (LOG or WARNING?) message. Greetings, Andres Freund
Robert Haas wrote: > Regarding 0003, I'm still very much not convinced that it's a good > idea to apply this to 9.3 and 9.4. This patch changes the way we do > truncation in those older releases; instead of happening at a > restartpoint, it happens when oldestMultiXid advances. I admit that I > don't see a specific way that that can go wrong, but there are so many > different old versions with slightly different multixact truncation > behaviors that it seems very hard to be sure that we're not going to > make things worse rather than better by introducing yet another > approach to the problem. I realize that you disagree and will > probably commit this to those branches anyway. But I want it to be > clear that I don't endorse that. Noted. I am not sure about changing things so invasively either TBH. The interactions of this stuff with other parts of the system are very complicated and it's easy to make a mistake that goes unnoticed until some weird scenario is run elsewhere. (Who would have thought that things would fail when a basebackup takes 12 hours to take and you have a custom preemptive tuple freeze script in crontab). > I wish more people were paying attention to these patches. These are > critical data-corrupting bugs, the code in question is very tricky, > it's been majorly revised multiple times, and we're revising it again. > And nobody except me and Andres is looking at this, and I'm definitely > not smart enough to get this all right. I'm also looking, and yes it's tricky. > Other issues: It would be good to pgindent the code before producing back-branch patches. I think some comments will get changed. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Sep 22, 2015 at 1:57 PM, Andres Freund <andres@anarazel.de> wrote: > On 2015-09-22 13:38:58 -0400, Robert Haas wrote: >> Regarding 0003, I'm still very much not convinced that it's a good >> idea to apply this to 9.3 and 9.4. This patch changes the way we do >> truncation in those older releases; instead of happening at a >> restartpoint, it happens when oldestMultiXid advances. > > The primary reason for doing that is that doing it at restartpoints is > simply *wrong*. Restartpoints aren't scheduled in sync with replay - > which means that a restartpoint can (will actually) happen long long > after the checkpoint from the primary has replayed. Which means that by > the time the restartpoint is performed it's actually not unlikely that > we've already filled all slru segments. Which is bad if we then fail > over/start up. 1. It would be possible to write a patch that included ONLY the changes needed to make that happen, and did nothing else. It would be largely a subset of this. If we want to change 9.3 and 9.4, I recommend we do that first, and then come back to the rest of this. 2. I agree that what we're doing right now is wrong. And I agree that this fixes a real problem. But it seems to me to be quite possible, even likely, that it will create other problems. For example, suppose that there are files in the data directory that precede oldestMultiXact. In the current approach, we'll remove those because they're not in the range we expect to be used. But in this approach we no longer remove everything we think shouldn't be there. We remove exactly the stuff we think should go away. As a general principle, that's clearly superior. But in the back-branches, it creates a risk: a leftover old file that doesn't get removed the first time through - for whatever reason - becomes a time bomb that will explode on the next wraparound. I don't know that that will happen. But I sure as heck don't know that won't happen with any combination of the variously broken 9.3.X releases we've put out there. Even if you can prove that particular risk never materializes to your satisfaction and mine, I will bet you a beer that there are other possible hazards neither of us is foreseeing right now. >> I realize that you disagree and will probably commit this to those >> branches anyway. But I want it to be clear that I don't endorse that. > > I don't plan to commit/backpatch this over your objection. I'm not in a position to demand that you take my advice, but I'm telling you what I think as honestly as I know how. To be clear, I am fully in favor of making these changes (without the legacy truncation stuff) in 9.5 and master, bumping WAL page magic so that we invalidate any 9.5 alpha standys. I think it's a far more solid approach than what we've got right now, and it clearly eliminates a host of dangers. In fact, I think it would be a pretty stupid idea not to make these changes in those branches. It would be doubling down on a design we know can never be made robust. But I do not have confidence that we can change 9.4 and especially 9.3 without knock-on consequences. You may have that confidence. I most definitely do not. My previous two rounds in the boxing ring with this problem convinced me that (1) it's incredibly easy to break things with well-intentioned changes in this area, (2) it's practically impossible to foresee everything that might go wrong with some screwy combination of versions, and (3) early 9.3.X releases are in much worse shape than early 9.4.X releases, to the point where guessing what any given variable is going to contain on 9.3.X is essentially throwing darts at the wall. That's an awfully challenging environment in which to write a bullet-proof patch. >> - If SlruDeleteSegment fails in unlink(), shouldn't we at the very >> least log a message? If that file is still there when we loop back >> around, it's going to cause a failure, I think. > > The existing unlink() call doesn't, that's the only reason I didn't add > a message there. I'm fine with adding a (LOG or WARNING?) message. Cool. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2015-09-22 14:52:49 -0400, Robert Haas wrote: > 1. It would be possible to write a patch that included ONLY the > changes needed to make that happen, and did nothing else. It would be > largely a subset of this. If we want to change 9.3 and 9.4, I > recommend we do that first, and then come back to the rest of this. I think that patch would be pretty much what I wrote. To be correct you basically have to: 1) Never skip a truncation on the standby. Otherwise there might have already have been wraparound and you read the completelywrong offset. 2) Always perform truncations on the standby exactly the same moment (in the replay stream) as on the primary. Otherwisethere also can be a wraparound. 3) Never read anything from an SLRU from the data directory while inconsistent. In an inconsistent state we can read completelywrong data. A standby can be inconsistent in many situations, including crashes, restarts and fresh base backups. To me these three together leave only the option to never read an SLRUs contents on a standby. That only leaves minor changes in the patch that could be removed afaics. I mean we could leave in DetermineSafeOldestOffset() but it'd be doing pretty much the same as SetOffsetVacuumLimit(). I think we put at least three layers on bandaid on this issue since 9.3.2, and each layer made things more complicated. We primarily did so because of the compatibility and complexity concerns. I think that was a bad mistake. We should have done it mostly right back then, and we'd be better of now. If we continue with bandaids on the back branches while having a fixed 9.5+ with significantly different behaviour we'll have a hellish time fixing things in the back branches. And introduce more bugs than this might introduce. > 2. I agree that what we're doing right now is wrong. And I agree that > this fixes a real problem. But it seems to me to be quite possible, > even likely, that it will create other problems. Possible. But I think those bugs will be just bugs and not more fundamental architectural problems. To be very clear. I'm scared of the idea of backpatching this. I'm more scared of doing that myself. But even more I am scared of the current state. > For example, suppose that there are files in the data directory that > precede oldestMultiXact. In the current approach, we'll remove those > because they're not in the range we expect to be used. Hm. For offsets/ we continue to use SimpleLruTruncate() for truncation, which scans the directory, so I don't see a problem. For members/ we won't - but neither do we really today, see SlruScanDirCbRemoveMembers(). So I don't think there'll be a significant difference? > a leftover old file that doesn't get removed the first time through - > for whatever reason - becomes a time bomb that will explode on the > next wraparound. I don't know that that will happen. We should be able to deal with that, otherwise recovery is pretty borked. It can be a problem for the 'recovery from wrong oldest multi' case, but that's the same today. > I will bet you a beer that there are other possible hazards neither of > us is foreseeing right now. Right. I'm not dismissing that. I just think it's much more likely to be handleable problems than the set we have today. It's incredibly hard to get an accurate mental model of the combined behaviour & state of primary and standby today. Even if we three have that today, I'm pretty sure we won't in half a year. And sure as hell nearly nobody else will have one. > >> - If SlruDeleteSegment fails in unlink(), shouldn't we at the very > >> least log a message? If that file is still there when we loop back > >> around, it's going to cause a failure, I think. > > > > The existing unlink() call doesn't, that's the only reason I didn't add > > a message there. I'm fine with adding a (LOG or WARNING?) message. > > Cool. Hm. When redoing a truncation during [crash] recovery that can cause a host of spurious warnings if already done before. DEBUG1 to avoid scaring users? Greetings, Andres Freund
On 2015-09-23 01:24:31 +0200, Andres Freund wrote: > I think we put at least three layers on bandaid on this issue since > 9.3.2, and each layer made things more complicated. 2a9b01928f193f529b885ac577051c4fd00bd427 - Cope with possible failure of the oldest MultiXact to exist. 5bbac7ec1b5754043e073a45454e4c257512ce30 - Advance the stop point for multixact offset creation only at checkpoint. 9a28c3752c89ec01fb8b28bb5904c6d547507fda - Have multixact be truncated by checkpoint, not vacuum 215ac4ad6589e0f6a31cc4cd867aedba3cd42924 - Truncate pg_multixact/'s contents during crash recovery At least these are closely related to the fact that truncation isn't WAL logged. There are more that are tangentially related. We (primarily me, writing the timewise first one) should have gone for a new WAL record from the start. We've discussed that in at least three of the threads around the above commits...
On Tue, Sep 22, 2015 at 7:45 PM, Andres Freund <andres@anarazel.de> wrote: > On 2015-09-23 01:24:31 +0200, Andres Freund wrote: >> I think we put at least three layers on bandaid on this issue since >> 9.3.2, and each layer made things more complicated. > > 2a9b01928f193f529b885ac577051c4fd00bd427 - Cope with possible failure of the oldest MultiXact to exist. > 5bbac7ec1b5754043e073a45454e4c257512ce30 - Advance the stop point for multixact offset creation only at checkpoint. > 9a28c3752c89ec01fb8b28bb5904c6d547507fda - Have multixact be truncated by checkpoint, not vacuum > 215ac4ad6589e0f6a31cc4cd867aedba3cd42924 - Truncate pg_multixact/'s contents during crash recovery > > At least these are closely related to the fact that truncation isn't WAL > logged. There are more that are tangentially related. We (primarily me, > writing the timewise first one) should have gone for a new WAL record > from the start. We've discussed that in at least three of the threads > around the above commits... I'm not disagreeing with any of that. I'm just disagreeing with you on the likelihood that we're going to make things better vs. making them worse. But, really, I've said everything I have to say about this. You have a commit bit. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2015-09-22 20:14:11 -0400, Robert Haas wrote: > I'm not disagreeing with any of that. I'm just disagreeing with you > on the likelihood that we're going to make things better vs. making > them worse. But, really, I've said everything I have to say about > this. You have a commit bit. I'm not going to push backpatch this to 9.3/9.4 without you being on board. For that I think you're unfortunately too often right, and this is too critical. But I'm also not going to develop an alternative stopgap for those versions, since I have no clue how that'd end up being better. The only alternative proposal I have right now is to push this to 9.5/9.6 (squashed with a followup patch removing legacy truncations) and then push the patch including legacy stuff to 9.3/4 after the next set of releases. Greetings, Andres Freund
> @@ -1210,8 +1211,14 @@ restart:; > (void) SlruScanDirectory(ctl, SlruScanDirCbDeleteCutoff, &cutoffPage); > } > > -void > -SlruDeleteSegment(SlruCtl ctl, char *filename) > +/* > + * Delete an individual SLRU segment, identified by the filename. > + * > + * NB: This does not touch the SLRU buffers themselves, callers have to ensure > + * they either can't yet contain anything, or have already been cleaned out. > + */ > +static void > +SlruInternalDeleteSegment(SlruCtl ctl, char *filename) > { > char path[MAXPGPATH]; > > @@ -1222,6 +1229,64 @@ SlruDeleteSegment(SlruCtl ctl, char *filename) > } > > /* > + * Delete an individual SLRU segment, identified by the segment number. > + */ > +void > +SlruDeleteSegment(SlruCtl ctl, int segno) Is it okay to change the ABI of SlruDeleteSegment? -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2015-09-23 10:29:09 -0300, Alvaro Herrera wrote: > > /* > > + * Delete an individual SLRU segment, identified by the segment number. > > + */ > > +void > > +SlruDeleteSegment(SlruCtl ctl, int segno) > > Is it okay to change the ABI of SlruDeleteSegment? I think so. Any previous user of the API is going to be currently broken anyway due to the missing flushing of buffers. Andres
The comment on top of TrimMultiXact states that "no locks are needed here", but then goes on to grab a few locks. I think we should remove the comment, or rephrase it to state that we still grab them for consistency or whatever; or perhaps even remove the lock acquisitions. (I think the comment is still true: by the time TrimMultiXact runs, we're out of recovery but not yet running, so it's not possible for anyone to try to do anything multixact-related.) I wonder if it would be cleaner to move the setting of finishedStartup down to just before calling SetMultiXactIdLimit, instead of at the top of the function. It's a bit odd that SetMultiXactIdLimit has the "finishedStartup" test so low. Why bother setting all those local variables only to bail out? I think it would make more sense to just do it at the top. The only thing you lose AFAICS is that elog(DEBUG1) message -- is that worth it? Also, the fact that finishedStartup itself is read without a lock at least merits a comment. In MultiXactAdvanceOldest, the test for sawTruncationinCkptCycle seems reversed? if (!MultiXactState->sawTruncationInCkptCycle) surely we should be doing truncation if it's set? Honestly, I wonder whether this message ereport(LOG, (errmsg("performing legacy multixact truncation"), errdetail("Legacy truncations are sometimes performed when replaying WAL from an older primary."), errhint("Upgrade the primary, it is susceptible to data corruption."))); shouldn't rather be a PANIC. (The main reason not to, I think, is that once you see this, there is no way to put the standby in a working state without recloning). I think the prevOldestOffsetKnown test in line 2667 ("if we failed to get ...") is better expressed as an else-if of the previous "if" block. I think the two "there are NO MultiXacts" cases in TruncateMultiXact would benefit in readability from adding braces around the lone statement (and moving the comment to the line prior). If the find_multixact_start(oldestMulti) call in TruncateMultiXact fails, what recourse does the user have? I wonder if the elog() should be a FATAL instead of just LOG. It's not like it would work on a subsequent run, is it? -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2015-09-23 15:03:05 -0300, Alvaro Herrera wrote: > The comment on top of TrimMultiXact states that "no locks are needed > here", but then goes on to grab a few locks. Hm. Yea. Although that was the case before. > It's a bit odd that SetMultiXactIdLimit has the "finishedStartup" test > so low. Why bother setting all those local variables only to bail > out? Hm. Doesn't seem to matter much to me, but I can change it. > In MultiXactAdvanceOldest, the test for sawTruncationinCkptCycle seems > reversed? > if (!MultiXactState->sawTruncationInCkptCycle) > surely we should be doing truncation if it's set? No, that's correct. If there was a checkpoint cycle where oldestMulti advanced without seing a truncation record we need to perform a legacy truncation. > Honestly, I wonder whether this message > ereport(LOG, > (errmsg("performing legacy multixact truncation"), > errdetail("Legacy truncations are sometimes performed when replaying WAL from an older primary."), > errhint("Upgrade the primary, it is susceptible to data corruption."))); > shouldn't rather be a PANIC. (The main reason not to, I think, is that > once you see this, there is no way to put the standby in a working state > without recloning). Huh? The behaviour in that case is still better than what we have in 9.3+ today (not delayed till the restartpoint). Don't see why that should be a panic. That'd imo make it pretty much impossible to upgrade a pair of primary/master where you normally upgrade the standby first? This is all moot given Robert's objection to backpatching this to 9.3/4. > If the find_multixact_start(oldestMulti) call in TruncateMultiXact > fails, what recourse does the user have? I wonder if the elog() should > be a FATAL instead of just LOG. It's not like it would work on a > subsequent run, is it? It currently only LOGs, I don't want to change that. The cases where we currently know it's possible to hit this, it should be fixed by the next set of emergency autovacuums (which we trigger). Thanks for the look, Andres
Andres Freund wrote: > On 2015-09-23 15:03:05 -0300, Alvaro Herrera wrote: > > Honestly, I wonder whether this message > > ereport(LOG, > > (errmsg("performing legacy multixact truncation"), > > errdetail("Legacy truncations are sometimes performed when replaying WAL from an older primary."), > > errhint("Upgrade the primary, it is susceptible to data corruption."))); > > shouldn't rather be a PANIC. (The main reason not to, I think, is that > > once you see this, there is no way to put the standby in a working state > > without recloning). > > Huh? The behaviour in that case is still better than what we have in > 9.3+ today (not delayed till the restartpoint). Don't see why that > should be a panic. That'd imo make it pretty much impossible to upgrade > a pair of primary/master where you normally upgrade the standby first? > > This is all moot given Robert's objection to backpatching this to > 9.3/4. I think we need to make a decision here. Is this a terribly serious bug/misdesign that needs addressing? If so, we need to backpatch. If not, then by all means lets leave it alone. I don't think it is a good idea to leave it open if we think it's serious, which is what I think is happening. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2015-09-23 15:57:02 -0300, Alvaro Herrera wrote: > I think we need to make a decision here. Is this a terribly serious > bug/misdesign that needs addressing? Imo yes. Not sure about terribly, but definitely serious. It's several data loss bugs in one package. > If so, we need to backpatch. If not, then by all means lets leave it > alone. I don't think it is a good idea to leave it open if we think > it's serious, which is what I think is happening. Right, but I don't want to backpatch this over an objection, and it doesn't seem like I have a chance to convince Robert that it'd be a good idea. So it'll be 9.5+master for now. Greetings, Andres Freund
Hi, On 2015-09-23 15:03:05 -0300, Alvaro Herrera wrote: > I wonder if it would be cleaner to move the setting of finishedStartup > down to just before calling SetMultiXactIdLimit, instead of at the top > of the function. Done. I don't think it makes much of a difference, but there's really no reason not to change it. > It's a bit odd that SetMultiXactIdLimit has the "finishedStartup" test > so low. Why bother setting all those local variables only to bail > out? But we do more than set local variables? We actually set some in-memory limits (MultiXactState->multiVacLimit et al). What we can't do is to startup the members wraparound protection because that requires accessing the SLRUs. Perhaps we should, independently of this patch really, rename SetOffsetVacuumLimit() - it may be rather confusing that it actually is about members/? The current name is correct, but also a bit confusing. ComputeMembersVacuumLimits()? > In MultiXactAdvanceOldest, the test for sawTruncationinCkptCycle seems > reversed? > if (!MultiXactState->sawTruncationInCkptCycle) > surely we should be doing truncation if it's set? I wanted to add a comment explaining this, but the existing job seems to do a fair job at that: /* * If there has been a truncation on the master, detected by seeing a * moving oldestMulti, without a corresponding truncation record, we * know that the primary is still running an older version of postgres * that doesn't yet log multixact truncations. So perform the * truncation ourselves. */ I've done some additional comment smithing. Attached is 0002 (prev 0003) including the legacy truncation support, and 0003 removing that and bumping page magic. I'm slightly inclined to commit them separately (to 9.5 & master) so that we have something to backpatch from. Greetings, Andres Freund
Attachment
Hi, I pushed this to 9.5 and master, committing the xlog page magic bump separately. To avoid using a magic value from master in 9.5 I bumped the numbers by two in both branches. Should this get a release note entry given that we're not (at least immediately) backpatching this? Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > Should this get a release note entry given that we're not (at least > immediately) backpatching this? I'll probably put something in when I update the release notes for beta1 (next week sometime); no real need to deal with it individually. regards, tom lane
On 9/23/15 1:48 PM, Andres Freund wrote: >> Honestly, I wonder whether this message >> > ereport(LOG, >> > (errmsg("performing legacy multixact truncation"), >> > errdetail("Legacy truncations are sometimes performed when replaying WAL from an older primary."), >> > errhint("Upgrade the primary, it is susceptible to data corruption."))); >> >shouldn't rather be a PANIC. (The main reason not to, I think, is that >> >once you see this, there is no way to put the standby in a working state >> >without recloning). > Huh? The behaviour in that case is still better than what we have in > 9.3+ today (not delayed till the restartpoint). Don't see why that > should be a panic. That'd imo make it pretty much impossible to upgrade > a pair of primary/master where you normally upgrade the standby first? IMHO doing just a log of something this serious; it should at least be a WARNING. I think the concern about upgrading a replica before the master is valid; is there some way we could over-ride a PANIC when that's exactly what someone is trying to do? Check for a special file maybe? + bool sawTruncationInCkptCycle; What happens if someone downgrades the master, back to a version that no longer logs truncation? (I don't think assuming that the replica will need to restart if that happens is a safe bet...) - if (MultiXactIdPrecedes(oldestMXact, earliest)) + /* If there's nothing to remove, we can bail out early. */ + if (MultiXactIdPrecedes(oldestMulti, earliest)) { - DetermineSafeOldestOffset(oldestMXact); + LWLockRelease(MultiXactTruncationLock); If/when this is backpatched, would it be safer to just leave this alone? -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com
On 2015-09-27 14:21:08 -0500, Jim Nasby wrote: > IMHO doing just a log of something this serious; it should at least be a > WARNING. In postgres LOG, somewhat confusingly, is more severe than WARNING. > I think the concern about upgrading a replica before the master is valid; is > there some way we could over-ride a PANIC when that's exactly what someone > is trying to do? Check for a special file maybe? I don't understand this concern - that's just the situation we have in all released branches today. > + bool sawTruncationInCkptCycle; > What happens if someone downgrades the master, back to a version that no > longer logs truncation? (I don't think assuming that the replica will need > to restart if that happens is a safe bet...) It'll just to do legacy truncation again - without a restart on the standby required. > - if (MultiXactIdPrecedes(oldestMXact, earliest)) > + /* If there's nothing to remove, we can bail out early. */ > + if (MultiXactIdPrecedes(oldestMulti, earliest)) > { > - DetermineSafeOldestOffset(oldestMXact); > + LWLockRelease(MultiXactTruncationLock); > If/when this is backpatched, would it be safer to just leave this alone? What do you mean? This can't just isolated be left alone?
On 9/27/15 2:25 PM, Andres Freund wrote: > On 2015-09-27 14:21:08 -0500, Jim Nasby wrote: >> IMHO doing just a log of something this serious; it should at least be a >> WARNING. > > In postgres LOG, somewhat confusingly, is more severe than WARNING. Ahh, right. Which in this case stinks, because WARNING is a lot more attention grabbing than LOG. :/ >> I think the concern about upgrading a replica before the master is valid; is >> there some way we could over-ride a PANIC when that's exactly what someone >> is trying to do? Check for a special file maybe? > > I don't understand this concern - that's just the situation we have in > all released branches today. There was discussion about making this a PANIC instead of a LOG, which I think is a good idea... but then there'd need to be some way to not PANIC if you were doing an upgrade. >> + bool sawTruncationInCkptCycle; >> What happens if someone downgrades the master, back to a version that no >> longer logs truncation? (I don't think assuming that the replica will need >> to restart if that happens is a safe bet...) > > It'll just to do legacy truncation again - without a restart on the > standby required. Oh, I thought once that was set it would stay set. NM. >> - if (MultiXactIdPrecedes(oldestMXact, earliest)) >> + /* If there's nothing to remove, we can bail out early. */ >> + if (MultiXactIdPrecedes(oldestMulti, earliest)) >> { >> - DetermineSafeOldestOffset(oldestMXact); >> + LWLockRelease(MultiXactTruncationLock); >> If/when this is backpatched, would it be safer to just leave this alone? > > What do you mean? This can't just isolated be left alone? I thought removing DetermineSafeOldestOffset was just an optimization, but I guess I was confused. -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com
On Mon, Sep 28, 2015 at 5:47 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote: > There was discussion about making this a PANIC instead of a LOG, which I > think is a good idea... but then there'd need to be some way to not PANIC if > you were doing an upgrade. I think you're worrying about a non-problem. This code has not been back-patched prior to 9.5, and the legacy truncation code has been removed in 9.5+. So it's a complete non-issue right at the moment. If at some point we back-patch this further, then it potentially becomes a live issue, but I would like to respectfully inquire what exactly you think making it a PANIC would accomplish? There are a lot of scary things about this patch, but the logic for deciding whether to perform a legacy truncation is solid as far as I know. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 9/28/15 8:49 PM, Robert Haas wrote: > If at some point we back-patch this further, then it potentially > becomes a live issue, but I would like to respectfully inquire what > exactly you think making it a PANIC would accomplish? There are a lot > of scary things about this patch, but the logic for deciding whether > to perform a legacy truncation is solid as far as I know. Maybe I'm confused, but I thought the whole purpose of this was to get rid of the risk associated with that calculation in favor of explicit truncation boundaries in the WAL log. Even if that's not the case, ISTM that being big and in your face about a potential data corruption bug is a good thing, as long as the DBA has a way to "hit the snooze button". Either way, I'm not going to make a fuss over it. Just to make sure we're on the same page; Alvaro's original comment was: > Honestly, I wonder whether this message > ereport(LOG, > (errmsg("performing legacy multixact truncation"), > errdetail("Legacy truncations are sometimes performed when replaying WAL from an older primary."), > errhint("Upgrade the primary, it is susceptible to data corruption."))); > shouldn't rather be a PANIC. (The main reason not to, I think, is that > once you see this, there is no way to put the standby in a working state > without recloning). -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com
On 2015-09-28 21:48:00 -0500, Jim Nasby wrote: > On 9/28/15 8:49 PM, Robert Haas wrote: > >If at some point we back-patch this further, then it potentially > >becomes a live issue, but I would like to respectfully inquire what > >exactly you think making it a PANIC would accomplish? There are a lot > >of scary things about this patch, but the logic for deciding whether > >to perform a legacy truncation is solid as far as I know. > > Maybe I'm confused, but I thought the whole purpose of this was to get rid > of the risk associated with that calculation in favor of explicit truncation > boundaries in the WAL log. > Even if that's not the case, ISTM that being big and in your face about a > potential data corruption bug is a good thing, as long as the DBA has a way > to "hit the snooze button". So we'd end up with a guc that everyone has to set while they upgrade. That seems like a pointless hassle. Greetings, Andres Freund
On Mon, Sep 28, 2015 at 10:48 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote: > Maybe I'm confused, but I thought the whole purpose of this was to get rid > of the risk associated with that calculation in favor of explicit truncation > boundaries in the WAL log. Yes. But if the master hasn't been updated yet, then we still need to do something based on a calculation. > Even if that's not the case, ISTM that being big and in your face about a > potential data corruption bug is a good thing, as long as the DBA has a way > to "hit the snooze button". Panicking the standby because the master hasn't been updated does not seem like a good thing to me in any way. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Sep 22, 2015 at 3:20 PM, Andres Freund <andres@anarazel.de> wrote: > What I've tested is the following: > * continous burning of multis, both triggered via members and offsets > * a standby keeping up when the primary is old > * a standby keeping up when the primary is new > * basebackups made while a new primary is under load > * verified that we properly PANIC when a truncation record is replayed > in an old standby. Are these test scripts available somewhere? I understand they might be undocumented and perhaps tricky to set it all up, but I would be very interested in them anyway, think you could push them somewhere? Thanks a lot for working on this!
Robert Haas wrote: > On Mon, Sep 28, 2015 at 10:48 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote: > > Maybe I'm confused, but I thought the whole purpose of this was to get rid > > of the risk associated with that calculation in favor of explicit truncation > > boundaries in the WAL log. > > Yes. But if the master hasn't been updated yet, then we still need to > do something based on a calculation. Right. > > Even if that's not the case, ISTM that being big and in your face about a > > potential data corruption bug is a good thing, as long as the DBA has a way > > to "hit the snooze button". > > Panicking the standby because the master hasn't been updated does not > seem like a good thing to me in any way. If we had a way to force the master to upgrade, I think it would be good because we have a mechanism to get rid of the legacy truncation code; but as I said several messages ago this doesn't actually work which is why I dropped the idea of panicking. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
I'm several days into a review of this change (commits 4f627f8 and aa29c1c). There's one part of the design I want to understand before commenting on specific code. What did you anticipate to be the consequences of failing to remove SLRU segment files that MultiXactState->oldestMultiXactId implies we should have removed? I ask because, on the one hand, I see code making substantial efforts to ensure that it truncates exactly as planned: /* * Do truncation, and the WAL logging of the truncation, in a critical * section. That way offsets/members cannot get outof sync anymore, i.e. * once consistent the newOldestMulti will always exist in members, even * if we crashed in the wrongmoment. */START_CRIT_SECTION(); /* * Prevent checkpoints from being scheduled concurrently. This is critical * because otherwise a truncation record mightnot be replayed after a * crash/basebackup, even though the state of the data directory would * require it. */Assert(!MyPgXact->delayChkpt);MyPgXact->delayChkpt= true; .../* * Update in-memory limits before performing the truncation, while inside * the critical section: Have to do it beforetruncation, to prevent * concurrent lookups of those values. Has to be inside the critical * section as otherwise afuture call to this function would error out, * while looking up the oldest member in offsets, if our caller crashes * beforeupdating the limits. */ On the other hand, TruncateMultiXact() callees ignore unlink() return values: On Tue, Sep 22, 2015 at 07:57:27PM +0200, Andres Freund wrote: > On 2015-09-22 13:38:58 -0400, Robert Haas wrote: > > - If SlruDeleteSegment fails in unlink(), shouldn't we at the very > > least log a message? If that file is still there when we loop back > > around, it's going to cause a failure, I think. > > The existing unlink() call doesn't, that's the only reason I didn't add > a message there. I'm fine with adding a (LOG or WARNING?) message. Unlinking old pg_clog files is strictly an optimization. If you were to comment out every unlink() call in slru.c, the only ill effect on CLOG is the waste of disk space. Is the same true of MultiXact? If there's anyplace where failure to unlink would cause a malfunction, I think it would be around the use of SlruScanDirCbFindEarliest(). That function's result becomes undefined if the range of pg_multixact/offsets segment files present on disk spans more than about INT_MAX/2 MultiXactId. Thanks, nm
Hi, On 2015-10-24 22:07:00 -0400, Noah Misch wrote: > I'm several days into a review of this change (commits 4f627f8 and > aa29c1c). Cool! > There's one part of the design I want to understand before commenting on > specific code. What did you anticipate to be the consequences of failing to > remove SLRU segment files that MultiXactState->oldestMultiXactId implies we > should have removed? I ask because, on the one hand, I see code making > substantial efforts to ensure that it truncates exactly as planned: > [portion of TruncateMultiXact] The reason we can't have checkpoints there, is that checkpoints records multixact values in the checkpoint record. If we crash-restart before the truncation has finished we can end up in the situation that ->oldestMultiXactId doesn't exist. Which will trigger a round of emergency vacuum at the next startup, not something that should happen due to a concurrency problem. We could instead update the in-memory values first, but that could lead to other problems. So the critical section/delaying of checkpoints is more about having the on-disk agreeing with the status data in the checkpoint/control file. > On Tue, Sep 22, 2015 at 07:57:27PM +0200, Andres Freund wrote: > > On 2015-09-22 13:38:58 -0400, Robert Haas wrote: > > > - If SlruDeleteSegment fails in unlink(), shouldn't we at the very > > > least log a message? If that file is still there when we loop back > > > around, it's going to cause a failure, I think. > > > > The existing unlink() call doesn't, that's the only reason I didn't add > > a message there. I'm fine with adding a (LOG or WARNING?) message. Note that I didn't add the warning after all, as it'd be too noisy during repeated replay, as all the files would already be gone. We could only emit it when the error is not ENOFILE, if people prefer that. > Unlinking old pg_clog files is strictly an optimization. If you were to > comment out every unlink() call in slru.c, the only ill effect on CLOG is the > waste of disk space. Is the same true of MultiXact? Well, multixacts are a lot larger than the other SLRUs, I think that makes some sort of difference. Thanks, Andres
On 10/27/2015 07:44 AM, Andres Freund wrote: >> Unlinking old pg_clog files is strictly an optimization. If you were to >> > comment out every unlink() call in slru.c, the only ill effect on CLOG is the >> > waste of disk space. Is the same true of MultiXact? > Well, multixacts are a lot larger than the other SLRUs, I think that > makes some sort of difference. And by "a lot larger" we're talking like 50X to 100X. I regularly see pg_multixact directories larger than 1GB. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Tue, Oct 27, 2015 at 03:44:10PM +0100, Andres Freund wrote: > On 2015-10-24 22:07:00 -0400, Noah Misch wrote: > > On Tue, Sep 22, 2015 at 07:57:27PM +0200, Andres Freund wrote: > > > On 2015-09-22 13:38:58 -0400, Robert Haas wrote: > > > > - If SlruDeleteSegment fails in unlink(), shouldn't we at the very > > > > least log a message? If that file is still there when we loop back > > > > around, it's going to cause a failure, I think. > > > > > > The existing unlink() call doesn't, that's the only reason I didn't add > > > a message there. I'm fine with adding a (LOG or WARNING?) message. > > Note that I didn't add the warning after all, as it'd be too noisy > during repeated replay, as all the files would already be gone. We could > only emit it when the error is not ENOFILE, if people prefer that. > > > > Unlinking old pg_clog files is strictly an optimization. If you were to > > comment out every unlink() call in slru.c, the only ill effect on CLOG is the > > waste of disk space. Is the same true of MultiXact? > > Well, multixacts are a lot larger than the other SLRUs, I think that > makes some sort of difference. That helps; thanks. Your design seems good. I've located only insipid defects. I propose to save some time by writing a patch series eliminating them, which you could hopefully review. Does that sound good?
Hi, On October 29, 2015 7:59:03 AM GMT+01:00, Noah Misch <noah@leadboat.com> wrote: >On Tue, Oct 27, 2015 at 03:44:10PM +0100, Andres Freund wrote: >> On 2015-10-24 22:07:00 -0400, Noah Misch wrote: >> > On Tue, Sep 22, 2015 at 07:57:27PM +0200, Andres Freund wrote: >> > > On 2015-09-22 13:38:58 -0400, Robert Haas wrote: >> > > > - If SlruDeleteSegment fails in unlink(), shouldn't we at the >very >> > > > least log a message? If that file is still there when we loop >back >> > > > around, it's going to cause a failure, I think. >> > > >> > > The existing unlink() call doesn't, that's the only reason I >didn't add >> > > a message there. I'm fine with adding a (LOG or WARNING?) >message. >> >> Note that I didn't add the warning after all, as it'd be too noisy >> during repeated replay, as all the files would already be gone. We >could >> only emit it when the error is not ENOFILE, if people prefer that. >> >> >> > Unlinking old pg_clog files is strictly an optimization. If you >were to >> > comment out every unlink() call in slru.c, the only ill effect on >CLOG is the >> > waste of disk space. Is the same true of MultiXact? >> >> Well, multixacts are a lot larger than the other SLRUs, I think that >> makes some sort of difference. > >That helps; thanks. Your design seems good. I've located only insipid >defects. Great! > I propose to save some time by writing a patch series >eliminating >them, which you could hopefully review. Does that sound good? Yes, it does. Andres --- Please excuse brevity and formatting - I am writing this on my mobile phone.
On Thu, Oct 29, 2015 at 08:46:52AM +0100, Andres Freund wrote: > On October 29, 2015 7:59:03 AM GMT+01:00, Noah Misch <noah@leadboat.com> wrote: > >That helps; thanks. Your design seems good. I've located only insipid > >defects. > > Great! > > > I propose to save some time by writing a patch series > >eliminating > >them, which you could hopefully review. Does that sound good? > > Yes, it does. I have pushed a stack of branches to https://github.com/nmisch/postgresql.git: mxt0-revert - reverts commits 4f627f8 and aa29c1c mxt1-disk-independent - see below mxt2-cosmetic - update already-wrong comments and formatting mxt3-main - replaces commit 4f627f8 mxt4-rm-legacy - replaces commit aa29c1c The plan is to squash each branch into one PostgreSQL commit. In addition to examining overall "git diff mxt2-cosmetic mxt3-main", I recommend reviewing itemized changes and commit log entries in "git log -p --reverse --no-merges mxt2-cosmetic..mxt3-main". In particular, when a change involves something you discussed upthread or was otherwise not obvious, I put a statement of rationale in the commit log. > + * Make sure to only attempt truncation if there's values to truncate > + * away. In normal processing values shouldn't go backwards, but there's > + * some corner cases (due to bugs) where that's possible. Which bugs are those? I would like to include more detail if available. If anything here requires careful study, it's the small mxt1-disk-independent change, which I have also attached in patch form. That patch removes the SlruScanDirCbFindEarliest() test from TruncateMultiXact(), which in turn makes multixact control decisions independent of whether TruncateMultiXact() is successful at unlinking segments. Today, one undeletable segment file can cause TruncateMultiXact() to give up on truncation completely for a span of hundreds of millions of MultiXactId. Patched multixact.c will, like CLOG, make its decisions strictly based on the content of files it expects to exist. It will no longer depend on the absence of files it hopes will not exist. To aid in explaining the change's effects, I will define some terms. A "logical wrap" occurs when no range of 2^31 integers covers the set of MultiXactId stored in xmax fields. A "file-level wrap" occurs when there exists a pair of pg_multixact/offsets segment files such that: | segno_a * SLRU_PAGES_PER_SEGMENT * MULTIXACT_OFFSETS_PER_PAGE - segno_b * SLRU_PAGES_PER_SEGMENT * MULTIXACT_OFFSETS_PER_PAGE | > 2^31 A logical wrap implies either a file-level wrap or missing visibility metadata, but a file-level wrap does not imply other consequences. The SlruScanDirCbFindEarliest() test is usually redundant with find_multixact_start(), because MultiXactIdPrecedes(oldestMXact, earliest) almost implies that find_multixact_start() will fail. The exception arises when pg_multixact/offsets files compose a file-level wrap, which can happen when TruncateMultiXact() fails to unlink segments as planned. When it does happen, the result of SlruScanDirCbFindEarliest(), and therefore the computed "earliest" value, is undefined. (This outcome is connected to our requirement to use only half the pg_clog or pg_multixact/offsets address space at any one time. The PagePrecedes callbacks for these SLRUs cease to be transitive if more than half their address space is in use.) The SlruScanDirCbFindEarliest() test can be helpful when a file-level wrap coexists with incorrect oldestMultiXactId (extant xmax values define "correct oldestMultiXactId"). If we're lucky with readdir() order, the test will block truncation so we don't delete still-needed segments. I am content to lose that, because (a) the code isn't reliable for (or even directed toward) that purpose and (b) sites running on today's latest point releases no longer have incorrect oldestMultiXactId.
Attachment
On November 8, 2015 12:54:07 AM PST, Noah Misch <noah@leadboat.com> wrote: >I have pushed a stack of branches to >https://github.com/nmisch/postgresql.git: > >mxt0-revert - reverts commits 4f627f8 and aa29c1c >mxt1-disk-independent - see below >mxt2-cosmetic - update already-wrong comments and formatting >mxt3-main - replaces commit 4f627f8 >mxt4-rm-legacy - replaces commit aa29c1c > >The plan is to squash each branch into one PostgreSQL commit. In >addition to >examining overall "git diff mxt2-cosmetic mxt3-main", I recommend >reviewing >itemized changes and commit log entries in "git log -p --reverse >--no-merges >mxt2-cosmetic..mxt3-main". In particular, when a change involves >something >you discussed upthread or was otherwise not obvious, I put a statement >of >rationale in the commit log. I'm not following along right now - in order to make cleanups the plan is to revert a couple commits and then redo them prettyfied? Andres Hi --- Please excuse brevity and formatting - I am writing this on my mobile phone.
On Sun, Nov 08, 2015 at 11:11:42AM -0800, Andres Freund wrote: > On November 8, 2015 12:54:07 AM PST, Noah Misch <noah@leadboat.com> wrote: > > >I have pushed a stack of branches to > >https://github.com/nmisch/postgresql.git: > > > >mxt0-revert - reverts commits 4f627f8 and aa29c1c > >mxt1-disk-independent - see below > >mxt2-cosmetic - update already-wrong comments and formatting > >mxt3-main - replaces commit 4f627f8 > >mxt4-rm-legacy - replaces commit aa29c1c > > > >The plan is to squash each branch into one PostgreSQL commit. In > >addition to > >examining overall "git diff mxt2-cosmetic mxt3-main", I recommend > >reviewing > >itemized changes and commit log entries in "git log -p --reverse > >--no-merges > >mxt2-cosmetic..mxt3-main". In particular, when a change involves > >something > >you discussed upthread or was otherwise not obvious, I put a statement > >of > >rationale in the commit log. > > I'm not following along right now - in order to make cleanups the plan is to revert a couple commits and then redo themprettyfied? Yes, essentially. Given the volume of updates, this seemed neater than framing those updates as in-tree incremental development.
On November 8, 2015 11:52:05 AM PST, Noah Misch <noah@leadboat.com> wrote: >On Sun, Nov 08, 2015 at 11:11:42AM -0800, Andres Freund wrote: >> On November 8, 2015 12:54:07 AM PST, Noah Misch <noah@leadboat.com> >wrote: >> >> >I have pushed a stack of branches to >> >https://github.com/nmisch/postgresql.git: >> > >> >mxt0-revert - reverts commits 4f627f8 and aa29c1c >> >mxt1-disk-independent - see below >> >mxt2-cosmetic - update already-wrong comments and formatting >> >mxt3-main - replaces commit 4f627f8 >> >mxt4-rm-legacy - replaces commit aa29c1c >> > >> >The plan is to squash each branch into one PostgreSQL commit. In >> >addition to >> >examining overall "git diff mxt2-cosmetic mxt3-main", I recommend >> >reviewing >> >itemized changes and commit log entries in "git log -p --reverse >> >--no-merges >> >mxt2-cosmetic..mxt3-main". In particular, when a change involves >> >something >> >you discussed upthread or was otherwise not obvious, I put a >statement >> >of >> >rationale in the commit log. >> >> I'm not following along right now - in order to make cleanups the >plan is to revert a couple commits and then redo them prettyfied? > >Yes, essentially. Given the volume of updates, this seemed neater than >framing those updates as in-tree incremental development. I don't like that plan. I don't have a problem doing that in some development branch somewhere, but I fail to see any benefitdoing that in 9.5/master. It'll just make the history more convoluted for no benefit. I'll obviously still review the changes. Even for review it's nor particularly convenient, because now the entirety of the large changes essentially needs to be reviewedanew, given they're not the same. --- Please excuse brevity and formatting - I am writing this on my mobile phone.
On Sun, Nov 08, 2015 at 11:59:33AM -0800, Andres Freund wrote: > On November 8, 2015 11:52:05 AM PST, Noah Misch <noah@leadboat.com> wrote: > >On Sun, Nov 08, 2015 at 11:11:42AM -0800, Andres Freund wrote: > >> On November 8, 2015 12:54:07 AM PST, Noah Misch <noah@leadboat.com> wrote: > >> > >> >I have pushed a stack of branches to > >> >https://github.com/nmisch/postgresql.git: > >> > > >> >mxt0-revert - reverts commits 4f627f8 and aa29c1c > >> >mxt1-disk-independent - see below > >> >mxt2-cosmetic - update already-wrong comments and formatting > >> >mxt3-main - replaces commit 4f627f8 > >> >mxt4-rm-legacy - replaces commit aa29c1c > >> > > >> >The plan is to squash each branch into one PostgreSQL commit. In > >> >addition to > >> >examining overall "git diff mxt2-cosmetic mxt3-main", I recommend > >> >reviewing > >> >itemized changes and commit log entries in "git log -p --reverse > >> >--no-merges > >> >mxt2-cosmetic..mxt3-main". In particular, when a change involves > >> >something > >> >you discussed upthread or was otherwise not obvious, I put a > >statement > >> >of > >> >rationale in the commit log. > >> > >> I'm not following along right now - in order to make cleanups the > >plan is to revert a couple commits and then redo them prettyfied? > > > >Yes, essentially. Given the volume of updates, this seemed neater than > >framing those updates as in-tree incremental development. > > I don't like that plan. I don't have a problem doing that in some development branch somewhere, but I fail to see any benefitdoing that in 9.5/master. It'll just make the history more convoluted for no benefit. > > I'll obviously still review the changes. Cleanliness of history is precisely why I did it this way. If I had framed the changes as in-tree incremental development, no one "git diff" command would show the truncation rework or a coherent subset. To review the whole, students of this code might resort to a cherry-pick of the repair commits onto aa29c1c. That, too, proves dissatisfying; the history would nowhere carry a finished version of legacy truncation support. A hacker opting to back-patch in the future, as commit 4f627f8 contemplated, would need to dig through this thread for the bits added in mxt3-main and removed in mxt4-rm-legacy. The benefits may become clearer as you continue to review the branches. > Even for review it's nor particularly convenient, because now the entirety of the large changes essentially needs to bereviewed anew, given they're not the same. Agreed; I optimized for future readers, and I don't doubt this is less convenient for you and for others already familiar with commits 4f627f8 and aa29c1c. I published branches, not squashed patches, mostly because I think the individual branch commits will facilitate your study of the changes. I admit the cost to you remains high.
On Sun, Nov 8, 2015 at 11:52 AM, Noah Misch <noah@leadboat.com> wrote: >> I'm not following along right now - in order to make cleanups the plan is to revert a couple commits and then redo themprettyfied? > > Yes, essentially. Given the volume of updates, this seemed neater than > framing those updates as in-tree incremental development. I think that's an odd way of representing this work. I tend to remember roughly when major things were committed even years later. An outright revert should represent a total back out of the original commit IMV. Otherwise, a git blame can be quite misleading. I can imagine questioning my recollection, even when it is accurate, if only because I don't expect this. -- Peter Geoghegan
On Mon, Nov 23, 2015 at 11:44:45AM -0800, Peter Geoghegan wrote: > On Sun, Nov 8, 2015 at 11:52 AM, Noah Misch <noah@leadboat.com> wrote: > >> I'm not following along right now - in order to make cleanups the plan is to revert a couple commits and then redo themprettyfied? > > > > Yes, essentially. Given the volume of updates, this seemed neater than > > framing those updates as in-tree incremental development. > > I think that's an odd way of representing this work. I tend to > remember roughly when major things were committed even years later. An > outright revert should represent a total back out of the original > commit IMV. Otherwise, a git blame can be quite misleading. I think you're saying that "clearer git blame" is a more-important reason than "volume of updates" for preferring an outright revert over in-tree incremental development. Fair preference. If that's a correct reading of your message, then we do agree on the bottom line.
On Fri, Nov 27, 2015 at 5:16 PM, Noah Misch <noah@leadboat.com> wrote: > On Mon, Nov 23, 2015 at 11:44:45AM -0800, Peter Geoghegan wrote: >> On Sun, Nov 8, 2015 at 11:52 AM, Noah Misch <noah@leadboat.com> wrote: >> >> I'm not following along right now - in order to make cleanups the plan is to revert a couple commits and then redothem prettyfied? >> > >> > Yes, essentially. Given the volume of updates, this seemed neater than >> > framing those updates as in-tree incremental development. >> >> I think that's an odd way of representing this work. I tend to >> remember roughly when major things were committed even years later. An >> outright revert should represent a total back out of the original >> commit IMV. Otherwise, a git blame can be quite misleading. > > I think you're saying that "clearer git blame" is a more-important reason than > "volume of updates" for preferring an outright revert over in-tree incremental > development. Fair preference. If that's a correct reading of your message, > then we do agree on the bottom line. Hmm. I read Peter's message as agreeing with Andres rather than with you. And I have to say I agree with Andres as well. I think it's weird to back a commit out only to put a bunch of very similar stuff back in. Even figuring out what you've actually changed here seems rather hard. I couldn't get github to give me a diff showing your changes vs. master. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Dec 1, 2015 at 2:07 PM, Robert Haas <robertmhaas@gmail.com> wrote: > Hmm. I read Peter's message as agreeing with Andres rather than with > you. And I have to say I agree with Andres as well. I think it's > weird to back a commit out only to put a bunch of very similar stuff > back in. Your interpretation was correct. I think it's surprising to structure things this way, especially since we haven't done things this way in the past. -- Peter Geoghegan
On Tue, Dec 01, 2015 at 05:07:15PM -0500, Robert Haas wrote: > I think it's weird to back a commit out only to put a bunch of very similar > stuff back in. I agree with that. If the original patches and their replacements shared 95% of diff lines in common, we wouldn't be having this conversation. These replacements redo closer to 50% of the lines, so the patches are not very similar by verbatim line comparison. Even so, let's stipulate that my proposal is weird. I'd rather be weird than lose one of the benefits I articulated upthread, let alone all of them. My post-commit review of RLS woke me up to the problems of gradually finishing work in-tree. By the time I started that review in 2015-06, the base RLS feature already spanned twenty-two commits. (That count has since more than doubled.) Reviewing each commit threatened to be wasteful, because I would presumably find matters already fixed later. I tried to cherry-pick the twenty-two commits onto a branch, hoping to review the overall diff as "git diff master...squash-rls", but that yielded complex merge conflicts. Where the conflicts were too much, I reviewed entire files instead. (Granted, no matter how this thread ends, I do not expect an outcome that opaque.) Hackers have been too reticent to revert and redo defect-laden commits. If doing that is weird today, let it be normal. > Even figuring out what you've actually changed here seems > rather hard. I couldn't get github to give me a diff showing your > changes vs. master. If you, an expert in the 2015-09-26 commits, want to study the key changes I made, I recommend perusing these outputs: git remote add nmisch_github https://github.com/nmisch/postgresql.git git fetch nmisch_github git diff nmisch_github/mxt0-revert nmisch_github/mxt1-disk-independent git log -p --reverse --no-merges nmisch_github/mxt2-cosmetic..nmisch_github/mxt3-main For the overall diff vs. master that you sought: git remote add community git://git.postgresql.org/git/postgresql.git git remote add nmisch_github https://github.com/nmisch/postgresql.git git fetch --multiple community nmisch_github git diff community/master...nmisch_github/mxt4-rm-legacy If anyone not an author or reviewer of the 2015-09-26 commits wishes to review the work, don't read the above diffs; the intermediate states are uninteresting. Read these four: git remote add nmisch_github https://github.com/nmisch/postgresql.git git fetch nmisch_github git diff nmisch_github/mxt0-revert nmisch_github/mxt1-disk-independent git diff nmisch_github/mxt1-disk-independent nmisch_github/mxt2-cosmetic git diff nmisch_github/mxt2-cosmetic nmisch_github/mxt3-main git diff nmisch_github/mxt3-main nmisch_github/mxt4-rm-legacy Thanks, nm
On 2015-12-02 09:57:19 -0500, Noah Misch wrote: > On Tue, Dec 01, 2015 at 05:07:15PM -0500, Robert Haas wrote: > > I think it's weird to back a commit out only to put a bunch of very similar > > stuff back in. > > I agree with that. If the original patches and their replacements shared 95% > of diff lines in common, we wouldn't be having this conversation. These > replacements redo closer to 50% of the lines, so the patches are not very > similar by verbatim line comparison. Which is a huge problem, because it makes it very hard to see what your changes actually do. And a significant portion of the changes relative to master aren't particularly critical. Which is easy to see if if a commit only changes comments, but harder if you see one commit reverting things, and another redoing most of the same things. > Hackers have been too reticent to revert and redo defect-laden > commits. If doing that is weird today, let it be normal. Why? Especially if reverting and redoing includes conflicts that mainly increases the chance of accidental bugs. > git remote add community git://git.postgresql.org/git/postgresql.git > git remote add nmisch_github https://github.com/nmisch/postgresql.git > git fetch --multiple community nmisch_github > git diff community/master...nmisch_github/mxt4-rm-legacy That's a nearly useless diff, because it includes a lot of other changes (218 files changed, 2828 insertions(+), 8742 deletions(-)) made since you published the changes. What kinda works is git diff $(git merge-base community/master nmisch_github/mxt4-rm-legacy)..nmisch_github/mxt4-rm-legacy which shows the diff to the version of master you start off from. Review of the above diff: > @@ -2013,7 +2017,7 @@ TrimMultiXact(void) > { > MultiXactId nextMXact; > MultiXactOffset offset; > - MultiXactId oldestMXact; > + MultiXactId oldestMXact; That's a bit weird, given that nextMXact isn't indented... > @@ -2190,7 +2194,8 @@ MultiXactSetNextMXact(MultiXactId nextMulti, > - /* > - * Computing the actual limits is only possible once the data directory is > - * in a consistent state. There's no need to compute the limits while > - * still replaying WAL - no decisions about new multis are made even > - * though multixact creations might be replayed. So we'll only do further > - * checks after TrimMultiXact() has been called. > - */ > + /* Before the TrimMultiXact() call at end of recovery, skip the rest. */ > if (!MultiXactState->finishedStartup) > return; > - > Assert(!InRecovery); > > - /* Set limits for offset vacuum. */ > + /* > + * Setting MultiXactState->oldestOffset entails a find_multixact_start() > + * call, which is only possible once the data directory is in a consistent > + * state. There's no need for an offset limit while still replaying WAL; > + * no decisions about new multis are made even though multixact creations > + * might be replayed. > + */ > needs_offset_vacuum = SetOffsetVacuumLimit(); I don't really see the benefit of this change. The previous location of the comment is where we return early, so it seems appropriate to document the reason there? > /* > @@ -2354,6 +2356,12 @@ MultiXactAdvanceNextMXact(MultiXactId minMulti, > debug_elog3(DEBUG2, "MultiXact: setting next multi to %u", minMulti); > MultiXactState->nextMXact = minMulti; > } > + > + /* > + * MultiXactOffsetPrecedes() gives the wrong answer if nextOffset would > + * advance more than 2^31 between calls. Since we get a call for each > + * XLOG_MULTIXACT_CREATE_ID, that should never happen. > + */ Independent comment improvement. Good idea though. > /* > - * Update our oldestMultiXactId value, but only if it's more recent than what > - * we had. > - * > - * This may only be called during WAL replay. > + * Update our oldestMultiXactId value, but only if it's more recent than > + * what we had. This may only be called during WAL replay. > */ Whatever? > void > MultiXactAdvanceOldest(MultiXactId oldestMulti, Oid oldestMultiDB) > @@ -2544,14 +2550,13 @@ GetOldestMultiXactId(void) > static bool > SetOffsetVacuumLimit(void) > { > - MultiXactId oldestMultiXactId; > + MultiXactId oldestMultiXactId; > MultiXactId nextMXact; > - MultiXactOffset oldestOffset = 0; /* placate compiler */ > - MultiXactOffset prevOldestOffset; > + MultiXactOffset oldestOffset = 0; /* placate compiler */ > MultiXactOffset nextOffset; > bool oldestOffsetKnown = false; > + MultiXactOffset prevOldestOffset; > bool prevOldestOffsetKnown; > - MultiXactOffset offsetStopLimit = 0; I don't see the benefit of the order changes here. > @@ -2588,40 +2590,50 @@ SetOffsetVacuumLimit(void) > else > { > /* > - * Figure out where the oldest existing multixact's offsets are > - * stored. Due to bugs in early release of PostgreSQL 9.3.X and 9.4.X, > - * the supposedly-earliest multixact might not really exist. We are > + * Figure out where the oldest existing multixact's offsets are stored. > + * Due to bugs in early release of PostgreSQL 9.3.X and 9.4.X, the > + * supposedly-earliest multixact might not really exist. We are > * careful not to fail in that case. > */ > oldestOffsetKnown = > find_multixact_start(oldestMultiXactId, &oldestOffset); > - > - if (oldestOffsetKnown) > - ereport(DEBUG1, > - (errmsg("oldest MultiXactId member is at offset %u", > - oldestOffset))); That's imo a rather useful debug message. > - else > + if (!oldestOffsetKnown) > + { > + /* XXX This message is incorrect if prevOldestOffsetKnown. */ > ereport(LOG, > (errmsg("MultiXact member wraparound protections are disabled because oldest checkpointed MultiXact%u does not exist on disk", > oldestMultiXactId))); > + } > } Hm, the XXX is a "problem" in all 9.3+ - should we just fix it everywhere? > LWLockRelease(MultiXactTruncationLock); > > /* > - * If we can, compute limits (and install them MultiXactState) to prevent > - * overrun of old data in the members SLRU area. We can only do so if the > - * oldest offset is known though. > + * There's no need to update anything if we don't know the oldest offset > + * or if it hasn't changed. > */ Is that really a worthwhile optimization? > -typedef struct mxtruncinfo > -{ > - int earliestExistingPage; > -} mxtruncinfo; > - > -/* > - * SlruScanDirectory callback > - * This callback determines the earliest existing page number. > - */ > -static bool > -SlruScanDirCbFindEarliest(SlruCtl ctl, char *filename, int segpage, void *data) > -{ > - mxtruncinfo *trunc = (mxtruncinfo *) data; > - > - if (trunc->earliestExistingPage == -1 || > - ctl->PagePrecedes(segpage, trunc->earliestExistingPage)) > - { > - trunc->earliestExistingPage = segpage; > - } > - > - return false; /* keep going */ > -} > - That really seems like an independent change, deserving its own commit + explanation. Just referring to "See mailing list submission notes for rationale." makes understanding the change later imo much harder than all the incremental commits you try to avoid. > /* > - * Decide which of two MultiXactMember page numbers is "older" for truncation > - * purposes. There is no "invalid offset number" so use the numbers verbatim. > + * Dummy notion of which of two MultiXactMember page numbers is "older". > + * > + * Due to the MultiXactOffsetPrecedes() specification, this function's result > + * is meaningless unless the system is preserving less than 2^31 members. It > + * is adequate for SlruSelectLRUPage() guessing the cheapest slot to reclaim. > + * Do not pass MultiXactMemberCtl to any of the functions that use the > + * PagePrecedes callback in other ways. > */ > static bool > MultiXactMemberPagePrecedes(int page1, int page2) > @@ -3157,6 +3134,10 @@ MultiXactIdPrecedesOrEquals(MultiXactId multi1, MultiXactId multi2) > > /* > * Decide which of two offsets is earlier. > + * > + * Avoid calling this function. pg_multixact/members can preserve almost 2^32 > + * members at any given time, but this function is transitive only when the > + * system is preserving less than 2^31 members. > */ > static bool > MultiXactOffsetPrecedes(MultiXactOffset offset1, MultiXactOffset offset2) As mentioned before, these really seem unrelated. > @@ -1237,24 +1235,25 @@ SlruDeleteSegment(SlruCtl ctl, int segno) > SlruShared shared = ctl->shared; > int slotno; > char path[MAXPGPATH]; > - bool did_write; > > /* Clean out any possibly existing references to the segment. */ > LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE); > restart: > - did_write = false; > for (slotno = 0; slotno < shared->num_slots; slotno++) > { > - int pagesegno = shared->page_number[slotno] / SLRU_PAGES_PER_SEGMENT; > + int pagesegno; > > if (shared->page_status[slotno] == SLRU_PAGE_EMPTY) > continue; > > /* not the segment we're looking for */ > + pagesegno = shared->page_number[slotno] / SLRU_PAGES_PER_SEGMENT; > if (pagesegno != segno) > continue; > > - /* If page is clean, just change state to EMPTY (expected case). */ > + /* > + * If page is clean, just change state to EMPTY (expected case). > + */ > if (shared->page_status[slotno] == SLRU_PAGE_VALID && > !shared->page_dirty[slotno]) > { > @@ -1267,18 +1266,10 @@ restart: > SlruInternalWritePage(ctl, slotno, NULL); > else > SimpleLruWaitIO(ctl, slotno); > - > - did_write = true; > - } > - > - /* > - * Be extra careful and re-check. The IO functions release the control > - * lock, so new pages could have been read in. > - */ > - if (did_write) > goto restart; > + } I don't think that's really a good idea - this way we restart after every single page write, whereas currently we only restart after passing through all pages once. In nearly all cases we'll only ever have to retry once in total, be because such old pages aren't usually going to be reread/redirtied. > @@ -9216,10 +9212,8 @@ xlog_redo(XLogReaderState *record) > LWLockRelease(OidGenLock); > MultiXactSetNextMXact(checkPoint.nextMulti, > checkPoint.nextMultiOffset); > - > - MultiXactAdvanceOldest(checkPoint.oldestMulti, > - checkPoint.oldestMultiDB); > SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB); > + SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB); > > /* > * If we see a shutdown checkpoint while waiting for an end-of-backup > @@ -9309,17 +9303,13 @@ xlog_redo(XLogReaderState *record) > LWLockRelease(OidGenLock); > MultiXactAdvanceNextMXact(checkPoint.nextMulti, > checkPoint.nextMultiOffset); > - > - /* > - * NB: This may perform multixact truncation when replaying WAL > - * generated by an older primary. > - */ > - MultiXactAdvanceOldest(checkPoint.oldestMulti, > - checkPoint.oldestMultiDB); > if (TransactionIdPrecedes(ShmemVariableCache->oldestXid, > checkPoint.oldestXid)) > SetTransactionIdLimit(checkPoint.oldestXid, > checkPoint.oldestXidDB); > + MultiXactAdvanceOldest(checkPoint.oldestMulti, > + checkPoint.oldestMultiDB); > + > /* ControlFile->checkPointCopy always tracks the latest ckpt XID */ > ControlFile->checkPointCopy.nextXidEpoch = checkPoint.nextXidEpoch; > ControlFile->checkPointCopy.nextXid = > checkPoint.nextXid; Why? > diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c > index 7c4ef58..e2b4f4c 100644 > --- a/src/backend/commands/vacuum.c > +++ b/src/backend/commands/vacuum.c > @@ -1136,9 +1136,6 @@ vac_truncate_clog(TransactionId frozenXID, > if (bogus) > return; > > - /* > - * Truncate CLOG, multixact and CommitTs to the oldest computed value. > - */ > TruncateCLOG(frozenXID); > TruncateCommitTs(frozenXID); > TruncateMultiXact(minMulti, minmulti_datoid); Why? Sure, it's not a super important comment, but ...? > diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c > index 0dc4117..41e51cf 100644 > --- a/src/backend/storage/ipc/procarray.c > +++ b/src/backend/storage/ipc/procarray.c > @@ -2192,10 +2192,8 @@ GetOldestSafeDecodingTransactionId(void) > > /* > * GetVirtualXIDsDelayingChkpt -- Get the VXIDs of transactions that are > - * delaying checkpoint because they have critical actions in progress. > - * > - * Constructs an array of VXIDs of transactions that are currently in commit > - * critical sections, as shown by having delayChkpt set in their PGXACT. > + * delaying checkpoint because they have critical actions in progress, as > + * shown by having delayChkpt set in their PGXACT. > > * Returns a palloc'd array that should be freed by the caller. > * *nvxids is the number of valid entries. > @@ -2204,8 +2202,8 @@ GetOldestSafeDecodingTransactionId(void) > * the result is somewhat indeterminate, but we don't really care. Even in > * a multiprocessor with delayed writes to shared memory, it should be certain > * that setting of delayChkpt will propagate to shared memory when the backend > - * takes a lock, so we cannot fail to see a virtual xact as delayChkpt if > - * it's already inserted its commit record. Whether it takes a little while > + * takes a lock, so we cannot fail to see a virtual xact as delayChkpt if it's > + * already inserted its critical xlog record. Whether it takes a little while > * for clearing of delayChkpt to propagate is unimportant for correctness. > */ Seems unrelated, given that this is already used in MarkBufferDirtyHint(). Don't get me wrong, I think the changes are a good idea, but it's not really tied to the truncation changes. - Andres
On Wed, Dec 02, 2015 at 04:46:26PM +0100, Andres Freund wrote: > On 2015-12-02 09:57:19 -0500, Noah Misch wrote: > > Hackers have been too reticent to revert and redo defect-laden > > commits. If doing that is weird today, let it be normal. > > Why? See my paragraph ending with the two sentences you quoted. > Especially if reverting and redoing includes conflicts that mainly > increases the chance of accidental bugs. True. (That doesn't apply to these patches.) > > git remote add community git://git.postgresql.org/git/postgresql.git > > git remote add nmisch_github https://github.com/nmisch/postgresql.git > > git fetch --multiple community nmisch_github > > git diff community/master...nmisch_github/mxt4-rm-legacy > > That's a nearly useless diff, because it includes a lot of other changes > (218 files changed, 2828 insertions(+), 8742 deletions(-)) made since > you published the changes. Perhaps you used "git diff a..b", not "git diff a...b". If not, please send the outputs of "git rev-parse community/master nmisch_github/mxt4-rm-legacy" and "git --version". > > @@ -2190,7 +2194,8 @@ MultiXactSetNextMXact(MultiXactId nextMulti, > > - /* > > - * Computing the actual limits is only possible once the data directory is > > - * in a consistent state. There's no need to compute the limits while > > - * still replaying WAL - no decisions about new multis are made even > > - * though multixact creations might be replayed. So we'll only do further > > - * checks after TrimMultiXact() has been called. > > - */ > > + /* Before the TrimMultiXact() call at end of recovery, skip the rest. */ > > if (!MultiXactState->finishedStartup) > > return; > > - > > Assert(!InRecovery); > > > > - /* Set limits for offset vacuum. */ > > + /* > > + * Setting MultiXactState->oldestOffset entails a find_multixact_start() > > + * call, which is only possible once the data directory is in a consistent > > + * state. There's no need for an offset limit while still replaying WAL; > > + * no decisions about new multis are made even though multixact creations > > + * might be replayed. > > + */ > > needs_offset_vacuum = SetOffsetVacuumLimit(); > > I don't really see the benefit of this change. The previous location of > the comment is where we return early, so it seems appropriate to > document the reason there? I made that low-importance change for two reasons. First, returning at that point skips more than just the setting a limit; it also skips autovacuum signalling and wraparound warnings. Second, the function has just computed mxid "actual limits", so branch mxt2-cosmetic made the comment specify that we defer an offset limit, not any and all limits. > > static bool > > SetOffsetVacuumLimit(void) > > { > > - MultiXactId oldestMultiXactId; > > + MultiXactId oldestMultiXactId; > > MultiXactId nextMXact; > > - MultiXactOffset oldestOffset = 0; /* placate compiler */ > > - MultiXactOffset prevOldestOffset; > > + MultiXactOffset oldestOffset = 0; /* placate compiler */ > > MultiXactOffset nextOffset; > > bool oldestOffsetKnown = false; > > + MultiXactOffset prevOldestOffset; > > bool prevOldestOffsetKnown; > > - MultiXactOffset offsetStopLimit = 0; > > I don't see the benefit of the order changes here. I reacted the same way. Commit 4f627f8 reordered some declarations, which I reverted when I refinished that commit as branch mxt3-main. > > - if (oldestOffsetKnown) > > - ereport(DEBUG1, > > - (errmsg("oldest MultiXactId member is at offset %u", > > - oldestOffset))); > > That's imo a rather useful debug message. The branches emit that message at the same times 4f627f8^ and earlier emit it. > > - else > > + if (!oldestOffsetKnown) > > + { > > + /* XXX This message is incorrect if prevOldestOffsetKnown. */ > > ereport(LOG, > > (errmsg("MultiXact member wraparound protections are disabled because oldest checkpointed MultiXact%u does not exist on disk", > > oldestMultiXactId))); > > + } > > } > > Hm, the XXX is a "problem" in all 9.3+ - should we just fix it everywhere? I welcome a project to fix it. > > LWLockRelease(MultiXactTruncationLock); > > > > /* > > - * If we can, compute limits (and install them MultiXactState) to prevent > > - * overrun of old data in the members SLRU area. We can only do so if the > > - * oldest offset is known though. > > + * There's no need to update anything if we don't know the oldest offset > > + * or if it hasn't changed. > > */ > > Is that really a worthwhile optimization? I would neither remove that longstanding optimization nor add it from scratch today. Branch commit 06c9979 restored it as part of a larger restoration to the pre-4f627f8 structure of SetOffsetVacuumLimit(). > > -typedef struct mxtruncinfo > > -{ > > - int earliestExistingPage; > > -} mxtruncinfo; > > - > > -/* > > - * SlruScanDirectory callback > > - * This callback determines the earliest existing page number. > > - */ > > -static bool > > -SlruScanDirCbFindEarliest(SlruCtl ctl, char *filename, int segpage, void *data) > > -{ > > - mxtruncinfo *trunc = (mxtruncinfo *) data; > > - > > - if (trunc->earliestExistingPage == -1 || > > - ctl->PagePrecedes(segpage, trunc->earliestExistingPage)) > > - { > > - trunc->earliestExistingPage = segpage; > > - } > > - > > - return false; /* keep going */ > > -} > > - > > That really seems like an independent change, deserving its own commit + > explanation. Indeed. I explained that change at length in http://www.postgresql.org/message-id/20151108085407.GA1097830@tornado.leadboat.com, including that it's alone on a branch (mxt1-disk-independent), to become its own PostgreSQL commit. > > [branch commit 89a7232] > > I don't think that's really a good idea - this way we restart after > every single page write, whereas currently we only restart after passing > through all pages once. In nearly all cases we'll only ever have to > retry once in total, be because such old pages aren't usually going to > be reread/redirtied. Your improvement sounds fine, then. Would both SimpleLruTruncate() and SlruDeleteSegment() benefit from it? > > @@ -9216,10 +9212,8 @@ xlog_redo(XLogReaderState *record) > > LWLockRelease(OidGenLock); > > MultiXactSetNextMXact(checkPoint.nextMulti, > > checkPoint.nextMultiOffset); > > - > > - MultiXactAdvanceOldest(checkPoint.oldestMulti, > > - checkPoint.oldestMultiDB); > > SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB); > > + SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB); > > > > /* > > * If we see a shutdown checkpoint while waiting for an end-of-backup > > @@ -9309,17 +9303,13 @@ xlog_redo(XLogReaderState *record) > > LWLockRelease(OidGenLock); > > MultiXactAdvanceNextMXact(checkPoint.nextMulti, > > checkPoint.nextMultiOffset); > > - > > - /* > > - * NB: This may perform multixact truncation when replaying WAL > > - * generated by an older primary. > > - */ > > - MultiXactAdvanceOldest(checkPoint.oldestMulti, > > - checkPoint.oldestMultiDB); > > if (TransactionIdPrecedes(ShmemVariableCache->oldestXid, > > checkPoint.oldestXid)) > > SetTransactionIdLimit(checkPoint.oldestXid, > > checkPoint.oldestXidDB); > > + MultiXactAdvanceOldest(checkPoint.oldestMulti, > > + checkPoint.oldestMultiDB); > > + > > /* ControlFile->checkPointCopy always tracks the latest ckpt XID */ > > ControlFile->checkPointCopy.nextXidEpoch = checkPoint.nextXidEpoch; > > ControlFile->checkPointCopy.nextXid = > > checkPoint.nextXid; > > Why? master does not and will not have legacy truncation, so the deleted comment does not belong in master. Regarding the SetMultiXactIdLimit() call: commit 611a2ec Author: Noah Misch <noah@leadboat.com> AuthorDate: Sat Nov 7 15:06:28 2015 -0500 Commit: Noah Misch <noah@leadboat.com> CommitDate: Sat Nov 7 15:06:28 2015 -0500 In xlog_redo(), believe a SHUTDOWN checkPoint.oldestMulti exactly. It was so before this branch. This restores consistencywith the handling of nextXid, nextMulti and oldestMulti: we treat them as exact for XLOG_CHECKPOINT_SHUTDOWNand as minima for XLOG_CHECKPOINT_ONLINE. I do not know of a case where this definitely mattersfor any of these counters. It might matter if a bug causes oldestXid to move forward wrongly, causing it to thenmove backward later. (I don't know if VACUUM does ever move oldestXid backward, but it's a plausible thing to doif on-disk state fully agrees with an older value.) That example has no counterpart for oldestMultiXactId, because anyupdate first arrives in an XLOG_MULTIXACT_TRUNCATE_ID record. Therefore, this commit is probably cosmetic. > > - /* > > - * Truncate CLOG, multixact and CommitTs to the oldest computed value. > > - */ > > TruncateCLOG(frozenXID); > > TruncateCommitTs(frozenXID); > > TruncateMultiXact(minMulti, minmulti_datoid); > > Why? Sure, it's not a super important comment, but ...? Yeah, it scarcely matters either way. Commit 4f627f8 reduced this comment to merely restating the code, so I removed it instead. > > @@ -2204,8 +2202,8 @@ GetOldestSafeDecodingTransactionId(void) > > * the result is somewhat indeterminate, but we don't really care. Even in > > * a multiprocessor with delayed writes to shared memory, it should be certain > > * that setting of delayChkpt will propagate to shared memory when the backend > > - * takes a lock, so we cannot fail to see a virtual xact as delayChkpt if > > - * it's already inserted its commit record. Whether it takes a little while > > + * takes a lock, so we cannot fail to see a virtual xact as delayChkpt if it's > > + * already inserted its critical xlog record. Whether it takes a little while > > * for clearing of delayChkpt to propagate is unimportant for correctness. > > */ > > Seems unrelated, given that this is already used in > MarkBufferDirtyHint(). Don't get me wrong, I think the changes are a > good idea, but it's not really tied to the truncation changes. Quite so; its branch (one branch = one proposed PostgreSQL commit), mxt2-cosmetic, contains no truncation changes. Likewise for the other independent comment improvements you noted. nm
On 2015-12-03 04:38:45 -0500, Noah Misch wrote: > On Wed, Dec 02, 2015 at 04:46:26PM +0100, Andres Freund wrote: > > Especially if reverting and redoing includes conflicts that mainly > > increases the chance of accidental bugs. > > True. (That doesn't apply to these patches.) Uh, it does. You had conflicts in your process, and it's hard to verify that the re-applied patch is actually functionally identical given the volume of changes. It's much easier to see what actually changes by looking at iterative commits forward from the current state. Sorry, but I really just want to see these changes as iterative patches ontop of 9.5/HEAD instead of this process. I won't revert the reversion if you push it anyway, but I think it's a rather bad idea. > > > git remote add community git://git.postgresql.org/git/postgresql.git > > > git remote add nmisch_github https://github.com/nmisch/postgresql.git > > > git fetch --multiple community nmisch_github > > > git diff community/master...nmisch_github/mxt4-rm-legacy > > > > That's a nearly useless diff, because it includes a lot of other changes > > (218 files changed, 2828 insertions(+), 8742 deletions(-)) made since > > you published the changes. > > Perhaps you used "git diff a..b", not "git diff a...b". Ah yes. Neat, didn't know that one. > > > @@ -2190,7 +2194,8 @@ MultiXactSetNextMXact(MultiXactId nextMulti, > > > - /* > > > - * Computing the actual limits is only possible once the data directory is > > > - * in a consistent state. There's no need to compute the limits while > > > - * still replaying WAL - no decisions about new multis are made even > > > - * though multixact creations might be replayed. So we'll only do further > > > - * checks after TrimMultiXact() has been called. > > > - */ > > > + /* Before the TrimMultiXact() call at end of recovery, skip the rest. */ > > > if (!MultiXactState->finishedStartup) > > > return; > > > - > > > Assert(!InRecovery); > > > > > > - /* Set limits for offset vacuum. */ > > > + /* > > > + * Setting MultiXactState->oldestOffset entails a find_multixact_start() > > > + * call, which is only possible once the data directory is in a consistent > > > + * state. There's no need for an offset limit while still replaying WAL; > > > + * no decisions about new multis are made even though multixact creations > > > + * might be replayed. > > > + */ > > > needs_offset_vacuum = SetOffsetVacuumLimit(); > > > > I don't really see the benefit of this change. The previous location of > > the comment is where we return early, so it seems appropriate to > > document the reason there? > > I made that low-importance change for two reasons. First, returning at that > point skips more than just the setting a limit; it also skips autovacuum > signalling and wraparound warnings. Second, the function has just computed > mxid "actual limits", so branch mxt2-cosmetic made the comment specify that we > defer an offset limit, not any and all limits. My question was more about the comment being after the "early return" than about the content change, should have made that clearer. Can we just move your comment up? > > > static bool > > > SetOffsetVacuumLimit(void) > > > { > > > - MultiXactId oldestMultiXactId; > > > + MultiXactId oldestMultiXactId; > > > MultiXactId nextMXact; > > > - MultiXactOffset oldestOffset = 0; /* placate compiler */ > > > - MultiXactOffset prevOldestOffset; > > > + MultiXactOffset oldestOffset = 0; /* placate compiler */ > > > MultiXactOffset nextOffset; > > > bool oldestOffsetKnown = false; > > > + MultiXactOffset prevOldestOffset; > > > bool prevOldestOffsetKnown; > > > - MultiXactOffset offsetStopLimit = 0; > > > > I don't see the benefit of the order changes here. > > I reacted the same way. Commit 4f627f8 reordered some declarations, which I > reverted when I refinished that commit as branch mxt3-main. But the other changes are there, and in the history anyway. As the new order isn't more meaningful than the current one... > > > - if (oldestOffsetKnown) > > > - ereport(DEBUG1, > > > - (errmsg("oldest MultiXactId member is at offset %u", > > > - oldestOffset))); > > > > That's imo a rather useful debug message. > > The branches emit that message at the same times 4f627f8^ and earlier emit it. During testing I found it rather helpful if it was emitted regularly. > > > LWLockRelease(MultiXactTruncationLock); > > > > > > /* > > > - * If we can, compute limits (and install them MultiXactState) to prevent > > > - * overrun of old data in the members SLRU area. We can only do so if the > > > - * oldest offset is known though. > > > + * There's no need to update anything if we don't know the oldest offset > > > + * or if it hasn't changed. > > > */ > > > > Is that really a worthwhile optimization? > > I would neither remove that longstanding optimization nor add it from scratch > today. Branch commit 06c9979 restored it as part of a larger restoration to > the pre-4f627f8 structure of SetOffsetVacuumLimit(). There DetermineSafeOldestOffset() did it unconditionally. > > > -SlruScanDirCbFindEarliest(SlruCtl ctl, char *filename, int segpage, void *data) > > > > That really seems like an independent change, deserving its own commit + > > explanation. > > Indeed. I explained that change at length in > http://www.postgresql.org/message-id/20151108085407.GA1097830@tornado.leadboat.com, > including that it's alone on a branch (mxt1-disk-independent), to become its > own PostgreSQL commit. The comment there doesn't include the explanation... > > > [branch commit 89a7232] > > > > I don't think that's really a good idea - this way we restart after > > every single page write, whereas currently we only restart after passing > > through all pages once. In nearly all cases we'll only ever have to > > retry once in total, be because such old pages aren't usually going to > > be reread/redirtied. > > Your improvement sounds fine, then. Would both SimpleLruTruncate() and > SlruDeleteSegment() benefit from it? It probably makes sense to do it in SimpleLruTruncate too - but it does additional checks as part of the restarts which aren't applicable for DeleteSegment(), which is IIRC why I didn't also change it. Greetings, Andres Freund
On Thu, Dec 03, 2015 at 07:03:21PM +0100, Andres Freund wrote: > On 2015-12-03 04:38:45 -0500, Noah Misch wrote: > > On Wed, Dec 02, 2015 at 04:46:26PM +0100, Andres Freund wrote: > > > Especially if reverting and redoing includes conflicts that mainly > > > increases the chance of accidental bugs. > > > > True. (That doesn't apply to these patches.) > > Uh, it does. You had conflicts in your process, and it's hard to verify > that the re-applied patch is actually functionally identical given the > volume of changes. It's much easier to see what actually changes by > looking at iterative commits forward from the current state. Ah, we were talking about different topics after all. I was talking about _merge_ conflicts in a reversion commit. > Sorry, but I really just want to see these changes as iterative patches > ontop of 9.5/HEAD instead of this process. I won't revert the reversion > if you push it anyway, but I think it's a rather bad idea. I hear you. I evaluated your request and judged that the benefits you cited did not make up for the losses I cited. Should you wish to change my mind, your best bet is to find defects in the commits I proposed. If I introduced juicy defects, that discovery would lend much weight to your conjectures. > My question was more about the comment being after the "early return" > than about the content change, should have made that clearer. Can we > just move your comment up? Sure, I will. > > > > static bool > > > > SetOffsetVacuumLimit(void) > > > > { > > > > - MultiXactId oldestMultiXactId; > > > > + MultiXactId oldestMultiXactId; > > > > MultiXactId nextMXact; > > > > - MultiXactOffset oldestOffset = 0; /* placate compiler */ > > > > - MultiXactOffset prevOldestOffset; > > > > + MultiXactOffset oldestOffset = 0; /* placate compiler */ > > > > MultiXactOffset nextOffset; > > > > bool oldestOffsetKnown = false; > > > > + MultiXactOffset prevOldestOffset; > > > > bool prevOldestOffsetKnown; > > > > - MultiXactOffset offsetStopLimit = 0; > > > > > > I don't see the benefit of the order changes here. > > > > I reacted the same way. Commit 4f627f8 reordered some declarations, which I > > reverted when I refinished that commit as branch mxt3-main. > > But the other changes are there, and in the history anyway. As the new > order isn't more meaningful than the current one... Right. A revert+redo patch series can and should purge formatting changes that did not belong in its predecessor commits. Alternate change delivery strategies wouldn't do that. > > > > - if (oldestOffsetKnown) > > > > - ereport(DEBUG1, > > > > - (errmsg("oldest MultiXactId member is at offset %u", > > > > - oldestOffset))); > > > > > > That's imo a rather useful debug message. > > > > The branches emit that message at the same times 4f627f8^ and earlier emit it. > > During testing I found it rather helpful if it was emitted regularly. I wouldn't oppose a patch making it happen more often. > > > > LWLockRelease(MultiXactTruncationLock); > > > > > > > > /* > > > > - * If we can, compute limits (and install them MultiXactState) to prevent > > > > - * overrun of old data in the members SLRU area. We can only do so if the > > > > - * oldest offset is known though. > > > > + * There's no need to update anything if we don't know the oldest offset > > > > + * or if it hasn't changed. > > > > */ > > > > > > Is that really a worthwhile optimization? > > > > I would neither remove that longstanding optimization nor add it from scratch > > today. Branch commit 06c9979 restored it as part of a larger restoration to > > the pre-4f627f8 structure of SetOffsetVacuumLimit(). > > There DetermineSafeOldestOffset() did it unconditionally. That is true; one won't be consistent with both. 06c9979 materially shortened the final patch and eliminated some user-visible message emission changes. Moreover, this is clearly a case of SetOffsetVacuumLimit() absorbing DetermineSafeOldestOffset(), not vice versa. > > > > -SlruScanDirCbFindEarliest(SlruCtl ctl, char *filename, int segpage, void *data) > > > > > > That really seems like an independent change, deserving its own commit + > > > explanation. > > > > Indeed. I explained that change at length in > > http://www.postgresql.org/message-id/20151108085407.GA1097830@tornado.leadboat.com, > > including that it's alone on a branch (mxt1-disk-independent), to become its > > own PostgreSQL commit. > > The comment there doesn't include the explanation... If you visit that URL, everything from "If anything here requires careful study, it's the small mxt1-disk-independent change, which ..." to the end of the message is my explanation of this change. What else would you like to know about it? > > > > [branch commit 89a7232] > > > > > > I don't think that's really a good idea - this way we restart after > > > every single page write, whereas currently we only restart after passing > > > through all pages once. In nearly all cases we'll only ever have to > > > retry once in total, be because such old pages aren't usually going to > > > be reread/redirtied. > > > > Your improvement sounds fine, then. Would both SimpleLruTruncate() and > > SlruDeleteSegment() benefit from it? > > It probably makes sense to do it in SimpleLruTruncate too - but it does > additional checks as part of the restarts which aren't applicable for > DeleteSegment(), which is IIRC why I didn't also change it. Understood. There's no rule that these two functions must look as similar as possible, so I will undo 89a7232. Thanks, nm
On 2015-12-04 21:55:29 -0500, Noah Misch wrote: > On Thu, Dec 03, 2015 at 07:03:21PM +0100, Andres Freund wrote: > > Sorry, but I really just want to see these changes as iterative patches > > ontop of 9.5/HEAD instead of this process. I won't revert the reversion > > if you push it anyway, but I think it's a rather bad idea. > > I hear you. Not just me. > I evaluated your request and judged that the benefits you cited > did not make up for the losses I cited. Should you wish to change my mind, > your best bet is to find defects in the commits I proposed. If I introduced > juicy defects, that discovery would lend much weight to your conjectures. I've absolutely no interest in "proving you wrong". And my desire to review patches that are in a, in my opinion, barely reviewable format is pretty low as well. Andres
On Tue, Dec 8, 2015 at 6:43 AM, Andres Freund <andres@anarazel.de> wrote: > On 2015-12-04 21:55:29 -0500, Noah Misch wrote: >> On Thu, Dec 03, 2015 at 07:03:21PM +0100, Andres Freund wrote: >> > Sorry, but I really just want to see these changes as iterative patches >> > ontop of 9.5/HEAD instead of this process. I won't revert the reversion >> > if you push it anyway, but I think it's a rather bad idea. >> >> I hear you. > > Not just me. > >> I evaluated your request and judged that the benefits you cited >> did not make up for the losses I cited. Should you wish to change my mind, >> your best bet is to find defects in the commits I proposed. If I introduced >> juicy defects, that discovery would lend much weight to your conjectures. > > I've absolutely no interest in "proving you wrong". And my desire to > review patches that are in a, in my opinion, barely reviewable format is > pretty low as well. I agree. Noah, it seems to me that you are offering a novel theory of how patches should be submitted, reviewed, and committed, but you've got three people, two of them committers, telling you that we don't like that approach. I seriously doubt you're going to find anyone who does. When stuff gets committed to the tree, people want to to be able to answer the question "what has just now changed?" and it is indisputable that what you want to do here will make that harder. That's not a one-time problem for Andres during the course of review; that is a problem for every single person who looks at the commit history from now until the end of time. I don't think you have the right to force your proposed approach through in the face of concerted opposition. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Dec 08, 2015 at 01:05:03PM -0500, Robert Haas wrote: > On Tue, Dec 8, 2015 at 6:43 AM, Andres Freund <andres@anarazel.de> wrote: > > On 2015-12-04 21:55:29 -0500, Noah Misch wrote: > >> On Thu, Dec 03, 2015 at 07:03:21PM +0100, Andres Freund wrote: > >> > Sorry, but I really just want to see these changes as iterative patches > >> > ontop of 9.5/HEAD instead of this process. I won't revert the reversion > >> > if you push it anyway, but I think it's a rather bad idea. > >> > >> I hear you. > > > > Not just me. > > > >> I evaluated your request and judged that the benefits you cited > >> did not make up for the losses I cited. Should you wish to change my mind, > >> your best bet is to find defects in the commits I proposed. If I introduced > >> juicy defects, that discovery would lend much weight to your conjectures. > > > > I've absolutely no interest in "proving you wrong". And my desire to > > review patches that are in a, in my opinion, barely reviewable format is > > pretty low as well. > > I agree. Noah, it seems to me that you are offering a novel theory of > how patches should be submitted, reviewed, and committed, but you've > got three people, two of them committers, telling you that we don't > like that approach. I seriously doubt you're going to find anyone who > does. Andres writing the patch that became commit 4f627f8 and you reviewing it were gifts to Alvaro and to the community. Aware of that, I have avoided[1] saying that I was shocked to see that commit's defects. Despite a committer-author and _two_ committer reviewers, the patch was rife with wrong new comments, omitted updates to comments it caused to become wrong, and unsolicited whitespace churn. (Anyone could have missed the data loss bug, but these collectively leap off the page.) This in beleaguered code critical to data integrity. You call this thread's latest code a patch submission, but I call it bandaging the tree after a recent commit that never should have reached the tree. Hey, if you'd like me to post the traditional patch files, that's easy. It would have been easier for me. I posted branches because it gives more metadata to guide review. As for the choice to revert and redo ... > When stuff gets committed to the tree, people want to to be > able to answer the question "what has just now changed?" and it is > indisputable that what you want to do here will make that harder. I hope those who have not already read commit 4f627f8 will not waste time reading it. They should instead ignore multixact changes from commit 4f627f8 through its revert. The 2015-09-26 commits have not appeared in a supported release, and no other work has built upon them. They have no tenure. (I am glad you talked the author out of back-patching; otherwise, 9.4.5 and 9.3.10 would have introduced a data loss bug.) Nobody reported a single defect before my review overturned half the patch. A revert will indeed impose on those who invested time to understand commit 4f627f8, but the silence about its defects suggests the people understanding it number either zero or one. Even as its author and reviewers, you would do better to set aside what you thought you knew about this code. > That's not a one-time problem for Andres during the course of review; > that is a problem for every single person who looks at the commit > history from now until the end of time. It's a service to future readers that no line of "git blame master <...>" will refer to a 2015-09-26 multixact commit. Blame reports will instead refer to replacement commits designed to be meaningful for study in isolation. If I instead structured the repairs as you ask, the blame would find one of 4f627f8 or its various repair commits, none of which would be a self-contained unit of development. What's to enjoy about discovering that history? > I don't think you have the > right to force your proposed approach through in the face of concerted > opposition. That's correct; I do not have that right. Your objection still worries me. nm [1] http://www.postgresql.org/message-id/20151029065903.GC770464@tornado.leadboat.com
On 2015-12-09 09:43:19 -0500, Noah Misch wrote: > Aware of that, I have avoided[1] saying that I was shocked to see that > commit's defects. Despite a committer-author and _two_ committer > reviewers, the patch was rife with wrong new comments, omitted updates > to comments it caused to become wrong, It's not like that patch wasn't posted for review for months... > and unsolicited whitespace churn. Whitespace churn? The commit includes a pgindent run, because Alvaro asked me to do that, but that just affected a handful of lines. If you mean the variable ordering: given several variables were renamed anyway, additionally putting them in a easier to understand order, seems rather painless. If you mean 'pgindent immune' long lines - multixact.c is far from the only one with those, and they're prett harmless. > You call this thread's latest code a patch > submission, but I call it bandaging the tree after a recent commit > that never should have reached the tree. Oh, for christs sake. > Hey, if you'd like me to > post the traditional patch files, that's easy. It would have been > easier for me. You've been asked that, repeatedly. At least if you take 'traditional patch files' to include traditional, iterative, patches ontop of the current tree. > I hope those who have not already read commit 4f627f8 will not waste time > reading it. We have to, who knows what's hiding in there. Your git log even shows that you had conflicts in your approach (83cb04 Conflicts: src/backend/access/transam/multixact.c). > They should instead ignore multixact changes from commit 4f627f8 > through its revert. The 2015-09-26 commits have not appeared in a supported > release, and no other work has built upon them. > They have no tenure. Man. > (I am glad you talked the author out of back-patching; otherwise, > 9.4.5 and 9.3.10 would have introduced a data loss bug.) Isn't that a bug in a, as far as we know, impossible scenario? Unless I miss something there's no known case where it's "expected" that find_multixact_start() fails after initially succeeding? Sure, it sucks that the bug survived review and that it was written in the first place. But it not showing up during testing isn't meaningful, given it's a should-never-happen scenario. I'm actually kinda inclined to rip out the whole "previous pass" logic out alltogether, and replace it with a PANIC. It's a hard to test, should never happen, scenario. If it happens, things have already seriously gone sour. > > That's not a one-time problem for Andres during the course of review; > > that is a problem for every single person who looks at the commit > > history from now until the end of time. > > It's a service to future readers that no line of "git blame master <...>" will > refer to a 2015-09-26 multixact commit. And a disservice for everyone doing git log, or git blame for intermediate states of the tree. The benefit for git blame, are almost nonexistant, not seing a couple newlines changed, or not seing some intermediate commits isn't really important. > Blame reports will instead refer to > replacement commits designed to be meaningful for study in isolation. If I > instead structured the repairs as you ask, the blame would find one of 4f627f8 > or its various repair commits, none of which would be a self-contained unit of > development. So what? That's how development in general works. And how it actually happened in this specific case.
On Wed, Dec 9, 2015 at 9:43 AM, Noah Misch <noah@leadboat.com> wrote: > Andres writing the patch that became commit 4f627f8 and you reviewing it were > gifts to Alvaro and to the community. Aware of that, I have avoided[1] saying > that I was shocked to see that commit's defects. Despite a committer-author > and _two_ committer reviewers, the patch was rife with wrong new comments, > omitted updates to comments it caused to become wrong, and unsolicited > whitespace churn. (Anyone could have missed the data loss bug, but these > collectively leap off the page.) This in beleaguered code critical to data > integrity. You call this thread's latest code a patch submission, but I call > it bandaging the tree after a recent commit that never should have reached the > tree. Hey, if you'd like me to post the traditional patch files, that's easy. > It would have been easier for me. I posted branches because it gives more > metadata to guide review. As for the choice to revert and redo ... Yes, I'd like patch files, one per topic. I wasn't very happy with the way that patch it was written; it seemed to me that it touched too much code and move a lot of things around unnecessarily, and I said so at the time. I would have preferred something more incremental, and I asked for it and didn't get it. Well, I'm not giving up: I'm asking for the same thing here. I didn't think it was a good idea for Andres to rearrange that much code in a single commit, because it was hard to review, and I don't think it's a good idea for you to do it, either. To the extent that you found bugs, I think that proves the point that large commits are hard to review and small commits that change things just a little bit at a time are the way to go. > I hope those who have not already read commit 4f627f8 will not waste time > reading it. They should instead ignore multixact changes from commit 4f627f8 > through its revert. The 2015-09-26 commits have not appeared in a supported > release, and no other work has built upon them. They have no tenure. (I am > glad you talked the author out of back-patching; otherwise, 9.4.5 and 9.3.10 > would have introduced a data loss bug.) Nobody reported a single defect > before my review overturned half the patch. A revert will indeed impose on > those who invested time to understand commit 4f627f8, but the silence about > its defects suggests the people understanding it number either zero or one. > Even as its author and reviewers, you would do better to set aside what you > thought you knew about this code. I just don't find this a realistic model of how people use the git log. Maybe you use it this way; I don't. I don't *want* git blame to make it seem as if 4f627f8 is not part of the history. For better or worse, it is. Ripping it out and replacing it monolithically will not change that; it will only make the detailed history harder to reconstruct, and I *will* want to reconstruct it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Dec 9, 2015 at 10:41 AM, Andres Freund <andres@anarazel.de> wrote: >> (I am glad you talked the author out of back-patching; otherwise, >> 9.4.5 and 9.3.10 would have introduced a data loss bug.) > > Isn't that a bug in a, as far as we know, impossible scenario? Unless I > miss something there's no known case where it's "expected" that > find_multixact_start() fails after initially succeeding? Sure, it sucks > that the bug survived review and that it was written in the first > place. But it not showing up during testing isn't meaningful, given it's > a should-never-happen scenario. If I correctly understand the scenario that you are describing, that does happen - not for the same MXID, but for different ones. At least the last time I checked, and I'm not sure if we've fixed this, it could happen because the SLRU page that contains the multixact wasn't flushed out of the SLRU buffers yet. But apart from that, it could happen any time there's a gap in the sequence of files, and that sure doesn't seem like a can't-happen situation. We know that, on 9.3, there's definitely a sequence of events that leads to a 0000 file followed by a gap followed by the series of files that are still live. Given the number of other bugs we've fixed in this area, I would not like to bet on that being the only scenario where this crops up. It *shouldn't* happen, and as far as we know, if you start and end on a version newer than 4f627f8 and aa29c1c, it won't. Older branches, though, I wouldn't like to bet on. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2015-12-09 11:18:39 -0500, Robert Haas wrote: > If I correctly understand the scenario that you are describing, that > does happen - not for the same MXID, but for different ones. At least > the last time I checked, and I'm not sure if we've fixed this, it > could happen because the SLRU page that contains the multixact wasn't > flushed out of the SLRU buffers yet. That should be fixed, with the brute force solution of flushing buffers before searching for files on disk. > But apart from that, it could > happen any time there's a gap in the sequence of files, and that sure > doesn't seem like a can't-happen situation. We know that, on 9.3, > there's definitely a sequence of events that leads to a 0000 file > followed by a gap followed by the series of files that are still live. > Given the number of other bugs we've fixed in this area, I would not > like to bet on that being the only scenario where this crops up. It > *shouldn't* happen, and as far as we know, if you start and end on a > version newer than 4f627f8 and aa29c1c, it won't. Older branches, > though, I wouldn't like to bet on. Ok, fair enough. andres
On Wed, Dec 09, 2015 at 11:08:32AM -0500, Robert Haas wrote: > On Wed, Dec 9, 2015 at 9:43 AM, Noah Misch <noah@leadboat.com> wrote: > > I hope those who have not already read commit 4f627f8 will not waste time > > reading it. They should instead ignore multixact changes from commit 4f627f8 > > through its revert. The 2015-09-26 commits have not appeared in a supported > > release, and no other work has built upon them. They have no tenure. (I am > > glad you talked the author out of back-patching; otherwise, 9.4.5 and 9.3.10 > > would have introduced a data loss bug.) Nobody reported a single defect > > before my review overturned half the patch. A revert will indeed impose on > > those who invested time to understand commit 4f627f8, but the silence about > > its defects suggests the people understanding it number either zero or one. > > Even as its author and reviewers, you would do better to set aside what you > > thought you knew about this code. > > I just don't find this a realistic model of how people use the git > log. Maybe you use it this way; I don't. I don't *want* git blame to > make it seem as if 4f627f8 is not part of the history. For better or > worse, it is. I would like to understand how you use git, then. What's one of your models of using "git log" and/or "git blame" in which you foresee the revert making history less clear, not more clear? By the way, it occurs to me that I should also make pg_upgrade blacklist the range of catversions that might have data loss. No sense in putting ourselves in the position of asking whether data files of a 9.9.3 cluster spent time in a 9.5beta2 cluster. > Ripping it out and replacing it monolithically will not > change that; it will only make the detailed history harder to > reconstruct, and I *will* want to reconstruct it. What's something that might happen six months from now and lead you to inspect master or 9.5 multixact.c between 4f627f8 and its revert?
On 2015-12-09 20:23:06 -0500, Noah Misch wrote: > By the way, it occurs to me that I should also make pg_upgrade blacklist the > range of catversions that might have data loss. No sense in putting ourselves > in the position of asking whether data files of a 9.9.3 cluster spent time in > a 9.5beta2 cluster. I can't see any benefit in that. We're talking about a bug that, afaics, needs another unknown bug to trigger (so find_multixact_start() fails), and then very likely needs significant amounts of new multixacts consumed, without a restart and without find_multixact_start() succeeding later. What I think would actually help for questions like this, is to add, as discussed in some other threads, the following: 1) 'creating version' to pg_control 2) 'creating version' to each pg_class entry 3) 'last relation rewrite in version' to each pg_class entry 4) 'last full vacuum in version' to each pg_class entry I think for this purpose 'version' should be something akin to $catversion||$numericversion (int64 probably?) - that way development branches and release branches are handled somewhat usefully. I think that'd be useful, both from an investigatory perspective, as from a tooling perspective, because it'd allow reusing things like hint bits. > > Ripping it out and replacing it monolithically will not > > change that; it will only make the detailed history harder to > > reconstruct, and I *will* want to reconstruct it. > > What's something that might happen six months from now and lead you to inspect > master or 9.5 multixact.c between 4f627f8 and its revert? "Hey, what has happened to multixact.c lately? I'm investigating a bug, and I wonder if it already has been fixed?", "Uh, what was the problem with that earlier large commit?", "Hey, what has changed between beta2 and the final release?"...
On Thu, Dec 10, 2015 at 12:34 AM, Andres Freund <andres@anarazel.de> wrote: >> > Ripping it out and replacing it monolithically will not >> > change that; it will only make the detailed history harder to >> > reconstruct, and I *will* want to reconstruct it. >> >> What's something that might happen six months from now and lead you to inspect >> master or 9.5 multixact.c between 4f627f8 and its revert? > > "Hey, what has happened to multixact.c lately? I'm investigating a bug, > and I wonder if it already has been fixed?", "Uh, what was the problem > with that earlier large commit?", "Hey, what has changed between beta2 > and the final release?"... Quite. I can't believe we're still having this silly discussion. Can we please move on? -- Peter Geoghegan
<div dir="ltr">+1<br /></div><div class="gmail_extra"><br /><div class="gmail_quote">On Thu, Dec 10, 2015 at 9:58 AM, PeterGeoghegan <span dir="ltr"><<a href="mailto:pg@heroku.com" target="_blank">pg@heroku.com</a>></span> wrote:<br/><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">OnThu, Dec 10, 2015 at 12:34 AM, Andres Freund <<a href="mailto:andres@anarazel.de">andres@anarazel.de</a>>wrote:<br /> >> > Ripping it out and replacing it monolithicallywill not<br /> >> > change that; it will only make the detailed history harder to<br /> >> >reconstruct, and I *will* want to reconstruct it.<br /> >><br /> >> What's something that might happen sixmonths from now and lead you to inspect<br /> >> master or 9.5 multixact.c between 4f627f8 and its revert?<br />><br /> > "Hey, what has happened to multixact.c lately? I'm investigating a bug,<br /> > and I wonder if it alreadyhas been fixed?", "Uh, what was the problem<br /> > with that earlier large commit?", "Hey, what has changed betweenbeta2<br /> > and the final release?"...<br /><br /></span>Quite.<br /><br /> I can't believe we're still havingthis silly discussion. Can we please move on?<br /><span class="HOEnZb"><font color="#888888"><br /> --<br /> PeterGeoghegan<br /></font></span><div class="HOEnZb"><div class="h5"><br /><br /> --<br /> Sent via pgsql-hackers mailinglist (<a href="mailto:pgsql-hackers@postgresql.org">pgsql-hackers@postgresql.org</a>)<br /> To make changes to yoursubscription:<br /><a href="http://www.postgresql.org/mailpref/pgsql-hackers" rel="noreferrer" target="_blank">http://www.postgresql.org/mailpref/pgsql-hackers</a><br/></div></div></blockquote></div><br /><br clear="all"/><br />-- <br /><div class="gmail_signature">Bert Desmet<br />0477/305361</div></div>
On Wed, Dec 9, 2015 at 8:23 PM, Noah Misch <noah@leadboat.com> wrote: > On Wed, Dec 09, 2015 at 11:08:32AM -0500, Robert Haas wrote: >> On Wed, Dec 9, 2015 at 9:43 AM, Noah Misch <noah@leadboat.com> wrote: >> > I hope those who have not already read commit 4f627f8 will not waste time >> > reading it. They should instead ignore multixact changes from commit 4f627f8 >> > through its revert. The 2015-09-26 commits have not appeared in a supported >> > release, and no other work has built upon them. They have no tenure. (I am >> > glad you talked the author out of back-patching; otherwise, 9.4.5 and 9.3.10 >> > would have introduced a data loss bug.) Nobody reported a single defect >> > before my review overturned half the patch. A revert will indeed impose on >> > those who invested time to understand commit 4f627f8, but the silence about >> > its defects suggests the people understanding it number either zero or one. >> > Even as its author and reviewers, you would do better to set aside what you >> > thought you knew about this code. >> >> I just don't find this a realistic model of how people use the git >> log. Maybe you use it this way; I don't. I don't *want* git blame to >> make it seem as if 4f627f8 is not part of the history. For better or >> worse, it is. > > I would like to understand how you use git, then. What's one of your models > of using "git log" and/or "git blame" in which you foresee the revert making > history less clear, not more clear? Well, suppose I wanted to know what bugs were fixed between 9.5beta and 9.5.0, for example. I mean, I'm going to run git log src/backend/access/transam/multixact.c ... and the existing commits are going to be there. > By the way, it occurs to me that I should also make pg_upgrade blacklist the > range of catversions that might have data loss. No sense in putting ourselves > in the position of asking whether data files of a 9.9.3 cluster spent time in > a 9.5beta2 cluster. Maybe. But I think we could use a little more vigorous discussion of that issue, since Andres doesn't seem to be very convinced by your analysis, and I don't currently understand what you've fixed because I can't, as mentioned several times, follow your patch stack. >> Ripping it out and replacing it monolithically will not >> change that; it will only make the detailed history harder to >> reconstruct, and I *will* want to reconstruct it. > > What's something that might happen six months from now and lead you to inspect > master or 9.5 multixact.c between 4f627f8 and its revert? I don't know have anything to add to what others have said in response to this point, except this: the whole point of using a source code management system is to tell you what changed and when. What you are proposing to do makes it unusable for that purpose. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2015-12-10 08:55:54 -0500, Robert Haas wrote: > Maybe. But I think we could use a little more vigorous discussion of > that issue, since Andres doesn't seem to be very convinced by your > analysis, and I don't currently understand what you've fixed because I > can't, as mentioned several times, follow your patch stack. The issue at hand is that the following block oldestOffsetKnown = find_multixact_start(oldestMultiXactId, &oldestOffset); ...else if (prevOldestOffsetKnown){ /* * If we failed to get the oldest offset this time, but we have a * valuefrom a previous pass through this function, use the old value * rather than automatically forcing it. */ oldestOffset = prevOldestOffset; oldestOffsetKnown = true;} in SetOffsetVacuumLimit() fails to restore offsetStopLimit, which then is set in shared memory:/* Install the computed values */LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);MultiXactState->oldestOffset= oldestOffset;MultiXactState->oldestOffsetKnown = oldestOffsetKnown;MultiXactState->offsetStopLimit= offsetStopLimit;LWLockRelease(MultiXactGenLock); so, if find_multixact_start() failed - a "should never happen" occurance - we install a wrong stop limit. It does get 'repaired' upon the next suceeding find_multixact_start() in SetOffsetVacuumLimit() or a restart though. Adding a 'prevOffsetStopLimit' and using it seems like a ~5 line patch.
On Thu, Dec 10, 2015 at 9:04 AM, Andres Freund <andres@anarazel.de> wrote: > On 2015-12-10 08:55:54 -0500, Robert Haas wrote: >> Maybe. But I think we could use a little more vigorous discussion of >> that issue, since Andres doesn't seem to be very convinced by your >> analysis, and I don't currently understand what you've fixed because I >> can't, as mentioned several times, follow your patch stack. > > The issue at hand is that the following block > oldestOffsetKnown = > find_multixact_start(oldestMultiXactId, &oldestOffset); > > ... > else if (prevOldestOffsetKnown) > { > /* > * If we failed to get the oldest offset this time, but we have a > * value from a previous pass through this function, use the old value > * rather than automatically forcing it. > */ > oldestOffset = prevOldestOffset; > oldestOffsetKnown = true; > } > in SetOffsetVacuumLimit() fails to restore offsetStopLimit, which then > is set in shared memory: > /* Install the computed values */ > LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE); > MultiXactState->oldestOffset = oldestOffset; > MultiXactState->oldestOffsetKnown = oldestOffsetKnown; > MultiXactState->offsetStopLimit = offsetStopLimit; > LWLockRelease(MultiXactGenLock); > > so, if find_multixact_start() failed - a "should never happen" occurance > - we install a wrong stop limit. It does get 'repaired' upon the next > suceeding find_multixact_start() in SetOffsetVacuumLimit() or a restart > though. > > Adding a 'prevOffsetStopLimit' and using it seems like a ~5 line patch. So let's do that, then. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Dec 10, 2015 at 9:32 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Dec 10, 2015 at 9:04 AM, Andres Freund <andres@anarazel.de> wrote: >> On 2015-12-10 08:55:54 -0500, Robert Haas wrote: >>> Maybe. But I think we could use a little more vigorous discussion of >>> that issue, since Andres doesn't seem to be very convinced by your >>> analysis, and I don't currently understand what you've fixed because I >>> can't, as mentioned several times, follow your patch stack. >> >> The issue at hand is that the following block >> oldestOffsetKnown = >> find_multixact_start(oldestMultiXactId, &oldestOffset); >> >> ... >> else if (prevOldestOffsetKnown) >> { >> /* >> * If we failed to get the oldest offset this time, but we have a >> * value from a previous pass through this function, use the old value >> * rather than automatically forcing it. >> */ >> oldestOffset = prevOldestOffset; >> oldestOffsetKnown = true; >> } >> in SetOffsetVacuumLimit() fails to restore offsetStopLimit, which then >> is set in shared memory: >> /* Install the computed values */ >> LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE); >> MultiXactState->oldestOffset = oldestOffset; >> MultiXactState->oldestOffsetKnown = oldestOffsetKnown; >> MultiXactState->offsetStopLimit = offsetStopLimit; >> LWLockRelease(MultiXactGenLock); >> >> so, if find_multixact_start() failed - a "should never happen" occurance >> - we install a wrong stop limit. It does get 'repaired' upon the next >> suceeding find_multixact_start() in SetOffsetVacuumLimit() or a restart >> though. >> >> Adding a 'prevOffsetStopLimit' and using it seems like a ~5 line patch. > > So let's do that, then. Who is going to take care of this? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Dec 10, 2015 at 08:55:54AM -0500, Robert Haas wrote: > I don't know have anything to add to what others have said in response > to this point, except this: the whole point of using a source code > management system is to tell you what changed and when. What you are > proposing to do makes it unusable for that purpose. Based on your comments, I'm calling the patch series returned with feedback. I built the series around the goal of making history maximally reviewable for persons not insiders to commit 4f627f8. Having spent 90% of my 2015 PostgreSQL contribution time finding or fixing committed defects, my judgment of how best to achieve that is no shout from the peanut gallery. (Neither is your judgment.) In particular, I had in view two works, RLS and pg_audit, that used the post-commit repair strategy you've advocated. But you gave me a fair chance to make the case, and you stayed convinced that my repairs oppose my goal. I can now follow your development of that belief, which is enough.
Noah, Robert, All On 2015-12-11 11:20:21 -0500, Robert Haas wrote: > On Thu, Dec 10, 2015 at 9:32 AM, Robert Haas <robertmhaas@gmail.com> wrote: > >> Adding a 'prevOffsetStopLimit' and using it seems like a ~5 line patch. > > > > So let's do that, then. > > Who is going to take care of this? Here's two patches: 1) The fix to SetOffsetVacuumLimit(). I've tested this by introducing a probabilistic "return false;" to find_multixact_start(), and torturing postgres by burning through billions of multixactids of various sizes. Behaves about as bad^Wgood as without the induced failures; before the patch there were moments of spurious warnings/errors when ->offsetStopLimit was out of whack. 2) A subset of the comment changes from Noah's repository. Some of the comment changes didn't make sense without the removal SlruScanDirCbFindEarliest(), a small number of other changes I couldn't fully agree with. Noah, are you ok with pushing that subset of your changes? Is "Slightly edited subset of a larger patchset by Noah Misch" an OK attribution? Noah, on a first glance I think e50cca0ae ("Remove the SlruScanDirCbFindEarliest() test from TruncateMultiXact().") is a good idea. So I do encourage you to submit that as a separate patch. Regards, Andres
Attachment
On Sat, Dec 12, 2015 at 12:02 PM, Andres Freund <andres@anarazel.de> wrote: > Noah, Robert, All > > On 2015-12-11 11:20:21 -0500, Robert Haas wrote: >> On Thu, Dec 10, 2015 at 9:32 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> >> Adding a 'prevOffsetStopLimit' and using it seems like a ~5 line patch. >> > >> > So let's do that, then. >> >> Who is going to take care of this? > > Here's two patches: > > 1) The fix to SetOffsetVacuumLimit(). > > I've tested this by introducing a probabilistic "return false;" to > find_multixact_start(), and torturing postgres by burning through > billions of multixactids of various sizes. Behaves about as > bad^Wgood as without the induced failures; before the patch there > were moments of spurious warnings/errors when ->offsetStopLimit was > out of whack. I find the commit message you wrote a little difficult to read, and propose the following version instead, which reads better to me: Previously, if find_multixact_start() failed, SetOffsetVacuumLimit() would install 0 into MultiXactState->offsetStopLimit. Luckily, there are no known cases where find_multixact_start() will return an error in 9.5 and above. But if it were to happen, for example due to filesystem permission issues, it'd be somewhat bad: GetNewMultiXactId() could continue allocating mxids even if close to a wraparound, or it could erroneously stop allocating mxids, even if no wraparound is looming. The wrong value would be corrected the next time SetOffsetVacuumLimit() is called, or by a restart. (I have no comments on the substance of either patch and have reviewed the first one to a negligible degree - it doesn't look obviously wrong - and the second one not at all.) -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company