Re: Multi-xacts and our process problem - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: Multi-xacts and our process problem |
Date | |
Msg-id | CA+TgmoZ=VURRy7799nd2kE_5UeDEcFd_BkUUMx0drN=BWHEeLw@mail.gmail.com |
In response to | Re: Multi-xacts and our process problem (Peter Geoghegan <pg@heroku.com>) |
Responses | Re: Multi-xacts and our process problem |
List | pgsql-hackers |
On Tue, May 12, 2015 at 3:12 AM, Peter Geoghegan <pg@heroku.com> wrote:
> On Mon, May 11, 2015 at 11:42 PM, Noah Misch <noah@leadboat.com> wrote:
>> it came out that most people had identified fklocks as the highest-risk 9.3
>> patch. Here's an idea. Shortly after the 9.5 release notes draft, let's take
>> a secret ballot to identify the changes threatening the most damage through
>> undiscovered bugs. (Let's say the electorate consists of every committer and
>> every person who reviewed at least one patch during the release cycle.)
>> Publish the three top vote totals. This serves a few purposes. It sends a
>> message to each original committer that the community doubts his handling of
>> the change. The secret ballot helps voters be honest, and seven votes against
>> your commit is hard to ignore. It's a hint to that committer to drum up more
>> reviews and testing, to pick a simpler project next time, or even to revert.
>> The poll results would also help target beta testing and post-commit reviews.
>> For example, I would plan to complete a full post-commit review of one patch
>> in the list.
>
> The highest risk item identified for 9.4 was the B-Tree bug fix
> patches, IIRC. It was certainly mentioned this time last year as the
> most likely candidate (during the 2014 developer meeting). I'm
> suspicious of this kind of ballot. While 9.4 has not been out for that
> long, evidence that that B-Tree stuff is in any way destabilizing is
> still thin on the ground, a year later.
>
> Anyone that identified fklocks as the highest risk 9.3 item shouldn't
> be too proud of their correct prediction. If you just look at the
> release notes, it's completely obvious, even to someone who doesn't
> know what a MultiXact is.

I think that's rather facile, and I really don't see how you would know that from looking at those release notes. I thought multixacts had risk, but obviously nobody came close to predicting how bad things were going to be.
If they had, I'm pretty sure we would have pulled the patch.

The fact that the 9.4 btree changes weren't equally destabilizing doesn't mean that they weren't risky. There was a risk that the Cuban missile crisis would start a nuclear war; in the end, it didn't, but that doesn't mean there was no risk.

Part of what went wrong with multixacts is that neither Alvaro nor anyone who reviewed the patch gave adequate thought to the vacuum requirements. There was a whole series of things that needed to be done there which just weren't done. I think if it had been realized how much work remained to do there, and how necessary it was for every single bit of machinery that we have for freezing xmin to also exist for freezing xmax, we would not have gone forward. Conceptual failures, where there is a whole class of work that you just don't even realize needs to be done, are much more damaging than mechanical errors, where you realize that something needs to be done but you don't do it correctly.

As an example, take Tom's patch to speed up the parameter setup hooks for PL/pgsql. Here's his initial analysis:

http://www.postgresql.org/message-id/4146.1425872254@sss.pgh.pa.us

Then a lot of arguing about non-technical points ensued, followed eventually by this:

http://www.postgresql.org/message-id/25506.1426029880@sss.pgh.pa.us

We all have ideas like that - things that initially seem like good ideas, but there's some crucial conceptual point that we're missing, which means that the patch doesn't just need bug fixes; the whole idea needs to be reconsidered. If we find that out before release, we can pull the whole thing back. If we find it out after release, things get a lot harder. In the case of multixacts, we didn't realize that we'd overlooked significant pieces of work until after the thing was shipped.

Another crucial difference between the multixact patch and many other patches is that it wasn't a feature you could turn off.
For example, if BRIN has bugs, you can almost certainly avoid hitting them by not using BRIN. And many people won't, so even if the feature turns out to be horrifically buggy, 90%+ of our users will not even notice. ALTER TABLE .. SET LOGGED/UNLOGGED may easily have bugs that eat your data, but if you don't use it, then you won't be affected.

Of the major user-visible features committed to 9.5 that could hose our users more broadly, I'd put RLS and UPSERT pretty high on the list. We might be lucky enough that any breakage there is confined to users of those features, but the code is not as contained as it is for something like BRIN, so there is a risk of breaking other stuff. Departing from what's user-visible, Heikki's WAL format changes could break recovery badly for everyone and we could just be screwed. That risk is particularly acute because we really can't change the WAL format once the release is shipped. If it's broken, we're probably in big trouble. Multixacts, too, fell into this category of things that cannot be turned off: they touched the heap storage format, and anyone who used foreign keys (which is nearly everyone) really had no choice but to use them.

Finally, the multixact patch fell prey to reverse bikeshed syndrome. It was a big complicated patch that most people couldn't really understand (because it was big and complicated) so we just ignored it. I certainly did that. I may have participated in some mailing list threads, but I didn't really understand what was going on in detail and I didn't study it in a level of detail that would have led me to find problems. I was nervous about it, but instead of digging into that, I just assumed it was probably OK. I think many other people probably did likewise. The fact that it's been a long time since we've done something that caused a serious, hard-to-fix reliability problem likely contributed to our sense that things wouldn't go too far wrong.

All of these things combined in an explosive fashion.
If the patch had been simple enough to be broadly understandable, or if it had been something that could plausibly have come with an "off" switch, or if anyone had realized that there were whole areas that had not been thought through carefully, the consequences would have been much less serious.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company