Re: Multi-xacts and our process problem - Mailing list pgsql-hackers

From Robert Haas
Subject Re: Multi-xacts and our process problem
Msg-id CA+TgmoZ=VURRy7799nd2kE_5UeDEcFd_BkUUMx0drN=BWHEeLw@mail.gmail.com
In response to Re: Multi-xacts and our process problem  (Peter Geoghegan <pg@heroku.com>)
Responses Re: Multi-xacts and our process problem
List pgsql-hackers
On Tue, May 12, 2015 at 3:12 AM, Peter Geoghegan <pg@heroku.com> wrote:
> On Mon, May 11, 2015 at 11:42 PM, Noah Misch <noah@leadboat.com> wrote:
>> it came out that most people had identified fklocks as the highest-risk 9.3
>> patch.  Here's an idea.  Shortly after the 9.5 release notes draft, let's take
>> a secret ballot to identify the changes threatening the most damage through
>> undiscovered bugs.  (Let's say the electorate consists of every committer and
>> every person who reviewed at least one patch during the release cycle.)
>> Publish the three top vote totals.  This serves a few purposes.  It sends a
>> message to each original committer that the community doubts his handling of
>> the change.  The secret ballot helps voters be honest, and seven votes against
>> your commit is hard to ignore.  It's a hint to that committer to drum up more
>> reviews and testing, to pick a simpler project next time, or even to revert.
>> The poll results would also help target beta testing and post-commit reviews.
>> For example, I would plan to complete a full post-commit review of one patch
>> in the list.
>
> The highest risk item identified for 9.4 was the B-Tree bug fix
> patches, IIRC. It was certainly mentioned this time last year as the
> most likely candidate (during the 2014 developer meeting). I'm
> suspicious of this kind of ballot. While 9.4 has not been out for that
> long, evidence that that B-Tree stuff is in any way destabilizing is
> still thin on the ground, a year later.
>
> Anyone that identified fklocks as the highest risk 9.3 item shouldn't
> be too proud of their correct prediction. If you just look at the
> release notes, it's completely obvious, even to someone who doesn't
> know what a MultiXact is.

I think that's rather facile, and I really don't see how you would
know that from looking at those release notes.  I thought multixacts
had risk, but obviously nobody came close to predicting how bad things
were going to be.  If they had, I'm pretty sure we would have pulled
the patch.  The fact that the 9.4 btree changes weren't equally
destabilizing doesn't mean that they weren't risky.  There was a risk
that the Cuban missile crisis would start a nuclear war; in the end,
it didn't, but that doesn't mean there was no risk.

Part of what went wrong with multixacts is that neither Alvaro nor anyone
who reviewed the patch gave adequate thought to the vacuum
requirements.  There was a whole series of things that needed to be
done there which just weren't done.  I think if it had been realized
how much work remained to do there, and how necessary it was for every
single bit of machinery that we have for freezing xmin to also exist
for freezing xmax, we would not have gone forward.  Conceptual
failures, where there is a whole class of work that you just don't
even realize needs to be done, are much more damaging than mechanical
errors, where you realize that something needs to be done but you
don't do it correctly.

As an example, take Tom's patch to speed up the parameter setup hooks
for PL/pgsql.  Here's his initial analysis:

http://www.postgresql.org/message-id/4146.1425872254@sss.pgh.pa.us

Then a lot of arguing about non-technical points ensued, followed
eventually by this:

http://www.postgresql.org/message-id/25506.1426029880@sss.pgh.pa.us

We all have ideas like that - things that initially seem like good
ideas, but there's some crucial conceptual point that we're missing
that means that the patch doesn't just need bug fixes, but rather the
whole idea needs to be reconsidered.  If we find that out before
release, we can pull the whole thing back.  If we find it out after
release, things get a lot harder.  In the case of multixacts, we
didn't realize that we'd overlooked significant pieces of work until
after the thing was shipped.

Another crucial difference between the multixact patch and many other
patches is that it wasn't a feature you could turn off.  For example,
if BRIN has bugs, you can almost certainly avoid hitting them by not
using BRIN.  And many people won't, so even if the feature turns out
to be horrifically buggy, 90%+ of our users will not even notice.
ALTER TABLE .. SET LOGGED/UNLOGGED may easily have bugs that eat your
data, but if you don't use it, then you won't be affected.  Of the
major user-visible features committed to 9.5 that could hose our users
more broadly, I'd put RLS and UPSERT pretty high on the list.  We
might be lucky enough that any breakage there is confined to users of
those features, but the code is not as contained as it is for
something like BRIN, so there is a risk of breaking other stuff.
Departing from what's user-visible, Heikki's WAL format changes could
break recovery badly for everyone and we could just be screwed.  That
risk is particularly acute because we really can't change the WAL
format once the release is shipped.  If it's broken, we're probably in
big trouble.  Multixacts, too, fell into this category of things that
cannot be turned off: they touched the heap storage format, and anyone
who used foreign keys (which is nearly everyone) really had no choice
but to use them.

Finally, the multixact patch fell prey to reverse bikeshed syndrome.
It was a big complicated patch that most people couldn't really
understand (because it was big and complicated) so we just ignored it.
I certainly did that.  I may have participated in some mailing list
threads, but I didn't really understand what was going on in detail
and I didn't study it at a level of detail that would have led me to
find problems.  I was nervous about it, but instead of digging into
that, I just assumed it was probably OK.  I think many other people
probably did likewise.  The fact that it's been a long time since
we've done something that caused serious, hard-to-fix reliability
problems likely contributed to our sense that things wouldn't go too
far wrong.

All of these things combined in an explosive fashion.  If the patch
had been simple enough to be broadly understandable, or if it had been
something that could plausibly have come with an "off" switch, or if
anyone had realized that there were whole areas that had not been
thought through carefully, the consequences would have been much less
serious.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


