Re: [CORE] Restore-reliability mode - Mailing list pgsql-hackers

From Simon Riggs
Subject Re: [CORE] Restore-reliability mode
Msg-id CANP8+jJd+7hncAmHUZCETxSPf0Ef9uKh227LBK4xSA9p1k0AYw@mail.gmail.com
In response to Re: [CORE] Restore-reliability mode  (Bruce Momjian <bruce@momjian.us>)
Responses Re: [CORE] Restore-reliability mode  (Bruce Momjian <bruce@momjian.us>)
List pgsql-hackers
On 5 June 2015 at 16:05, Bruce Momjian <bruce@momjian.us> wrote:

Please address some of the specific issues I mentioned. 

I can discuss them, though not because I am directly involved. I take responsibility as a committer and have an interest from that perspective.

In my role at 2ndQuadrant, I approved all of the time Alvaro and Andres spent on submitting, reviewing and fixing bugs - at this point that has cost something close to fifty thousand dollars just on this feature and subsequent actions. (I believe the feature was originally funded, but we never saw a penny of that, though others did.)
 
The problem
with the multi-xact case is that we just kept fixing bugs as people
found them, and did not do a holistic review of the code. 

I observed much discussion and review. The bugs we've had have all been fairly straightforwardly fixed. There haven't been any design-level oversights or head-palm moments. It's complex software with complex behaviour, and that caused problems. The problem has been that anything on-disk causes more problems when errors occur. We should carefully review anything that alters the way on-disk structures work, like the WAL changes, UPSERT's new mechanism, etc.

From my side, it is only recently that I got some clear answers to my questions about how it worked. I think it is very important that major features have extensive README-type documentation with them, so the underlying principles used in the development are clear. I would define the measure of a good feature as whether another committer can read the code comments and get a good feel for it. A bad feature is one where committers walk away from it, saying "I don't really get it, and I can't read an explanation of why it does that". Tom's most significant contribution is his long descriptive comments on what the problem is that needs to be solved, the options and the method chosen. Clarity of thought is what solves bugs.

Overall, I don't see the need to stop the normal release process and do a holistic review. But I do think we should check each feature to see whether it is fully documented or whether we are simply trusting one of us to be around to fix it.

I am just saying we need to ask the
reliability question _first_.

Agreed
 
Let me restate something that has appeared in many replies to my ideas
--- I am not asking for infinite or unbounded review, but I am asking
that we make sure reliability gets the proper focus in relation to our
time pressures.  Our balance was so off a month ago that I feel only a
full stop on time pressure would allow us to refocus because people are
not good at focusing on multiple things. It is sometimes necessary to
stop everything to get people's attention, and to help them remember
that without reliability, a database is useless.

Here, I think we are talking about different types of reliability. PostgreSQL software is well ahead of most industry measures of quality; these recent bugs have done nothing to damage that, other than that a few people woke up and said "Wow! Postgres had a bug??!?!?". The presence of bugs is common, and if we have grown unused to them, we should be wary of that, though not tolerant of them.

PostgreSQL is now reliable in the sense that we have many features that ensure availability even in the face of software problems and bug-induced corruption. Those have helped us get out of the current situations, giving users a workaround while bugs are fixed. So the impact of database software bugs is not what it once was.

Reliable delivery of new versions of software is important too. New versions often contain new features that fix real-world problems, just as much as bug fixes do, which is why I don't wish to divert from the normal process and schedule.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
