Re: Restore-reliability mode - Mailing list pgsql-hackers

From Noah Misch
Subject Re: Restore-reliability mode
Msg-id 20150606195805.GA118899@tornado.leadboat.com
In response to Re: Restore-reliability mode  (Simon Riggs <simon@2ndQuadrant.com>)
Responses Re: Restore-reliability mode  (Michael Paquier <michael.paquier@gmail.com>)
Re: Restore-reliability mode  (Peter Geoghegan <pg@heroku.com>)
Re: Restore-reliability mode  (Bruce Momjian <bruce@momjian.us>)
Re: Restore-reliability mode  (Alvaro Herrera <alvherre@2ndquadrant.com>)
List pgsql-hackers
On Fri, Jun 05, 2015 at 08:25:34AM +0100, Simon Riggs wrote:
> This whole idea of "feature development" vs reliability is bogus. It
> implies people that work on features don't care about reliability. Given
> the fact that many of the features are actually about increasing database
> reliability in the event of crashes and corruptions it just makes no sense.

I'm contrasting work that helps to keep our existing promises ("reliability")
with work that makes new promises ("features").  In software development, we
invariably hazard old promises to make new promises; our success hinges on
electing neither too little nor too much risk.  Two years ago, PostgreSQL's
track record had placed it in a good position to invest in new, high-risk,
high-reward promises.  We did that, and we emerged solvent yet carrying an
elevated debt service ratio.  It's time to reduce risk somewhat.

You write about a different sense of "reliability."  (Had I anticipated this
misunderstanding, I might have written "Restore-probity mode.")  None of this
was about classifying people, most of whom allocate substantial time to each
kind of work.

> How will we participate in cleanup efforts? How do we know when something
> has been "cleaned up", how will we measure our success or failure? I think
> we should be clear that wasting N months on cleanup can *fail* to achieve a
> useful objective. Without a clear plan it almost certainly will do so. The
> flip side is that wasting N months will cause great amusement and dancing
> amongst those people who wish to pull ahead of our open source project and
> we should take care not to hand them a victory from an overreaction.

I agree with all that.  We should likewise take care not to become insolvent
from an underreaction.

> So lets do our normal things, not do a "total stop" for an indefinite
> period. If someone has specific things that in their opinion need to be
> addressed, list them and we can talk about doing them, together.

I recommend these four exit criteria:

1. Non-author committer review of foreign keys locks/multixact durability.
   Done when that committer certifies, as if he were committing the patch
   himself today, that the code will not eat data.

2. Non-author committer review of row-level security.  Done when that
   committer certifies that the code keeps its promises and that the
   documentation bounds those promises accurately.

3. Second committer review of the src/backend/access changes for INSERT ... ON
   CONFLICT DO NOTHING/UPDATE.  (Bugs affecting folks who don't use the new
   syntax are most likely to fall in that portion.)  Unlike the previous two
   criteria, a review without certification is sufficient.

4. Non-author committer certifying that the 9.5 WAL format changes will not
   eat your data.  The patch lists Andres and Alvaro as reviewers; if they
   already reviewed it enough to make that certification, this one is easy.

That ties up four people.  For everyone else:

- Fix bugs those reviews find.  This will start slow but will grow to keep
  everyone busy.  Committers won't certify code, and thus we can't declare
  victory, until these bugs are fixed.  The rest of this list, in contrast,
  calls out topics to sample from, not topics to exhaust.

- Turn current buildfarm members green.

- Write, review and commit more automated test machinery to PostgreSQL.  Test
  whatever excites you.  If you need ideas, Craig posted some good ones
  upthread.  Here are a few more:
  - Add a debug mode that calls sched_yield() in SpinLockRelease(); see
    6322.1406219591@sss.pgh.pa.us.
  - Improve TAP suite (src/test/perl/TestLib.pm) logging.  Currently, these
    suites redirect much output to /dev/null.  Instead, log that output and
    teach the buildfarm to capture the log.
  - Call VALGRIND_MAKE_MEM_NOACCESS() on a shared buffer when its local pin
    count falls to zero.  Under CLOBBER_FREED_MEMORY, wipe a shared buffer
    when its global pin count falls to zero.
  - With assertions enabled, or perhaps in a new debug mode, have
    pg_do_encoding_conversion() and pg_server_to_any() check the data for a
    no-op conversion instead of assuming the data is valid.

- Add buildfarm members.  This entails reporting any bugs that prevent an
  initial passing run.  Once you have a passing run, schedule regular runs.
  Examples of useful additions:
  - "./configure ac_cv_func_getopt_long=no, ac_cv_func_snprintf=no..." to
    enable all the replacement code regardless of the current platform's need
    for it.  This helps distinguish "Windows bug" from "replacement code bug."
  - --disable-integer-datetimes, --disable-float8-byval,
    --disable-float4-byval, --disable-spinlocks, --disable-atomics,
    --disable-thread-safety, --disable-largefile, #define
    RANDOMIZE_ALLOCATED_MEMORY
  - Any OS or CPU architecture other than x86 GNU/Linux, even ones already
    represented.

- Write, review and commit fixes for the bugs that come to light by way of these new automated tests.

- Anything else targeted to make PostgreSQL keep the promises it has already made to our users.
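
To make the sched_yield() idea from the test-machinery item concrete, here is
a minimal, standalone sketch.  It is not PostgreSQL's s_lock code or its
SpinLockRelease() macro.  Every name in it (demo_spinlock, demo_spin_release,
DEMO_YIELD_ON_RELEASE, demo_run) is hypothetical, and it assumes C11 atomics
plus POSIX threads.  The only point is to show where such a debug-mode yield
would sit: immediately after the lock is dropped, so that other waiters get a
chance to run at exactly the moment the releasing code is most likely to be
relying on state it no longer protects.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

/* Hypothetical debug switch; remove the define to get a plain release. */
#define DEMO_YIELD_ON_RELEASE 1
#define DEMO_MAX_THREADS 16

/* Hypothetical toy spinlock; PostgreSQL's slock_t is platform-specific. */
typedef struct { atomic_flag locked; } demo_spinlock;

static demo_spinlock demo_lock = { ATOMIC_FLAG_INIT };
static long demo_counter;       /* protected by demo_lock */
static int  demo_iters;         /* set before workers start */

static void demo_spin_acquire(demo_spinlock *l)
{
    /* Busy-wait until the flag was previously clear. */
    while (atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire))
        ;
}

static void demo_spin_release(demo_spinlock *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
#ifdef DEMO_YIELD_ON_RELEASE
    /* Debug mode: give up the CPU right after releasing the lock. */
    sched_yield();
#endif
}

static void *demo_worker(void *arg)
{
    (void) arg;
    for (int i = 0; i < demo_iters; i++)
    {
        demo_spin_acquire(&demo_lock);
        demo_counter++;
        demo_spin_release(&demo_lock);
    }
    return NULL;
}

/* Runs nthreads workers; returns the final counter, or -1 on error. */
long demo_run(int nthreads, int iters)
{
    pthread_t threads[DEMO_MAX_THREADS];
    int started = 0;

    if (nthreads < 1 || nthreads > DEMO_MAX_THREADS || iters < 0)
        return -1;

    demo_counter = 0;
    demo_iters = iters;

    while (started < nthreads &&
           pthread_create(&threads[started], NULL, demo_worker, NULL) == 0)
        started++;

    /* Join whatever was started, even if a later create failed. */
    for (int i = 0; i < started; i++)
        pthread_join(threads[i], NULL);

    return (started == nthreads) ? demo_counter : -1;
}
```

Calling demo_run(4, 10000) from a main() built with -pthread should return
the same total with or without the yield.  The yield only perturbs
scheduling, so a difference in results would point at code that is not
actually protected by the lock.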


