Thread: postpone next week's release

postpone next week's release

From
Robert Haas
Date:
Hi,

I think we should postpone next week's release.  I have been hard at
work on the multixact-related bugs that were reported in 9.4.2 and
9.3.7, and the subsequent bugs found by code-reading, but getting them
all fixed by Monday doesn't seem realistic.  Such fixes should have
careful review, and not be dashed into the tree under time pressure.

We could do the release anyway to relieve the pain caused by the
fsync-pgdata hard-failure problem, but it seems to me that if we do
that, we're just going to end up having to do yet another release
almost right away.  I think it would be better to wait and do one
release that fixes both sets of issues.

Thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: postpone next week's release

From
Bruce Momjian
Date:
On Fri, May 29, 2015 at 02:02:43PM -0400, Robert Haas wrote:
> Hi,
> 
> I think we should postpone next week's release.  I have been hard at
> work on the multixact-related bugs that were reported in 9.4.2 and
> 9.3.7, and the subsequent bugs found by code-reading, but getting them
> all fixed by Monday doesn't seem realistic.  Such fixes should have
> careful review, and not be dashed into the tree under time pressure.
> 
> We could do the release anyway to relieve the pain caused by the
> fsync-pgdata hard-failure problem, but it seems to me that if we do
> that, we're just going to end up having to do yet another release
> almost right away.  I think it would be better to wait and do one
> release that fixes both sets of issues.

It does seem wise to make sure we have all these items fixed.  We have
PR'ed the recovery failure issue so I think we are good at this point. 
I see having to put out another multi-xact-only fix release the week
after as being a bigger negative.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +



Re: postpone next week's release

From
Stephen Frost
Date:
* Robert Haas (robertmhaas@gmail.com) wrote:
> I think we should postpone next week's release.  I have been hard at
> work on the multixact-related bugs that were reported in 9.4.2 and
> 9.3.7, and the subsequent bugs found by code-reading, but getting them
> all fixed by Monday doesn't seem realistic.  Such fixes should have
> careful review, and not be dashed into the tree under time pressure.
>
> We could do the release anyway to relieve the pain caused by the
> fsync-pgdata hard-failure problem, but it seems to me that if we do
> that, we're just going to end up having to do yet another release
> almost right away.  I think it would be better to wait and do one
> release that fixes both sets of issues.

Agreed.

I just caution that we appreciate PGCon coming up and that we do our
best to avoid running into a case where we have to push it further due
to everyone being at the conference.
Thanks!
    Stephen

Re: [CORE] postpone next week's release

From
Bruce Momjian
Date:
On Fri, May 29, 2015 at 02:54:31PM -0400, Stephen Frost wrote:
> * Robert Haas (robertmhaas@gmail.com) wrote:
> > I think we should postpone next week's release.  I have been hard at
> > work on the multixact-related bugs that were reported in 9.4.2 and
> > 9.3.7, and the subsequent bugs found by code-reading, but getting them
> > all fixed by Monday doesn't seem realistic.  Such fixes should have
> > careful review, and not be dashed into the tree under time pressure.
> > 
> > We could do the release anyway to relieve the pain caused by the
> > fsync-pgdata hard-failure problem, but it seems to me that if we do
> > that, we're just going to end up having to do yet another release
> > almost right away.  I think it would be better to wait and do one
> > release that fixes both sets of issues.
> 
> Agreed.
> 
> I just caution that we appreciate PGCon coming up and that we do our
> best to avoid running into a case where we have to push it further due
> to everyone being at the conference.

This brings up the issue of when we want to do 9.5 beta.  Ideas?

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +



Re: [CORE] postpone next week's release

From
Magnus Hagander
Date:
On Fri, May 29, 2015 at 8:02 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Hi,
>
> I think we should postpone next week's release.  I have been hard at
> work on the multixact-related bugs that were reported in 9.4.2 and
> 9.3.7, and the subsequent bugs found by code-reading, but getting them
> all fixed by Monday doesn't seem realistic.  Such fixes should have
> careful review, and not be dashed into the tree under time pressure.
>
> We could do the release anyway to relieve the pain caused by the
> fsync-pgdata hard-failure problem, but it seems to me that if we do
> that, we're just going to end up having to do yet another release
> almost right away.  I think it would be better to wait and do one
> release that fixes both sets of issues.
>
> Thoughts?

I'm a bit split on this.

We *definitely* don't want to release the multixact fix without it being carefully reviewed, that's the part I'm not split about :) And I fully appreciate we can't have that done by Monday.

However, the file-permission thing seems to hit quite a few people (have we ever had this many bug reports after a minor release?), which means we'd really want to get that out quickly.

Do you have any feeling of how likely people are to actually hit the multixact one? I've followed some of that impressive debugging you guys did, and I know it's a pretty critical bug if you hit it, but how wide-spread will it be?

I guess one option we could do is encourage packagers to push updated packages (-2 versions) basically. But if we do that, perhaps we might as well release anyway?

AIUI, the permission thing won't actually be very likely to affect Windows users. And Windows packages are the ones that take by far the most work to make. Perhaps we should consider skipping making packages of that version on Windows, and then plan to push yet another minor one or two weeks later, that goes out on all platforms?


Re: [CORE] postpone next week's release

From
Magnus Hagander
Date:
On Fri, May 29, 2015 at 8:54 PM, Stephen Frost <sfrost@snowman.net> wrote:
> * Robert Haas (robertmhaas@gmail.com) wrote:
> > I think we should postpone next week's release.  I have been hard at
> > work on the multixact-related bugs that were reported in 9.4.2 and
> > 9.3.7, and the subsequent bugs found by code-reading, but getting them
> > all fixed by Monday doesn't seem realistic.  Such fixes should have
> > careful review, and not be dashed into the tree under time pressure.
> >
> > We could do the release anyway to relieve the pain caused by the
> > fsync-pgdata hard-failure problem, but it seems to me that if we do
> > that, we're just going to end up having to do yet another release
> > almost right away.  I think it would be better to wait and do one
> > release that fixes both sets of issues.
>
> Agreed.
>
> I just caution that we appreciate PGCon coming up and that we do our
> best to avoid running into a case where we have to push it further due
> to everyone being at the conference.

If we plan it, we certainly *can* make a release during pgcon. If that's what the reasonable timing comes down to, I think getting these fixes out definitely has to be considered more important than the conference, so a few of us will just have to take a break... 



Re: [CORE] postpone next week's release

From
Robert Haas
Date:
On Fri, May 29, 2015 at 3:09 PM, Magnus Hagander <magnus@hagander.net> wrote:
> Do you have any feeling of how likely people are to actually hit the
> multixact one? I've followed some of that impressive debugging you guys did,
> and I know it's a pretty critical bug if you hit it, but how wide-spread
> will it be?

That precise problem has been reported a few times, but it may not be
widespread.  I don't know.  My bigger concern is that, at present,
taking a base backup is broken.  I haven't figured out the exact
reproduction scenario, but I think it's something like this:

- begin base backup
- checkpoint happens, truncating pg_multixact
- at this point pg_multixact gets copied
- end base backup

I think what will happen on replay is that replaying the checkpoint,
it will try to reference pg_multixact files that don't exist any more
and die with a fatal error.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [CORE] postpone next week's release

From
Stephen Frost
Date:
* Magnus Hagander (magnus@hagander.net) wrote:
> On Fri, May 29, 2015 at 8:54 PM, Stephen Frost <sfrost@snowman.net> wrote:
>
> > * Robert Haas (robertmhaas@gmail.com) wrote:
> > > I think we should postpone next week's release.  I have been hard at
> > > work on the multixact-related bugs that were reported in 9.4.2 and
> > > 9.3.7, and the subsequent bugs found by code-reading, but getting them
> > > all fixed by Monday doesn't seem realistic.  Such fixes should have
> > > careful review, and not be dashed into the tree under time pressure.
> > >
> > > We could do the release anyway to relieve the pain caused by the
> > > fsync-pgdata hard-failure problem, but it seems to me that if we do
> > > that, we're just going to end up having to do yet another release
> > > almost right away.  I think it would be better to wait and do one
> > > release that fixes both sets of issues.
> >
> > Agreed.
> >
> > I just caution that we appreciate PGCon coming up and that we do our
> > best to avoid running into a case where we have to push it further due
> > to everyone being at the conference.
>
> If we plan it, we certainly *can* make a release during pgcon. If that's
> what the reasonable timing comes down to, I think getting these fixes out
> definitely has to be considered more important than the conference, so a
> few of us will just have to take a break...

I don't disagree with you about any of that, just wanted to make mention
of the timing.
Thanks!
    Stephen

Re: [CORE] postpone next week's release

From
"Joshua D. Drake"
Date:
On 05/29/2015 12:18 PM, Robert Haas wrote:
>
> On Fri, May 29, 2015 at 3:09 PM, Magnus Hagander <magnus@hagander.net> wrote:
>> Do you have any feeling of how likely people are to actually hit the
>> multixact one? I've followed some of that impressive debugging you guys did,
>> and I know it's a pretty critical bug if you hit it, but how wide-spread
>> will it be?
>
> That precise problem has been reported a few times, but it may not be
> widespread.  I don't know.  My bigger concern is that, at present,
> taking a base backup is broken.

This I think is the bigger issue. They both are horrible but basebackup 
being broken is rather... egregious.

JD


-- 
Command Prompt, Inc. - http://www.commandprompt.com/  503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Announcing "I'm offended" is basically telling the world you can't
control your own emotions, so everyone else should do it for you.



Re: [CORE] postpone next week's release

From
Tom Lane
Date:
Magnus Hagander <magnus@hagander.net> writes:
> On Fri, May 29, 2015 at 8:54 PM, Stephen Frost <sfrost@snowman.net> wrote:
>> I just caution that we appreciate PGCon coming up and that we do our
>> best to avoid running into a case where we have to push it further due
>> to everyone being at the conference.

> If we plan it, we certainly *can* make a release during pgcon. If that's
> what the reasonable timing comes down to, I think getting these fixes out
> definitely has to be considered more important than the conference, so a
> few of us will just have to take a break...

I think there's no way that we wait more than one additional week to push
the fsync fix.  So the problem is not with scheduling the update releases,
it's with whether we can also fit in a 9.5 beta release before PGCon.

(I can't see doing a beta *during* PGCon week.  I for one am going to be
on an airplane at the time I'd normally have to be Doing Release Stuff.)

I know Josh doesn't like to do beta1 releases concurrently with back
branches because it confuses the PR messaging.  But we could make an
exception perhaps; or do all those releases the same week but announce
the beta the day after the bugfix releases.

Or we just let the beta slide till after PGCon, but then I think we're
missing some excitement factor.
        regards, tom lane



Re: [CORE] postpone next week's release

From
Magnus Hagander
Date:
On Fri, May 29, 2015 at 9:32 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Magnus Hagander <magnus@hagander.net> writes:
> > On Fri, May 29, 2015 at 8:54 PM, Stephen Frost <sfrost@snowman.net> wrote:
> >> I just caution that we appreciate PGCon coming up and that we do our
> >> best to avoid running into a case where we have to push it further due
> >> to everyone being at the conference.
>
> > If we plan it, we certainly *can* make a release during pgcon. If that's
> > what the reasonable timing comes down to, I think getting these fixes out
> > definitely has to be considered more important than the conference, so a
> > few of us will just have to take a break...
>
> I think there's no way that we wait more than one additional week to push
> the fsync fix.  So the problem is not with scheduling the update releases,
> it's with whether we can also fit in a 9.5 beta release before PGCon.

I think 9.5 beta has to stand back. The question is what we do with the potentially two minor releases. Then we can slot in the beta whenever.

If we do the minor as currently planned, can we do another one the week after to deal with the multixact issues? (scheduling wise we're going to have to do one the week after *regardless*, the question is if we can make two different ones, or if we need to fold them into one)


> (I can't see doing a beta *during* PGCon week.  I for one am going to be
> on an airplane at the time I'd normally have to be Doing Release Stuff.)

Agreed. We can push a *minor* during pgcon, but not beta.


> I know Josh doesn't like to do beta1 releases concurrently with back
> branches because it confuses the PR messaging.  But we could make an
> exception perhaps; or do all those releases the same week but announce
> the beta the day after the bugfix releases.


I can't comment on the PR parts, I'll leave that to Josh.

 

> Or we just let the beta slide till after PGCon, but then I think we're
> missing some excitement factor.

Well, most of the people going to pgcon know it already. And most of the excitement affects people who are not at pgcon (simply because most of our users are not at pgcon). If doing it the week after pgcon is what ends up making sense once we've figured out what to do with the minors, then so be it, IMNSHO.



Re: [CORE] postpone next week's release

From
Stephen Frost
Date:
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> (I can't see doing a beta *during* PGCon week.  I for one am going to be
> on an airplane at the time I'd normally have to be Doing Release Stuff.)
[...]
> Or we just let the beta slide till after PGCon, but then I think we're
> missing some excitement factor.

Personally, I'd be all for a "watch Tom do the 9.5 beta release!"
Unconference slot...

:)

(mostly kidding, but I'm 100% sure it'd draw a huge crowd..)
Thanks!
    Stephen

Re: [CORE] postpone next week's release

From
Tom Lane
Date:
Magnus Hagander <magnus@hagander.net> writes:
> On Fri, May 29, 2015 at 9:32 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I think there's no way that we wait more than one additional week to push
>> the fsync fix.  So the problem is not with scheduling the update releases,
>> it's with whether we can also fit in a 9.5 beta release before PGCon.

> I think 9.5 beta has to stand back. The question is what we do with the
> potentially two minor releases. Then we can slot in the beta whenever.

> If we do the minor as currently planned, can we do another one the week
> after to deal with the multixact issues? (scheduling wise we're going to
> have to do one the week after *regardless*, the question is if we can make
> two different ones, or if we need to fold them into one)

I suppose we could, but it doubles the amount of release gruntwork
involved, and it doesn't exactly make us look good to our users either.

I believe Christoph indicated that he was going to cherry-pick the fsync
patch and push out an intermediate Debian package with that fix, so at
least for that community there is not an urgent reason to get out a set
of releases with only the fsync fixes and not the multixact fixes.  I'm
not clear though on how many of the other reports we heard came from
Debian users.  (Some of them did, but maybe not all.)
        regards, tom lane



Re: [CORE] postpone next week's release

From
Stephen Frost
Date:
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Magnus Hagander <magnus@hagander.net> writes:
> > On Fri, May 29, 2015 at 9:32 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> I think there's no way that we wait more than one additional week to push
> >> the fsync fix.  So the problem is not with scheduling the update releases,
> >> it's with whether we can also fit in a 9.5 beta release before PGCon.
>
> > I think 9.5 beta has to stand back. The question is what we do with the
> > potentially two minor releases. Then we can slot in the beta whenever.
>
> > If we do the minor as currently planned, can we do another one the week
> > after to deal with the multixact issues? (scheduling wise we're going to
> > have to do one the week after *regardless*, the question is if we can make
> > two different ones, or if we need to fold them into one)
>
> I suppose we could, but it doubles the amount of release gruntwork
> involved, and it doesn't exactly make us look good to our users either.

Agreed.  Makes it look like we can't manage to figure out our bugs and
put fixes for them together in sensible releases..
Thanks!
    Stephen

Re: [CORE] postpone next week's release

From
Magnus Hagander
Date:
On Fri, May 29, 2015 at 9:46 PM, Stephen Frost <sfrost@snowman.net> wrote:
> * Tom Lane (tgl@sss.pgh.pa.us) wrote:
> > Magnus Hagander <magnus@hagander.net> writes:
> > > On Fri, May 29, 2015 at 9:32 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > >> I think there's no way that we wait more than one additional week to push
> > >> the fsync fix.  So the problem is not with scheduling the update releases,
> > >> it's with whether we can also fit in a 9.5 beta release before PGCon.
> >
> > > I think 9.5 beta has to stand back. The question is what we do with the
> > > potentially two minor releases. Then we can slot in the beta whenever.
> >
> > > If we do the minor as currently planned, can we do another one the week
> > > after to deal with the multixact issues? (scheduling wise we're going to
> > > have to do one the week after *regardless*, the question is if we can make
> > > two different ones, or if we need to fold them into one)
> >
> > I suppose we could, but it doubles the amount of release gruntwork
> > involved, and it doesn't exactly make us look good to our users either.
>
> Agreed.  Makes it look like we can't manage to figure out our bugs and
> put fixes for them together in sensible releases..

The flipside of that is that we have a fix for a bug that's preventing people's databases from starting, and we're then intentionally delaying the shipment of it. Though I guess a mitigating fact there is that it is very easy to manually recover from that. But it's painful if your db server restarts when you're not around...


Re: [CORE] postpone next week's release

From
Stephen Frost
Date:
* Magnus Hagander (magnus@hagander.net) wrote:
> On Fri, May 29, 2015 at 9:46 PM, Stephen Frost <sfrost@snowman.net> wrote:
>
> > * Tom Lane (tgl@sss.pgh.pa.us) wrote:
> > > Magnus Hagander <magnus@hagander.net> writes:
> > > > On Fri, May 29, 2015 at 9:32 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > > >> I think there's no way that we wait more than one additional week to push
> > > >> the fsync fix.  So the problem is not with scheduling the update releases,
> > > >> it's with whether we can also fit in a 9.5 beta release before PGCon.
> > >
> > > > I think 9.5 beta has to stand back. The question is what we do with the
> > > > potentially two minor releases. Then we can slot in the beta whenever.
> > >
> > > > If we do the minor as currently planned, can we do another one the week
> > > > after to deal with the multixact issues? (scheduling wise we're going to
> > > > have to do one the week after *regardless*, the question is if we can make
> > > > two different ones, or if we need to fold them into one)
> > >
> > > I suppose we could, but it doubles the amount of release gruntwork
> > > involved, and it doesn't exactly make us look good to our users either.
> >
> > Agreed.  Makes it look like we can't manage to figure out our bugs and
> > put fixes for them together in sensible releases..
> >
>
> The flipside of that is that we have a fix for a bug that's preventing people's
> databases from starting, and we're then intentionally delaying the shipment
> of it. Though I guess a mitigating fact there is that it is very easy to
> manually recover from that. But it's painful if your db server restarts
> when you're not around...

And we have *another* fix for a *data corruption* bug which is coming in
the following *week*.

Yes, I think delaying a week to get both in is better than putting out a
fix for one bug when we *know* there's a data corruption bug sitting in
that code, and we're putting out a fix for it the following week.

If we were talking about a month-long delay, that'd be one thing, but
that isn't the impression I've got about what we're talking about.
Thanks!
    Stephen

Re: [CORE] postpone next week's release

From
Bruce Momjian
Date:
On Fri, May 29, 2015 at 03:32:57PM -0400, Tom Lane wrote:
> I know Josh doesn't like to do beta1 releases concurrently with back
> branches because it confuses the PR messaging.  But we could make an
> exception perhaps; or do all those releases the same week but announce
> the beta the day after the bugfix releases.
> 
> Or we just let the beta slide till after PGCon, but then I think we're
> missing some excitement factor.

I am unclear if we are anywhere near ready for beta1 even in June.  Are
we?

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +



Re: [CORE] postpone next week's release

From
Stephen Frost
Date:
* Bruce Momjian (bruce@momjian.us) wrote:
> On Fri, May 29, 2015 at 03:32:57PM -0400, Tom Lane wrote:
> > I know Josh doesn't like to do beta1 releases concurrently with back
> > branches because it confuses the PR messaging.  But we could make an
> > exception perhaps; or do all those releases the same week but announce
> > the beta the day after the bugfix releases.
> >
> > Or we just let the beta slide till after PGCon, but then I think we're
> > missing some excitement factor.
>
> I am unclear if we are anywhere near ready for beta1 even in June.  Are
> we?

I'm all about having that discussion...  but can we do it on another
thread or at least wait til we've decided about the back-branch
releases?  They are clearly the more important issue to consider.
Thanks!
    Stephen

Re: [CORE] postpone next week's release

From
Tom Lane
Date:
Stephen Frost <sfrost@snowman.net> writes:
> * Bruce Momjian (bruce@momjian.us) wrote:
>> I am unclear if we are anywhere near ready for beta1 even in June.  Are
>> we?

> I'm all about having that discussion...  but can we do it on another
> thread or at least wait til we've decided about the back-branch
> releases?  They are clearly the more important issue to consider.

It's the same discussion though, ie what releases are we expecting to
get out in the next couple of weeks.

It's possible that we ought to give up on a pre-conference beta.
Certainly a whole lot of time that I'd hoped would go into reviewing
9.5 feature commits has instead gone into back-branch bug chasing this
week.
        regards, tom lane



Re: [CORE] postpone next week's release

From
Stephen Frost
Date:
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> It's possible that we ought to give up on a pre-conference beta.
> Certainly a whole lot of time that I'd hoped would go into reviewing
> 9.5 feature commits has instead gone into back-branch bug chasing this
> week.

I guess that's what I'm getting at.  We need to take care of the
back-branches and that means pushing beta back.  I fully expect a good
discussion on when to release beta when we get closer on that, but we're
not going to be close while we have outstanding big back-branch bugs.
Thanks!
    Stephen

Re: [CORE] postpone next week's release

From
Bruce Momjian
Date:
On Fri, May 29, 2015 at 04:01:00PM -0400, Tom Lane wrote:
> Stephen Frost <sfrost@snowman.net> writes:
> > * Bruce Momjian (bruce@momjian.us) wrote:
> >> I am unclear if we are anywhere near ready for beta1 even in June.  Are
> >> we?
> 
> > I'm all about having that discussion...  but can we do it on another
> > thread or at least wait til we've decided about the back-branch
> > releases?  They are clearly the more important issue to consider.
> 
> It's the same discussion though, ie what releases are we expecting to
> get out in the next couple of weeks.

Agreed.  If we want to put out beta1 before PGCon, I need to start on
the release notes on Monday.

> It's possible that we ought to give up on a pre-conference beta.
> Certainly a whole lot of time that I'd hoped would go into reviewing
> 9.5 feature commits has instead gone into back-branch bug chasing this
> week.

Based on what has transpired in the past two weeks, I am thinking we
need to move _slower_, not faster.  I am concerned we have focused so
much on new features that we have taken our eye off of reliability.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +



Re: [CORE] postpone next week's release

From
"Joshua D. Drake"
Date:
On 05/29/2015 01:03 PM, Stephen Frost wrote:
> * Tom Lane (tgl@sss.pgh.pa.us) wrote:
>> It's possible that we ought to give up on a pre-conference beta.
>> Certainly a whole lot of time that I'd hoped would go into reviewing
>> 9.5 feature commits has instead gone into back-branch bug chasing this
>> week.
>
> I guess that's what I'm getting at.  We need to take care of the
> back-branches and that means pushing beta back.

+1

JD


-- 
The most kicking donkey PostgreSQL Infrastructure company in existence.
The oldest, the most experienced, the consulting company to the stars.
Command Prompt, Inc. http://www.commandprompt.com/ +1 -503-667-4564 -
24x7 - 365 - Proactive and Managed Professional Services!



Re: [CORE] postpone next week's release

From
Robert Haas
Date:
On Fri, May 29, 2015 at 4:01 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> It's possible that we ought to give up on a pre-conference beta.
> Certainly a whole lot of time that I'd hoped would go into reviewing
> 9.5 feature commits has instead gone into back-branch bug chasing this
> week.

I'm personally kind of astonished that we're even thinking about beta
so soon.  I mean, we at least need to go through the stuff listed
here, I think:

https://wiki.postgresql.org/wiki/PostgreSQL_9.5_Open_Items

The bigger issue is: what's NOT on that list that should be?  I think
we need to devote some cycles to figuring that out, and I sure haven't
had any this week.

In any case, I think the negative PR that we're going to get from not
getting this multixact stuff taken care of is going to far outweigh
any positive PR from getting 9.5beta1 out a little sooner, especially
if 9.5beta1 is bug-ridden because we gave it no time to settle.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [CORE] postpone next week's release

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> I'm personally kind of astonished that we're even thinking about beta
> so soon.  I mean, we at least need to go through the stuff listed
> here, I think:
> https://wiki.postgresql.org/wiki/PostgreSQL_9.5_Open_Items

Well, maybe we ought to call it an alpha not a beta, but I think we ought
to put out some kind of release that we can encourage people to test.
What you are suggesting is that we serialize resolution of the known
issues with discovery of new issues, and that's not an efficient use of
time.  Especially seeing that we're approaching the summer season where
we won't get much input at all.
        regards, tom lane



Re: [CORE] postpone next week's release

From
Andres Freund
Date:
On 2015-05-29 16:37:00 -0400, Tom Lane wrote:
> Well, maybe we ought to call it an alpha not a beta, but I think we ought
> to put out some kind of release that we can encourage people to test.

I also do think it's important that we put out a beta (or alpha)
relatively soon. Both because we actually need input to find out what
works and what doesn't and also because it pushes us to tie up loose
ends.

A beta with open items isn't that bad a thing? There's many bigger
projects doing 4-8 beta releases before a major one; and most of them
have open items at the individual beta's release times.

I think we should define/document it so that there's no hard goal of
being compatible for beta releases and that the compatibility goal
starts with the first release candidate, and not the betas.



Re: [CORE] postpone next week's release

From
Bruce Momjian
Date:
On Fri, May 29, 2015 at 11:04:59PM +0200, Andres Freund wrote:
> On 2015-05-29 16:37:00 -0400, Tom Lane wrote:
> > Well, maybe we ought to call it an alpha not a beta, but I think we ought
> > to put out some kind of release that we can encourage people to test.
> 
> I also do think it's important that we put out a beta (or alpha)
> relatively soon. Both because we actually need input to find out what
> works and what doesn't and also because it pushes us to tie up loose
> ends.
> 
> A beta with open items isn't that bad a thing? There's many bigger
> projects doing 4-8 beta releases before a major one; and most of them
> have open items at the individual beta's release times.
> 
> I think we should define/document it so that there's no hard goal of
> being compatible for beta releases and that the compatibility goal
> starts with the first release candidate, and not the betas.

Do we need release notes for an alpha?  Once I do the release notes, it
is possible to miss subtle changes in the code that aren't mentioned in
commit messages.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +



Re: [CORE] postpone next week's release

From
Andres Freund
Date:
On May 29, 2015 2:12:24 PM PDT, Bruce Momjian <bruce@momjian.us> wrote:
>On Fri, May 29, 2015 at 11:04:59PM +0200, Andres Freund wrote:
>> On 2015-05-29 16:37:00 -0400, Tom Lane wrote:
>> > Well, maybe we ought to call it an alpha not a beta, but I think we ought
>> > to put out some kind of release that we can encourage people to test.
>> 
>> I also do think it's important that we put out a beta (or alpha)
>> relatively soon. Both because we actually need input to find out what
>> works and what doesn't and also because it pushes us to tie up loose
>> ends.
>> 
>> A beta with open items isn't that bad a thing? There's many bigger
>> projects doing 4-8 beta releases before a major one; and most of them
>> have open items at the individual beta's release times.
>> 
>> I think we should define/document it so that there's no hard goal of
>> being compatible for beta releases and that the compatibility goal
>> starts with the first release candidate, and not the betas.
>
>Do we need release notes for an alpha?  Once I do the release notes, it
>is possible to miss subtle changes in the code that aren't mentioned in
>commit messages.

Yes I think so. Otherwise it's pretty useless for people not following closely. I see little point in explicitly delaying release note work any further.

Andres



--- 
Please excuse brevity and formatting - I am writing this on my mobile phone.



Re: [CORE] postpone next week's release

From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> Do we need release notes for an alpha?  Once I do the release notes, it
> is possible to miss subtle changes in the code that aren't mentioned in
> commit messages.

If the commit message isn't clear about something, you'd likely miss the
issue anyway, no?  Anyway, once the release notes are in the tree, we
could expect that anyone committing a user-visible semantics change should
update the release notes themselves.
        regards, tom lane



Re: [CORE] postpone next week's release

From
Robert Haas
Date:
On Fri, May 29, 2015 at 4:37 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> I'm personally kind of astonished that we're even thinking about beta
>> so soon.  I mean, we at least need to go through the stuff listed
>> here, I think:
>> https://wiki.postgresql.org/wiki/PostgreSQL_9.5_Open_Items
>
> Well, maybe we ought to call it an alpha not a beta, but I think we ought
> to put out some kind of release that we can encourage people to test.
> What you are suggesting is that we serialize resolution of the known
> issues with discovery of new issues, and that's not an efficient use of
> time.  Especially seeing that we're approaching the summer season where
> we won't get much input at all.

Well, I think we ought to take at least a few weeks to try to do a bit
of code review and clean up what we can from the open items list.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [CORE] postpone next week's release

From
Bruce Momjian
Date:
On Fri, May 29, 2015 at 05:37:13PM -0400, Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > Do we need release notes for an alpha?  Once I do the release notes, it
> > is possible to miss subtle changes in the code that aren't mentioned in
> > commit messages.
> 
> If the commit message isn't clear about something, you'd likely miss the
> issue anyway, no?  Anyway, once the release notes are in the tree, we

I often do research in the git tree to get details on the feature beyond
just looking at the commit or the patch.

> could expect that anyone committing a user-visible semantics change should
> update the release notes themselves.

Yes, that would be nice.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +



Re: [CORE] postpone next week's release

From
Andres Freund
Date:
On 2015-05-29 18:02:36 -0400, Robert Haas wrote:
> Well, I think we ought to take at least a few weeks to try to do a bit
> of code review and clean up what we can from the open items list.

Why? A large portion of the input required to go from beta towards a
release is from actual users. To see when things break, what confuses
them and such.

I don't see why that requires that there are no minor entries in the
open items list - and that's what currently is on it.  Neither does it
seem to be a problem to do code review concurrently to user beta
testing.  We obviously can't start a beta if things crash left and
right, but I don't think that's the situation right now?



Re: [CORE] postpone next week's release

From
Stephen Frost
Date:
* Andres Freund (andres@anarazel.de) wrote:
> On 2015-05-29 18:02:36 -0400, Robert Haas wrote:
> > Well, I think we ought to take at least a few weeks to try to do a bit
> > of code review and clean up what we can from the open items list.
>
> Why? A large portion of the input required to go from beta towards a
> release is from actual users. To see when things break, what confuses
> them and such.
>
> I don't see why that requires that there are no minor entries in the
> open items list - and that's what currently is on it.  Neither does it
> seem to be a problem to do code review concurrently to user beta
> testing.  We obviously can't start a beta if things crash left and
> right, but I don't think that's the situation right now?

Agreed.
Thanks!
    Stephen

Re: [CORE] postpone next week's release

From
Robert Haas
Date:
On Fri, May 29, 2015 at 6:33 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2015-05-29 18:02:36 -0400, Robert Haas wrote:
>> Well, I think we ought to take at least a few weeks to try to do a bit
>> of code review and clean up what we can from the open items list.
>
> Why? A large portion of the input required to go from beta towards a
> release is from actual users. To see when things break, what confuses
> them and such.

I have two concerns:

1. I'm concerned that once we release beta, any idea about reverting a
feature or fixing something that is broken will get harder, because
people will say "well, we can't do that after we've released a beta".
I confess to particularly wanting a solution to the item listed as
"custom-join has no way to construct Plan nodes of child Path nodes",
the history of which I'll avoid recapitulating until I'm sure I can do
it while maintaining my blood pressure at safe levels.

2. Also, if we're going to make significant multixact-related changes
to 9.5 to try to improve reliability, as you proposed on the other
thread, then it would be nice to do that before beta, so that it gets
tested.  Of course, someone is bound to point out that we could make
those changes in time for beta2, and people could test that.  But in
practice I think that'll just mean that stuff is only out there for
let's say 2 months before we put it in a major release, which ain't
much.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [CORE] postpone next week's release

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, May 29, 2015 at 6:33 PM, Andres Freund <andres@anarazel.de> wrote:
>> Why? A large portion of the input required to go from beta towards a
>> release is from actual users. To see when things break, what confuses
>> them and such.

> I have two concerns:

> 1. I'm concerned that once we release beta, any idea about reverting a
> feature or fixing something that is broken will get harder, because
> people will say "well, we can't do that after we've released a beta".
> I confess to particularly wanting a solution to the item listed as
> "custom-join has no way to construct Plan nodes of child Path nodes",
> the history of which I'll avoid recapitulating until I'm sure I can do
> it while maintaining my blood pressure at safe levels.

> 2. Also, if we're going to make significant multixact-related changes
> to 9.5 to try to improve reliability, as you proposed on the other
> thread, then it would be nice to do that before beta, so that it gets
> tested.  Of course, someone is bound to point out that we could make
> those changes in time for beta2, and people could test that.  But in
> practice I think that'll just mean that stuff is only out there for
> let's say 2 months before we put it in a major release, which ain't
> much.

I think your position is completely nuts.  The GROUPING SETS code is
desperately in need of testing.  The custom-plan code is desperately
in need of fixing and testing.  The multixact code is desperately
in need of testing.  The open-items list has several other problems
besides those.  All of those problems are independent.  If we insist
on tackling them serially rather than in parallel, 9.5 might not come
out till 2017.

I agree that we are not in a position to promise features won't change.
So let's call it an alpha not a beta --- but for heaven's sake let's
try to move forward on all these issues, not just some of them.
        regards, tom lane



Re: [CORE] postpone next week's release

From
Andres Freund
Date:
On May 29, 2015 8:56:40 PM PDT, Robert Haas <robertmhaas@gmail.com> wrote:
>On Fri, May 29, 2015 at 6:33 PM, Andres Freund <andres@anarazel.de> wrote:
>> On 2015-05-29 18:02:36 -0400, Robert Haas wrote:
>>> Well, I think we ought to take at least a few weeks to try to do a bit
>>> of code review and clean up what we can from the open items list.
>>
>> Why? A large portion of the input required to go from beta towards a
>> release is from actual users. To see when things break, what confuses
>> them and such.
>
>I have two concerns:
>
>1. I'm concerned that once we release beta, any idea about reverting a
>feature or fixing something that is broken will get harder, because
>people will say "well, we can't do that after we've released a beta".
>I confess to particularly wanting a solution to the item listed as
>"custom-join has no way to construct Plan nodes of child Path nodes",
>the history of which I'll avoid recapitulating until I'm sure I can do
>it while maintaining my blood pressure at safe levels.

I think we should just document that this is a beta and that changes are to be expected. And have a release candidate once that's not the case.

I agree that it'd be very good if the custom join issue gets fixed. But I don't see a beta prohibiting it. Independently from that I'm going to ask a Citus colleague to make sure that pg-shard can use this.


>2. Also, if we're going to make significant multixact-related changes
>to 9.5 to try to improve reliability, as you proposed on the other
>thread, then it would be nice to do that before beta, so that it gets
>tested.  Of course, someone is bound to point out that we could make
>those changes in time for beta2, and people could test that.  But in
>practice I think that'll just mean that stuff is only out there for
>let's say 2 months before we put it in a major release, which ain't
>much.


There seems to be enough other stuff in dire need of testing that I don't think that's sufficient cause, even though I understand the sentiment.

Andres

--- 
Please excuse brevity and formatting - I am writing this on my mobile phone.



Re: [CORE] postpone next week's release

From
Andres Freund
Date:
On May 29, 2015 9:08:07 PM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>I think your position is completely nuts. 

Yeehaa.

> The GROUPING SETS code is
>desperately in need of testing.  The custom-plan code is desperately
>in need of fixing and testing.  The multixact code is desperately
>in need of testing.  

And the array/plpgsql changes and upsert, and...

Andres

--- 
Please excuse brevity and formatting - I am writing this on my mobile phone.



Re: [CORE] postpone next week's release

From
Noah Misch
Date:
On Fri, May 29, 2015 at 04:01:00PM -0400, Tom Lane wrote:
> Stephen Frost <sfrost@snowman.net> writes:
> > * Bruce Momjian (bruce@momjian.us) wrote:
> >> I am unclear if we are anywhere near ready for beta1 even in June.  Are
> >> we?
> 
> > I'm all about having that discussion...  but can we do it on another
> > thread or at least wait til we've decided about the back-branch
> > releases?  They are clearly the more important issue to consider.
> 
> It's the same discussion though, ie what releases are we expecting to
> get out in the next couple of weeks.

+1 for Stephen's thought to decide about back-branch releases first and to
Magnus's sentiment upthread that beta has to stand back while we schedule
them.  In other words, the feedback between these two scheduling decisions
ought to be one-way: bringing today's supported branches to a state we can be
content about deserves first pick from the calendar.



Re: [CORE] postpone next week's release

From
Bruce Momjian
Date:
On Sat, May 30, 2015 at 12:08:07AM -0400, Tom Lane wrote:
> desperately in need of testing.  The custom-plan code is desperately
> in need of fixing and testing.  The multixact code is desperately
> in need of testing.  The open-items list has several other problems
> besides those.  All of those problems are independent.  If we insist
> on tackling them serially rather than in parallel, 9.5 might not come
> out till 2017.

2017?  Really?  Is there any need for that hyperbole?  

Frankly, based on how I feel now, I would have no problem doing 9.5 in
2016 and saying we have a lot of retooling to do.  We could say we have
gotten too far out ahead of ourselves and we need to regroup and
restructure the code.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +



Re: [CORE] postpone next week's release

From
Bruce Momjian
Date:
On Sat, May 30, 2015 at 08:56:53AM -0400, Bruce Momjian wrote:
> On Sat, May 30, 2015 at 12:08:07AM -0400, Tom Lane wrote:
> > desperately in need of testing.  The custom-plan code is desperately
> > in need of fixing and testing.  The multixact code is desperately
> > in need of testing.  The open-items list has several other problems
> > besides those.  All of those problems are independent.  If we insist
> > on tackling them serially rather than in parallel, 9.5 might not come
> > out till 2017.
> 
> 2017?  Really?  Is there any need for that hyperbole?  
> 
> Frankly, based on how I feel now, I would have no problem doing 9.5 in
> 2016 and saying we have a lot of retooling to do.  We could say we have
> gotten too far out ahead of ourselves and we need to regroup and
> restructure the code.

Actually, barrelling ahead to get releases out is how we got into this
mess in the first place.  I would vote we put the 9.5 release on hold
while we do an honest assessment of where we are.  In hindsight, we
should have known to do this even before 9.4 was released.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +



Re: [CORE] postpone next week's release

From
Robert Haas
Date:
On Sat, May 30, 2015 at 12:08 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I think your position is completely nuts.  The GROUPING SETS code is
> desperately in need of testing.  The custom-plan code is desperately
> in need of fixing and testing.  The multixact code is desperately
> in need of testing.  The open-items list has several other problems
> besides those.  All of those problems are independent.  If we insist
> on tackling them serially rather than in parallel, 9.5 might not come
> out till 2017.

If that means it's stable, +1 from me.

I dispute, on every level, the notion that not releasing a beta means
that we can't work on things in parallel.  We can work on all of the
things on the open items list in parallel right now.  We can also
test.  And in fact, we should test.  It's entirely appropriate to test
our own stuff before we ask other people to test it.  It's also
appropriate to fix the things that we already know are broken before
we ask other people to test it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [CORE] postpone next week's release

From
"Joshua D. Drake"
Date:
On 05/30/2015 06:11 AM, Bruce Momjian wrote:

>> 2017?  Really?  Is there any need for that hyperbole?
>>
>> Frankly, based on how I feel now, I would have no problem doing 9.5 in
>> 2016 and saying we have a lot of retooling to do.  We could say we have
>> gotten too far out ahead of ourselves and we need to regroup and
>> restructure the code.
>
> Actually, barrelling ahead to get releases out is how we got into this
> mess in the first place.  I would vote we put the 9.5 release on hold
> while we do an honest assessment of where we are.  In hindsight, we
> should have known to do this even before 9.4 was released.
>

It seems that we all are forgetting one of the fundamental concepts of 
open source development:

Q. When will release X be?
A. When it is done.

A delay because of quality concerns shows the integrity of the project.

Sincerely,

JD


-- 
Command Prompt, Inc. - http://www.commandprompt.com/  503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Announcing "I'm offended" is basically telling the world you can't
control your own emotions, so everyone else should do it for you.



Re: [CORE] postpone next week's release

From
Bruce Momjian
Date:
On Sat, May 30, 2015 at 10:06:52AM -0400, Robert Haas wrote:
> If that means it's stable, +1 from me.
> 
> I dispute, on every level, the notion that not releasing a beta means
> that we can't work on things in parallel.  We can work on all of the
> things on the open items list in parallel right now.  We can also
> test.  And in fact, we should test.  It's entirely appropriate to test
> our own stuff before we ask other people to test it.  It's also
> appropriate to fix the things that we already know are broken before
> we ask other people to test it.

Let me share something that people have told me privately but don't want
to state publicly (at least with attribution), and that is that we have
seen great increases in feature development (often funded), without a
corresponding increase in development efforts focused on stability.  The
fact Alvaro has had to almost single-handedly fix multi-xact bugs until
very recently is testament to that.

The bottom line is that we just can't keep going on like this.  The fact
we put out a release two weeks ago, then need to put out a fix release
for that, but we have more multi-xact bugs to fix and can't decide if we
should do one or two minor releases, and are pushing out an alpha of 9.5
because we know we aren't ready for a beta, just confirms my analysis.

I hate to be the bearer of bad news, but I think bad news is what we
must face.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +



Re: [CORE] postpone next week's release

From
Robert Haas
Date:
On Sat, May 30, 2015 at 11:45 AM, Bruce Momjian <bruce@momjian.us> wrote:
> On Sat, May 30, 2015 at 10:06:52AM -0400, Robert Haas wrote:
>> If that means it's stable, +1 from me.
>>
>> I dispute, on every level, the notion that not releasing a beta means
>> that we can't work on things in parallel.  We can work on all of the
>> things on the open items list in parallel right now.  We can also
>> test.  And in fact, we should test.  It's entirely appropriate to test
>> our own stuff before we ask other people to test it.  It's also
>> appropriate to fix the things that we already know are broken before
>> we ask other people to test it.
>
> Let me share something that people have told me privately but don't want
> to state publicly (at least with attribution), and that is that we have
> seen great increases in feature development (often funded), without a
> corresponding increase in development efforts focused on stability.  The
> fact Alvaro has had to almost single-handedly fix multi-xact bugs until
> very recently is testament to that.

It's clear - at least to me - that we need to put more resources into
stabilizing the new multixact system. This is killing us.  If we can't
stabilize this, people will go use some other database.

Equally importantly, we need to make sure that we never release
something comparably broken ever again.  And that's why I'm not
sanguine about shipping what we've got without adequate reflection.

What, in this release, could break things badly?  RLS? Grouping sets?
Heikki's WAL format changes?  That last one sounds really scary to me;
it's painful if not impossible to fix the WAL format in a minor
release.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [CORE] postpone next week's release

From
Peter Geoghegan
Date:
On Sat, May 30, 2015 at 5:56 AM, Bruce Momjian <bruce@momjian.us> wrote:
> Frankly, based on how I feel now, I would have no problem doing 9.5 in
> 2016 and saying we have a lot of retooling to do.  We could say we have
> gotten too far out ahead of ourselves and we need to regroup and
> restructure the code.

I wouldn't mind doing that, but I think it's premature to conclude
that it's necessary to wait quite that long to release.

-- 
Peter Geoghegan



Re: [CORE] postpone next week's release

From
Peter Geoghegan
Date:
On Sat, May 30, 2015 at 11:10 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Let me share something that people have told me privately but don't want
>> to state publicly (at least with attribution), and that is that we have
>> seen great increases in feature development (often funded), without a
>> corresponding increase in development efforts focused on stability.  The
>> fact Alvaro has had to almost single-handedly fix multi-xact bugs until
>> very recently is testament to that.
>
> It's clear - at least to me - that we need to put more resources into
> stabilizing the new multixact system. This is killing us.  If we can't
> stabilize this, people will go use some other database.

+1. I don't grok the MultiXact code as some people do, but even still,
I think problems have been ongoing for so long now that we must change
course. FWIW, my perception from afar is that the problems haven't
really tapered off, and we'd be better off taking a fresh approach.

> Equally importantly, we need to make sure that we never release
> something comparably broken ever again.  And that's why I'm not
> sanguine about shipping what we've got without adequate reflection.

As you said, there was a failure to appreciate the interactions with
VACUUM. That should have made us more introspective about what we
didn't know and couldn't know during 9.3 development, but it
didn't.

> What, in this release, could break things badly?  RLS? Grouping sets?
> Heikki's WAL format changes?  That last one sounds really scary to me;
> it's painful if not impossible to fix the WAL format in a minor
> release.

I think we actually have learned some lessons here. MultiXacts were a
somewhat unusual case for a couple of reasons that I need not rehash.

In contrast, Heikki's WAL format changes (just for example) are
fundamentally just a restructuring to the existing format. Sure, there
could be bugs, but I think that it's fundamentally different to the
9.3 MultiXact stuff, in that the MultiXact stuff appears to be
stubbornly difficult to stabilize over months and years. That feels
like something that is unlikely to be true for anything that made it
into 9.5.
-- 
Peter Geoghegan



Re: [CORE] postpone next week's release

From
Andres Freund
Date:
Hi Bruce, Everyone,

On 2015-05-30 11:45:59 -0400, Bruce Momjian wrote:
> Let me share something that people have told me privately but don't want
> to state publicly (at least with attribution), and that is that we have
> seen great increases in feature development (often funded), without a
> corresponding increase in development efforts focused on stability.

Yes, I have seen and heard that too. What I think is also important is that
in turn our adoption has outpaced feature development (and thus
transitively stability work).

> The bottom line is that we just can't keep going on like this.  The fact
> we put out a release two weeks ago, then need to put out a fix release
> for that, but we have more multi-xact bugs to fix and can't decide if we
> should do one or two minor releases, and are pushing out an alpha of 9.5
> because we know we aren't ready for a beta, just confirms my analysis.

I don't think that alone confirms very much.

> I hate to be the bearer of bad news, but I think bad news is what we
> must face.

Well, the question is what we do with that observation. Personally I
think it's not a new one. This point has been made repeatedly, including
at most if not all developer meetings I attended. I definitely had
conversations around it both in person, on IM and on list.


I don't think it's primarily a problem of lack of review; although that
is a large problem.  I think the biggest systematic problem is that the
compound complexity of postgres has increased dramatically over the
years.  Features have added complexity little by little, each
incrementally not looking that bad.  But very little has been done to
manage complexity. Since 8.0 the codesize has roughly doubled, but
little has been done to manage the increased complexity. Few new
abstractions have been introduced and the structure of the code is
largely the same.

As a somewhat extreme example, let's look at StartupXLOG(). In 8.0 it
was ~500 LOC, in master it's ~1500.  The interactions in 8.0 were
complex, they have gotten much more complex since.  It fulfills lots of
different roles, all in one function:

(roughly in the order things happen, but simplified)
* Read the control file/determine whether we crashed
* recovery.conf handling
* backup label handling
* tablespace map handling (huh, I missed that this was added directly to StartupXLOG. What a bad idea)
* Determine whether we're doing archive recovery, read the relevant checkpoint if so
* relcache init file removal
* timeline switch handling
* Loading the checkpoint we're starting from
* Initialization of a lot of subsystems
* crash recovery/replay
  * Including pgstat, unlogged table, exported snapshot handling
  * iff hot standby, some more subsystems are initialized here
  * hot standby state handling
  * replay process initialization
  * crash replay itself, including
    * progress tracking
    * recovery pause handling
    * nextxid tracking
    * timeline increase handling
    * hot standby state handling
  * unlogged relations handling
  * archive recovery handling
  * creation/initialization of the end of recovery checkpoint
  * timeline increment if failover
* subsystem initialization iff !hot_standby
* end of recovery actions

Yes, that's one routine. And, to make things even funnier, half of that
routine isn't exercised by our tests.

You can argue that this is an outlier, but I don't think so. Heapam, the
planner, etc. have similar cases.

And I think this, to some degree, explains a lot of the multixact
problems. While there were a few "simple bugs", most of them were
interactions between the various subsystems that are rather intricate.


So, I think we have built up a lot of technical debt. And very little
effort has been made to fix that; and in the cases where people have, the
reception has often been cool, because refactoring things obviously will
destabilize in the short term, even if it fixes problems in the long
term.  I don't think that's sustainable.

We can't improve the situation by just delaying the 9.5 release or
something like that. We need to actively work on making the codebase
easier to understand and better tested. But that is actual development
work, and shouldn't happen at the tail end of a release.


Regards,

Andres



Re: [CORE] postpone next week's release

From
Andres Freund
Date:
On 2015-05-30 14:10:36 -0400, Robert Haas wrote:
> It's clear - at least to me - that we need to put more resources into
> stabilizing the new multixact system. This is killing us.  If we can't
> stabilize this, people will go use some other database.

I agree. Perhaps I don't see things quite as direly, but then I didn't
just spend weeks on the issue. I remember that I was incredibly
frustrated around 9.3.2 because I'd spent weeks on fixing issues around
this and it just never seemed to stop.

> Equally importantly, we need to make sure that we never release
> something comparably broken ever again.  And that's why I'm not
> sanguine about shipping what we've got without adequate reflection.

I think you're inferring something wrong here. A beta/alpha *is* getting
feedback on how good/bad things are. It's just one source of such
information, but we don't have that many others.

As explained in the email I sent before this, I think a lot of the
problems come from too complex code (with barely any testing). But we're
not going to be able to clean this up in 9.5. This will be a longer term
effort.

If we, without further changes, decide to let the release slip to, say,
Q1 2016, the only thing that'll happen is that 9.6 will have
larger, more complex features. With barely any additional review and
testing done. There was very little, if any, additional testing/review
outside jsonb due to the 9.4 slippage.

I don't think the problems have much to do with the release schedule.

> What, in this release, could break things badly?

> RLS?

Mostly localized to users of the feature. Niche use case.

> Grouping sets?

Few changes to code unless grouping sets are used.

> Heikki's WAL format changes?

Yes, that's quite invasive. On the other hand, I can't think of another
feature that had as much invested in tooling to detect problems.

What's more:
* Upsert - it's probably the most complex feature in 9.5. It's quite localized though.
* The locking changes, a good amount of potential for subtle problems
* The signal handling, sinval, client communication changes. Little to
  none problems so far, but it's complex stuff. These changes are an
  example of potential for problems due to changes to reduce
  complexity...

Greetings,

Andres Freund



Re: [CORE] postpone next week's release

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> * The signal handling, sinval, client communication changes. Little to
>   none problems so far, but it's complex stuff. These changes are an
>   example of potential for problems due to changes to reduce
>   complexity...

As far as that goes, it's quite clear from the buildfarm that the
atomics stuff is not very stable on non-mainstream architectures.
        regards, tom lane



Re: [CORE] postpone next week's release

From
Andres Freund
Date:
On May 30, 2015 2:19:00 PM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>Andres Freund <andres@anarazel.de> writes:
>> * The signal handling, sinval, client communication changes. Little to
>>   none problems so far, but it's complex stuff. These changes are an
>>   example of potential for problems due to changes to reduce
>>   complexity...
>
>As far as that goes, it's quite clear from the buildfarm that the
>atomics stuff is not very stable on non-mainstream architectures.

Is that the case? So far it seems to primarily be a problem of the, old, barrier emulation being buggy (non reentrant). And that being visible due to the new barrier in the latch code.

I'd not be surprised if there were more bugs, don't get me wrong, this is highly platform dependent stuff.


--- 
Please excuse brevity and formatting - I am writing this on my mobile phone.



Re: [CORE] postpone next week's release

From
David Steele
Date:
On 5/30/15 2:10 PM, Robert Haas wrote:
> What, in this release, could break things badly?  RLS? Grouping sets?
> Heikki's WAL format changes?  That last one sounds really scary to me;
> it's painful if not impossible to fix the WAL format in a minor
> release.

I would argue Heikki's WAL stuff is a perfect case for releasing a
public alpha/beta soon.  I'd love to test PgBackRest with an "official"
9.5dev build.  The PgBackRest test suite has lots of tests that run on
versions 8.3+ and might well shake out any bugs that are lying around.

In fact, I've added a new feature based on monitoring the thread and I'm
interested to see how that pans out.

--
- David Steele
david@pgmasters.net


Re: [CORE] postpone next week's release

From
"Joshua D. Drake"
Date:
On 05/30/2015 03:48 PM, David Steele wrote:
> On 5/30/15 2:10 PM, Robert Haas wrote:
>> What, in this release, could break things badly?  RLS? Grouping sets?
>> Heikki's WAL format changes?  That last one sounds really scary to me;
>> it's painful if not impossible to fix the WAL format in a minor
>> release.
>
> I would argue Heikki's WAL stuff is a perfect case for releasing a
> public alpha/beta soon.  I'd love to test PgBackRest with an "official"
> 9.5dev build.  The PgBackRest test suite has lots of tests that run on
> versions 8.3+ and might well shake out any bugs that are lying around.

You are right. Clone git, run it nightly automated and please, please 
report anything you find. There is no reason for a tagged release for 
that. Consider it a custom, purpose built, build-test farm.

Sincerely,

JD



-- 
Command Prompt, Inc. - http://www.commandprompt.com/  503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Announcing "I'm offended" is basically telling the world you can't
control your own emotions, so everyone else should do it for you.



Re: [CORE] postpone next week's release

From
David Steele
Date:
On 5/30/15 8:38 PM, Joshua D. Drake wrote:
>
> On 05/30/2015 03:48 PM, David Steele wrote:
>> On 5/30/15 2:10 PM, Robert Haas wrote:
>>> What, in this release, could break things badly?  RLS? Grouping sets?
>>> Heikki's WAL format changes?  That last one sounds really scary to me;
>>> it's painful if not impossible to fix the WAL format in a minor
>>> release.
>>
>> I would argue Heikki's WAL stuff is a perfect case for releasing a
>> public alpha/beta soon.  I'd love to test PgBackRest with an "official"
>> 9.5dev build.  The PgBackRest test suite has lots of tests that run on
>> versions 8.3+ and might well shake out any bugs that are lying around.
>
> You are right. Clone git, run it nightly automated and please, please
> report anything you find. There is no reason for a tagged release for
> that. Consider it a custom, purpose built, build-test farm.

Sure - I can write code to do that.  But then why release a beta at all?

--
- David Steele
david@pgmasters.net


Re: [CORE] postpone next week's release

From
Bruce Momjian
Date:
On Sat, May 30, 2015 at 12:26:11PM -0700, Peter Geoghegan wrote:
> On Sat, May 30, 2015 at 5:56 AM, Bruce Momjian <bruce@momjian.us> wrote:
> > Frankly, based on how I feel now, I would have no problem doing 9.5 in
> > 2016 and saying we have a lot of retooling to do.  We could say we have
> > gotten too far out ahead of ourselves and we need to regroup and
> > restructure the code.
> 
> I wouldn't mind doing that, but I think it's premature to conclude
> that it's necessary to wait quite that long to release.

I agree it probably wouldn't take until 2016, but if it does take until
2016, we have to be fine with that.  What I am saying is we can't just
continue to focus on hitting target dates and assume everything will be
fine, because it isn't.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +



Re: [CORE] postpone next week's release

From
"Joshua D. Drake"
Date:
On 05/30/2015 06:51 PM, David Steele wrote:
> On 5/30/15 8:38 PM, Joshua D. Drake wrote:
>>
>> On 05/30/2015 03:48 PM, David Steele wrote:
>>> On 5/30/15 2:10 PM, Robert Haas wrote:
>>>> What, in this release, could break things badly?  RLS? Grouping sets?
>>>> Heikki's WAL format changes?  That last one sounds really scary to me;
>>>> it's painful if not impossible to fix the WAL format in a minor
>>>> release.
>>>
>>> I would argue Heikki's WAL stuff is a perfect case for releasing a
>>> public alpha/beta soon.  I'd love to test PgBackRest with an "official"
>>> 9.5dev build.  The PgBackRest test suite has lots of tests that run on
>>> versions 8.3+ and might well shake out any bugs that are lying around.
>>
>> You are right. Clone git, run it nightly automated and please, please
>> report anything you find. There is no reason for a tagged release for
>> that. Consider it a custom, purpose built, build-test farm.
>
> Sure - I can write code to do that.  But then why release a beta at all?

1. Continuous testing (especially automated) is a great thing (see 
Buildfarm)

2. The rules for patches change a bit when we move to Beta

3. We may be able to fix a problem now (or soon) that you might catch 
before Beta.

Sincerely,

J

-- 
Command Prompt, Inc. - http://www.commandprompt.com/  503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Announcing "I'm offended" is basically telling the world you can't
control your own emotions, so everyone else should do it for you.



Re: postpone next week's release

From
"David G. Johnston"
Date:
On Saturday, May 30, 2015, Bruce Momjian <bruce@momjian.us> wrote:
> On Sat, May 30, 2015 at 12:26:11PM -0700, Peter Geoghegan wrote:
> > On Sat, May 30, 2015 at 5:56 AM, Bruce Momjian <bruce@momjian.us> wrote:
> > > Frankly, based on how I feel now, I would have no problem doing 9.5 in
> > > 2016 and saying we have a lot of retooling to do.  We could say we have
> > > gotten too far out ahead of ourselves and we need to regroup and
> > > restructure the code.
> >
> > I wouldn't mind doing that, but I think it's premature to conclude
> > that it's necessary to wait quite that long to release.
>
> I agree it probably wouldn't take until 2016, but if it does take until
> 2016, we have to be fine with that.  What I am saying is we can't just
> continue to focus on hitting target dates and assume everything will be
> fine, because it isn't.


On a slightly tangential note: I'm not prepared to defend doing so but it seems worth at least considering whether we should continue supporting 9.0 beyond this October.

I don't think it should be de-supported until at least a couple of 9.5 point releases have been found to be stable.

David J. 

Re: [CORE] postpone next week's release

From
Bruce Momjian
Date:
On Sat, May 30, 2015 at 10:47:27PM +0200, Andres Freund wrote:
> > The bottom line is that we just can't keep going on like this.  The fact
> > we put out a release two weeks ago, then need to put out a fix release
> > for that, but we have more multi-xact bugs to fix and can't decide if we
> > should do one or two minor releases, and are pushing out an alpha of 9.5
> > because we know we aren't ready for a beta, just confirms my analysis.
> 
> I don't think that alone confirms very much.

Huh?  In what world is that release timeline ever reasonable?  It points
to a serious problem.

> > I hate to be the bearer of bad news, but I think bad news is what we
> > must face.
> 
> Well, the question is what we do with that observation. Personally I
> think it's not a new one. This point has been made repeatedly, including
> at most if not all developer meetings I attended. I definitely had
> conversations around it both in person, on IM and on list.

Well, I think we stop what we are doing, focus on restructuring,
testing, and reviewing areas that historically have had problems, and
when we are done, we can look to go to 9.5 beta.  What we don't want to
do is to push out more code and get back into a
whack-a-bug-as-they-are-found mode, which obviously did not serve us well
for multi-xact, and which is what releasing a beta will do, and of
course, more commit-fests, and more features.  

If we have to totally stop feature development until we are all happy
with the code we have, so be it.  If people feel they have to get into
cleanup mode or they will never get to add a feature to Postgres again,
so be it.  If people say, heh, I am not going to do anything and just
come back when cleanup is done (by someone else), then we will end up
with a smaller but more dedicated development team, and I am fine with
that too.  I am suggesting that until everyone is happy with the code we
have, we should not move forward.  Forget 9.5 feature testing --- we
don't even have 9.3 and 9.4 working to my satisfaction yet, and I bet
others share my opinion.  We do not want to look back on this period and
say _this_ is when Postgres lost its reputation for reliability, and
when other databases took that reputation from us.

> I don't think it's primarily a problem of lack of review; although that
> is a large problem.  I think the biggest systematic problem is that the
> compound complexity of postgres has increased dramatically over the
> years.  Features have added complexity little by little, each
> incrementally not looking that bad.  But very little has been done to
> manage complexity. Since 8.0 the codesize has roughly doubled, but
> little has been done to manage the increased complexity. Few new
> abstractions have been introduced and the structure of the code is
> largely the same.
> 
> As a somewhat extreme example, let's look at StartupXLOG(). In 8.0 it
> was ~500 LOC, in master it's ~1500.  The interactions in 8.0 were
> complex, they have gotten much more complex since.  It fulfills lots of
> different roles, all in one function:

Yep, great place to start our work.

> So, I think we have built up a lot of technical debt. And very little
> effort has been made to fix that; and in the cases where people have the
> reception has often been cool, because refactoring things obviously will
> destabilize in the short term, even if it fixes problems in the long
> term.  I don't think that's sustainable.

Agreed.

> We can't improve the situation by just delaying the 9.5 release or
> something like that. We need to actively work on making the codebase
> easier to understand and better tested. But that is actual development
> work, and shouldn't happen at the tail end of a release.

It should start right now, and then, once we are happy with our code, we
can take periodic breaks to revisit the exact issues you describe.  What
I am saying is that we shouldn't wait until after 9.5 beta or after 9.5
final, or after the next commitfest or whatever.  We have already waited
too long to do this.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +



Re: [CORE] postpone next week's release

From
Michael Paquier
Date:
On Sun, May 31, 2015 at 11:48 AM, Bruce Momjian wrote:
> On Sat, May 30, 2015 at 10:47:27PM +0200, Andres Freund wrote:
>> So, I think we have built up a lot of technical debt. And very little
>> effort has been made to fix that; and in the cases where people have the
>> reception has often been cool, because refactoring things obviously will
>> destabilize in the short term, even if it fixes problems in the long
>> term.  I don't think that's sustainable.
>
> Agreed.

+1. Complexity has increased, and we are actually never 100% sure
that a given bug fix does not have side effects on other things, hence
I think that a portion of this technical debt is the lack of
regression test coverage, for both existing features and platforms
(like Windows). The thing is that complexity has increased, but for
example for many features we lack test coverage, thinking mainly
replication-related stuff here. Of course we will never get to a level
of 100% of confidence with just the test coverage and the buildfarm,
but we should at least try to get closer to such a goal.

Those are things I am really willing to work on in the very short term,
for what it's worth (of course not only that, as
reviewing/refactoring/testing existing things is damn important as
well). Now, improving the test coverage requires new
infrastructure, so those are new features, and that's perhaps not
something for 9.5, unless we consider that this is part of the
technical debt accumulated over the years. Honestly I think it is.

>> We can't improve the situation by just delaying the 9.5 release or
>> something like that. We need to actively work on making the codebase
>> easier to understand and better tested. But that is actual development
>> work, and shouldn't happen at the tail end of a release.
>
> It should start right now, and then, once we are happy with our code, we
> can take periodic breaks to revisit the exact issues you describe.  What
> I am saying is that we shouldn't wait until after 9.5 beta or after 9.5
> final, or after the next commitfest or whatever.  We have already waited
> too long to do this.

Definitely.
-- 
Michael



Re: [CORE] postpone next week's release

From
Robert Haas
Date:
On Sat, May 30, 2015 at 3:46 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> What, in this release, could break things badly?  RLS? Grouping sets?
>> Heikki's WAL format changes?  That last one sounds really scary to me;
>> it's painful if not impossible to fix the WAL format in a minor
>> release.
>
> I think we actually have learned some lessons here. MultiXacts were a
> somewhat unusual case for a couple of reasons that I need not rehash.
>
> In contrast, Heikki's WAL format changes (just for example) are
> fundamentally just a restructuring to the existing format. Sure, there
> could be bugs, but I think that it's fundamentally different to the
> 9.3 MultiXact stuff, in that the MultiXact stuff appears to be
> stubbornly difficult to stabilize over months and years. That feels
> like something that is unlikely to be true for anything that made it
> into 9.5.

I hope you're right.  But I don't think any of us foresaw just how bad
the MultiXact thing was likely to be either.

In fact, I think to some extent we may STILL be in denial about how bad it is.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [CORE] postpone next week's release

From
Bruce Momjian
Date:
On Sun, May 31, 2015 at 08:15:38PM +0900, Michael Paquier wrote:
> On Sun, May 31, 2015 at 11:48 AM, Bruce Momjian wrote:
> > On Sat, May 30, 2015 at 10:47:27PM +0200, Andres Freund wrote:
> >> So, I think we have built up a lot of technical debt. And very little
> >> effort has been made to fix that; and in the cases where people have the
> >> reception has often been cool, because refactoring things obviously will
> >> destabilize in the short term, even if it fixes problems in the long
> >> term.  I don't think that's sustainable.
> >
> > Agreed.
> 
> +1. Complexity has increased, and we are actually never at 100% sure
> that a given bug fix does not have side effects on other things, hence
> I think that a portion of this technical debt is the lack of
> regression test coverage, for both existing features and platforms
> (like Windows). The thing is that complexity has increased, but for
> example for many features we lack test coverage, thinking mainly
> replication-related stuff here. Of course we will never get to a level
> of 100% of confidence with just the test coverage and the buildfarm,
> but we should at least try to get closer to such a goal.

FYI, I realize that one additional thing that has discouraged code
reorganization is the additional backpatch overhead.  I think we now
need to accept that our reorganization-averse approach might have cost
us some reliability, and that reorganization is going to add work to
backpatching.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +



Re: [CORE] postpone next week's release

From
Bruce Momjian
Date:
On Sun, May 31, 2015 at 09:50:25AM -0400, Bruce Momjian wrote:
> > +1. Complexity has increased, and we are actually never at 100% sure
> > that a given bug fix does not have side effects on other things, hence
> > I think that a portion of this technical debt is the lack of
> > regression test coverage, for both existing features and platforms
> > (like Windows). The thing is that complexity has increased, but for
> > example for many features we lack test coverage, thinking mainly
> > replication-related stuff here. Of course we will never get to a level
> > of 100% of confidence with just the test coverage and the buildfarm,
> > but we should at least try to get closer to such a goal.
> 
> FYI, I realize that one additional thing that has discouraged code
> reorganization is the additional backpatch overhead.  I think we now
> need to accept that our reorganization-adverse approach might have cost
> us some reliability, and that reorganization is going to add work to
> backpatching.

Actually, code reorganization in HEAD might cause backpatching to be
more buggy, reducing reliability --- obviously we need to have a
discussion about that.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +



Re: [CORE] postpone next week's release

From
Noah Misch
Date:
On Sat, May 30, 2015 at 09:51:04PM -0400, David Steele wrote:
> On 5/30/15 8:38 PM, Joshua D. Drake wrote:
> > On 05/30/2015 03:48 PM, David Steele wrote:
> >> I would argue Heikki's WAL stuff is a perfect case for releasing a
> >> public alpha/beta soon.  I'd love to test PgBackRest with an "official"
> >> 9.5dev build.  The PgBackRest test suite has lots of tests that run on
> >> versions 8.3+ and might well shake out any bugs that are lying around.
> > 
> > You are right. Clone git, run it nightly automated and please, please
> > report anything you find. There is no reason for a tagged release for
> > that. Consider it a custom, purpose built, build-test farm.
> 
> Sure - I can write code to do that.  But then why release a beta at all?

It's largely for the benefit of folks planning manual, or otherwise high-cost,
testing.  If you budget for just one big test per year, make it a test of
beta1.  For inexpensive testing, you may as well ignore beta and test git
master daily or weekly.
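
The inexpensive "test git master daily" approach described above might be scripted roughly as follows. This is only a sketch: the scratch directory layout, the job count, and the helper names `nightly_plan` and `run_plan` are assumptions made for illustration, and a real setup would add the tool-specific test suite after `make check`.

```python
import shlex
import subprocess
from pathlib import Path

# Public PostgreSQL git repository.
REPO_URL = "https://git.postgresql.org/git/postgresql.git"


def nightly_plan(workdir, jobs=4):
    """Return (cwd, argv) steps for one build-and-test run of git master.

    workdir is a hypothetical scratch directory; nothing is executed here.
    """
    work = Path(workdir)
    src = work / "postgresql"
    prefix = work / "install"
    if src.exists():
        # Reuse an existing checkout and just fast-forward it.
        fetch = (src, ["git", "pull", "--ff-only"])
    else:
        fetch = (work, ["git", "clone", "--depth", "1", REPO_URL, str(src)])
    return [
        fetch,
        (src, ["./configure", f"--prefix={prefix}", "--enable-cassert"]),
        (src, ["make", "-j", str(jobs)]),
        (src, ["make", "check"]),
    ]


def run_plan(plan):
    """Execute each step in order, stopping at the first failure."""
    for cwd, argv in plan:
        print(f"[{cwd}] {shlex.join(argv)}")
        subprocess.run(argv, cwd=cwd, check=True)


# Usage from a nightly cron job (the path is an assumption):
#   run_plan(nightly_plan("/var/tmp/pg-nightly"))
```

Keeping the plan separate from its execution makes it easy to log or review the commands before anything is actually run.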



Re: [CORE] postpone next week's release

From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
>> FYI, I realize that one additional thing that has discouraged code
>> reorganization is the additional backpatch overhead.  I think we now
>> need to accept that our reorganization-adverse approach might have cost
>> us some reliability, and that reorganization is going to add work to
>> backpatching.

> Actually, code reorganization in HEAD might cause backpatching to be
> more buggy, reducing reliability --- obviously we need to have a
> discussion about that.

Commit 6b700301c36e380eb4972ab72c0e914cae60f9fd is a recent real example.
Not that that should dissuade us from ever doing any reorganizations,
but it's foolish to discount back-patching costs.
        regards, tom lane



Re: [CORE] postpone next week's release

From
David Steele
Date:
On 5/31/15 11:49 AM, Noah Misch wrote:
> On Sat, May 30, 2015 at 09:51:04PM -0400, David Steele wrote:
>> On 5/30/15 8:38 PM, Joshua D. Drake wrote:
>>> On 05/30/2015 03:48 PM, David Steele wrote:
>>>> I would argue Heikki's WAL stuff is a perfect case for releasing a
>>>> public alpha/beta soon.  I'd love to test PgBackRest with an "official"
>>>> 9.5dev build.  The PgBackRest test suite has lots of tests that run on
>>>> versions 8.3+ and might well shake out any bugs that are lying around.
>>>
>>> You are right. Clone git, run it nightly automated and please, please
>>> report anything you find. There is no reason for a tagged release for
>>> that. Consider it a custom, purpose built, build-test farm.
>>
>> Sure - I can write code to do that.  But then why release a beta at all?
>
> It's largely for the benefit of folks planning manual, or otherwise high-cost,
> testing.  If you budget for just one big test per year, make it a test of
> beta1.  For inexpensive testing, you may as well ignore beta and test git
> master daily or weekly.

I've gotten to the point of (relatively) high-cost coding/testing.  The
removal of checkpoint_segments and pause_on_recovery are leading to
refactoring of not only the regression tests but the actual backup
code.  9.5 and 8.3 are the only versions that require exceptions in the
code base.

I've already done basic testing against 9.5 by disabling certain tests.
Now I'm at the point where I need to start modifying code to take new
9.5 features/changes into account and make sure the regression tests
work for 8.3-9.5 with the fewest number of exceptions possible.

From the perspective of backup/restore testing, 9.5 has the most changes
since 9.0.  I'd like to know that the API at least is stable before
investing the time in new development.

Perhaps I'm just misunderstanding the nature of the discussion.

--
- David Steele
david@pgmasters.net
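
As an illustration of the kind of version exception mentioned above, a test harness could pick WAL-related settings by server version. This is a sketch under assumptions, not PgBackRest code: the function names are hypothetical, and the conversion uses the approximate formula from the 9.5 release notes, max_wal_size = (3 * checkpoint_segments) * 16MB.

```python
def wal_settings(version, checkpoint_segments=32):
    """Return checkpoint-related settings for a (major, minor) version tuple."""
    if version >= (9, 5):
        # 9.5 removed checkpoint_segments in favor of min_wal_size/max_wal_size.
        max_wal_mb = 3 * checkpoint_segments * 16
        return {"max_wal_size": f"{max_wal_mb}MB"}
    # 8.3 through 9.4 still accept checkpoint_segments.
    return {"checkpoint_segments": str(checkpoint_segments)}


def render_conf(settings):
    """Format settings as postgresql.conf lines, sorted by name."""
    return "\n".join(f"{name} = {value}" for name, value in sorted(settings.items()))


# Usage: one code path, one exception for 9.5 and later.
#   render_conf(wal_settings((9, 4)))
#   render_conf(wal_settings((9, 5)))
```

Concentrating each difference in one small function keeps the number of version checks scattered through the test suite low.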


Re: [CORE] postpone next week's release

From
Bruce Momjian
Date:
On Sun, May 31, 2015 at 11:55:44AM -0400, Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> >> FYI, I realize that one additional thing that has discouraged code
> >> reorganization is the additional backpatch overhead.  I think we now
> >> need to accept that our reorganization-adverse approach might have cost
> >> us some reliability, and that reorganization is going to add work to
> >> backpatching.
> 
> > Actually, code reorganization in HEAD might cause backpatching to be
> > more buggy, reducing reliability --- obviously we need to have a
> > discussion about that.
> 
> Commit 6b700301c36e380eb4972ab72c0e914cae60f9fd is a recent real example.
> Not that that should dissuade us from ever doing any reorganizations,
> but it's foolish to discount back-patching costs.

Yep.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +



Re: [CORE] postpone next week's release

From
Andres Freund
Date:
On 2015-05-31 11:55:44 -0400, Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> >> FYI, I realize that one additional thing that has discouraged code
> >> reorganization is the additional backpatch overhead.  I think we now
> >> need to accept that our reorganization-adverse approach might have cost
> >> us some reliability, and that reorganization is going to add work to
> >> backpatching.
> 
> > Actually, code reorganization in HEAD might cause backpatching to be
> > more buggy, reducing reliability --- obviously we need to have a
> > discussion about that.
> 
> Commit 6b700301c36e380eb4972ab72c0e914cae60f9fd is a recent real example.
> Not that that should dissuade us from ever doing any reorganizations,
> but it's foolish to discount back-patching costs.

On the other hand, that code is a complete maintenance nightmare. If
there weren't literally dozens of places that needed to be touched to
add a single parameter, it'd be far less likely for such a mistake to be
made. Right now significant portions of the file differ between the
branches, despite primarily minor feature additions...



Re: [CORE] postpone next week's release

From
Michael Paquier
Date:
On Sun, May 31, 2015 at 11:03 PM, Bruce Momjian <bruce@momjian.us> wrote:
> On Sun, May 31, 2015 at 09:50:25AM -0400, Bruce Momjian wrote:
>> > +1. Complexity has increased, and we are actually never at 100% sure
>> > that a given bug fix does not have side effects on other things, hence
>> > I think that a portion of this technical debt is the lack of
>> > regression test coverage, for both existing features and platforms
>> > (like Windows). The thing is that complexity has increased, but for
>> > example for many features we lack test coverage, thinking mainly
>> > replication-related stuff here. Of course we will never get to a level
>> > of 100% of confidence with just the test coverage and the buildfarm,
>> > but we should at least try to get closer to such a goal.
>>
>> FYI, I realize that one additional thing that has discouraged code
>> reorganization is the additional backpatch overhead.  I think we now
>> need to accept that our reorganization-adverse approach might have cost
>> us some reliability, and that reorganization is going to add work to
>> backpatching.
>
> Actually, code reorganization in HEAD might cause backpatching to be
> more buggy, reducing reliability --- obviously we need to have a
> discussion about that.

As a result, IMO all the folks gathering at PGCon (I won't be there,
sorry, but I read the MLs) should have a talk about that and define a
clear list of items to tackle in terms of reorganization for 9.5, and
then update this page:
https://wiki.postgresql.org/wiki/PostgreSQL_9.5_Open_Items
This does not prevent moving on with all the current items and
continuing to review existing features that have been pushed, of course.
-- 
Michael



Re: [CORE] postpone next week's release

From
Tom Lane
Date:
Magnus Hagander <magnus@hagander.net> writes:
> On Fri, May 29, 2015 at 8:02 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I think we should postpone next week's release.

> I'm a bit split on this.

> We *definitely* don't want to release the multixact fix without it being
> carefully reviewed, that's the part I'm not split about :) And I fully
> appreciate we can't have that done by Monday.

> However, the file-permission thing seems to hit quite a few people (have we
> ever had this many bug reports after a minor release?), which means we'd
> really want to get that out quickly.

After dithering over the weekend, the majority view on -core seems to be
that we should go ahead with making a release today for the fsync issue.
We'll plan another release next week, or whenever the dust seems to have
settled on the multixact issue(s).
        regards, tom lane



Re: [CORE] postpone next week's release

From
Jim Nasby
Date:
On 5/29/15 5:28 PM, Bruce Momjian wrote:
>> could expect that anyone committing a user-visible semantics change should
>> update the release notes themselves.
> Yes, that would be nice.

FWIW, I've always wondered why we don't create an empty next-version 
release notes as part of stamping a major release and expect patch 
authors to add to it. I realize that likely creates merge conflicts, but 
that seems less work than doing it all at the end. (Or maybe each patch 
just creates a file and the final process is pulling all the files 
together.)
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Data in Trouble? Get it in Treble! http://BlueTreble.com
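
The "each patch just creates a file" variant suggested above could be assembled with something like the following. It is a minimal sketch: the one-file-per-patch layout, the `.txt` extension, and both function names are assumptions for illustration, not an existing PostgreSQL process.

```python
from pathlib import Path


def merge_fragments(fragments):
    """Combine per-patch notes into one draft list, in filename order.

    fragments maps a fragment filename to its raw text; empty fragments
    are skipped and internal whitespace is collapsed.
    """
    lines = []
    for name in sorted(fragments):
        text = " ".join(fragments[name].split())
        if text:
            lines.append(f"* {text}")
    return "\n".join(lines)


def load_fragments(directory, pattern="*.txt"):
    """Read every fragment file in a (hypothetical) release-notes directory."""
    return {
        path.name: path.read_text(encoding="utf-8")
        for path in Path(directory).glob(pattern)
    }


# Usage: merge_fragments(load_fragments("doc/release-notes.d"))
```

Because each patch touches only its own file, two patches never edit the same lines, which avoids the merge conflicts a single shared notes file would cause. The output is still only a draft that needs an editorial pass.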



Re: [CORE] postpone next week's release

From
Tom Lane
Date:
Jim Nasby <Jim.Nasby@bluetreble.com> writes:
> FWIW, I've always wondered why we don't create an empty next-version 
> release notes as part of stamping a major release and expect patch 
> authors to add to it. I realize that likely creates merge conflicts, but 
> that seems less work than doing it all at the end. (Or maybe each patch 
> just creates a file and the final process is pulling all the files 
> together.)

There are good reasons to write the release notes all in one batch:
otherwise you don't get any uniformity of editorial style.
        regards, tom lane



Re: [CORE] postpone next week's release

From
Andres Freund
Date:
On 2015-06-01 12:32:21 -0400, Tom Lane wrote:
> There are good reasons to write the release notes all in one batch:
> otherwise you don't get any uniformity of editorial style.

I agree that that's a good reason for major releases. I do, however,
wonder whether it would be a good idea to do it differently for backpatched
bugfixes. It's imo a good thing to force committers to write a release
note at the same time they're backpatching. The memory is fresh, and
the commit message is more likely to contain pertinent details.




Re: [CORE] postpone next week's release

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2015-06-01 12:32:21 -0400, Tom Lane wrote:
>> There are good reasons to write the release notes all in one batch:
>> otherwise you don't get any uniformity of editorial style.

> I agree that that's a good reason for major releases, I do however
> wonder if it'd not be a good idea to do differently for backpatched
> bugfixes. It's imo a good thing to force committers to write a release
> notice at the same time they're backpatching. The memory is fresh, and
> the commit message is more likely to contain pertinent details.

We do expect committers to write commit log messages that contain
appropriate raw material for the release notes.  That's not the same
as expecting them to prepare an actual, sgml-marked-up, release note
entry that's in good English and occupies a reasonable amount of space
relative to other items.

Jim's point about merge problems is very pertinent as well.  In the
first place, if we had running release notes like that, they'd often
differ from one branch to the next, making back-patching rather annoying.
In the second place, SGML is so bulky that the patch context you'd be
working with would frequently look like not much more than
    </para>   </listitem>
   <listitem>    <para>

making it very easy for the hunks to be misapplied.

Lastly, we have recently adopted a practice of labeling release note
entries with the associated commit hashes.  I dunno how much value that
really has, but it would be entirely impossible to write such labels
in advance of pushing the fixes.
        regards, tom lane



Re: [CORE] postpone next week's release

From
Josh Berkus
Date:
All,

Just my $0.02 on PR: it has never been a PR problem to do multiple
update releases, as long as we could provide a good reason for doing so
(like: fix A is available now and we didn't want to hold it back waiting
for fix B).

It's always a practical question of (a) packaging and (b) deployment.
That is, we can get packager fatigue where some updates don't get
packaged, and we can get user fatigue where they start ignoring updates.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: [CORE] postpone next week's release

From
Noah Misch
Date:
On Sun, May 31, 2015 at 12:09:16PM -0400, David Steele wrote:
> On 5/31/15 11:49 AM, Noah Misch wrote:
> > On Sat, May 30, 2015 at 09:51:04PM -0400, David Steele wrote:
> >> Sure - I can write code to do that.  But then why release a beta at all?
> > 
> > It's largely for the benefit of folks planning manual, or otherwise high-cost,
> > testing.  If you budget for just one big test per year, make it a test of
> > beta1.  For inexpensive testing, you may as well ignore beta and test git
> > master daily or weekly.
> 
> I've gotten to the point of (relatively) high-cost coding/testing.  The
> removal of checkpoint_segments and pause_on_recovery are leading to
> refactoring of not only the regressions tests but the actual backup
> code.  9.5 and 8.3 are the only versions that require exceptions in the
> code base.
> 
> I've already done basic testing against 9.5 by disabling certain tests.
>  Now I'm at the point where I need to start modifying code to take new
> 9.5 features/changes into account and make sure the regression tests
> work for 8.3-9.5 with the fewest number of exceptions possible.

Release of beta1 is the cue for that sort of work.

> From the perspective of backup/restore testing, 9.5 has the most changes
> since 9.0.  I'd like to know that the API at least is stable before
> investing the time in new development.

Its API will be as good as pgsql-hackers could make it; beta1 is also a call
for help discovering API problems we overlooked.  Subsequent API changes are
usually reactions to beta test reports.



Restore-reliability mode

From
Noah Misch
Date:
Subject changed from "Re: [CORE] postpone next week's release".

On Sat, May 30, 2015 at 10:48:45PM -0400, Bruce Momjian wrote:
> Well, I think we stop what we are doing, focus on restructuring,
> testing, and reviewing areas that historically have had problems, and
> when we are done, we can look to go to 9.5 beta.  What we don't want to
> do is to push out more code and get back into a
> whack-a-bug-as-they-are-found mode, which obviously did not serve us well
> for multi-xact, and which is what releasing a beta will do, and of
> course, more commit-fests, and more features.  
> 
> If we have to totally stop feature development until we are all happy
> with the code we have, so be it.  If people feel they have to get into
> cleanup mode or they will never get to add a feature to Postgres again,
> so be it.  If people say, heh, I am not going to do anything and just
> come back when cleanup is done (by someone else), then we will end up
> with a smaller but more dedicated development team, and I am fine with
> that too.  I am suggesting that until everyone is happy with the code we
> have, we should not move forward.

I like the essence of this proposal.  Two suggestions.  We can't achieve or
even robustly measure "everyone is happy with the code," so let's pick
concrete exit criteria.  Given criteria framed like "Files A,B,C and patches
X,Y,Z have a sign-off from a committer other than their original committer."
anyone can monitor progress and find specific ways to contribute.  Second, I
would define the subject matter as "bug fixes, testing and review", not
"restructuring, testing and review."  Different code structures are clearest
to different hackers.  Restructuring, on average, adds bugs even more quickly
than feature development adds them.
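
A minimal sketch of how criteria of that shape could be tracked mechanically. The `ReviewItem` class, the `remaining` helper, and the sample names are hypothetical, and it assumes every recorded sign-off comes from a committer:

```python
from dataclasses import dataclass, field


@dataclass
class ReviewItem:
    """A file or patch that needs an independent sign-off (hypothetical model)."""
    name: str
    original_committer: str
    signoffs: set = field(default_factory=set)

    def is_signed_off(self) -> bool:
        # Only a sign-off from someone other than the original committer counts.
        return bool(self.signoffs - {self.original_committer})


def remaining(items):
    """Return the names of items that still lack an independent sign-off."""
    return [item.name for item in items if not item.is_signed_off()]


items = [
    ReviewItem("file_a.c", "alice", {"bob"}),   # independent sign-off: done
    ReviewItem("patch_x", "bob", {"bob"}),      # self sign-off does not count
    ReviewItem("patch_y", "carol"),             # no sign-off yet
]
print(remaining(items))  # ['patch_x', 'patch_y']
```

The exit condition is then simply an empty `remaining` list, which anyone can check at any time.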

Thanks,
nm



Re: Restore-reliability mode

From
Geoff Winkless
Date:
On 3 June 2015 at 14:50, Noah Misch <noah@leadboat.com> wrote:
> I would define the subject matter as "bug fixes, testing and review", not
> "restructuring, testing and review."  Different code structures are clearest
> to different hackers.  Restructuring, on average, adds bugs even more quickly
> than feature development adds them.

+1 to this. Rewriting or restructuring code because you don't trust it (even
though you have no reported real-world bugs) is a terrible idea.

Stopping all feature development to do it is even worse.

I know you're not talking about rewriting, but I think
http://www.joelonsoftware.com/articles/fog0000000069.html
is always worth a re-read, if only because it's funny :)

I would always 100% support a decision to push back new releases because of
bugfixes for *known* issues, but if you think you *might* be able to find bugs
in code you don't like, you should do that on your own time. Iff you find
actual bugs, *then* you talk about halting new releases.

Geoff

Re: Restore-reliability mode

From
Andres Freund
Date:
On 2015-06-03 09:50:49 -0400, Noah Misch wrote:
> Second, I would define the subject matter as "bug fixes, testing and
> review", not "restructuring, testing and review."  Different code
> structures are clearest to different hackers.  Restructuring, on
> average, adds bugs even more quickly than feature development adds
> them.

I can't agree with this. While I agree with not doing large
restructuring for 9.5, I think we can't afford not to refactor for
clarity, even if that introduces bugs. Noticeable parts of our code have
to frequently be modified for new features and are badly structured at
the same time. While restructuring may temporarily increase the
number of bugs in the short term, it'll decrease the number of bugs long
term while increasing the number of potential contributors and new
features.  That's obviously not to say we should just refactor for the
sake of it.



Re: Restore-reliability mode

From
"Joshua D. Drake"
Date:
On 06/03/2015 07:18 AM, Andres Freund wrote:
>
> On 2015-06-03 09:50:49 -0400, Noah Misch wrote:
>> Second, I would define the subject matter as "bug fixes, testing and
>> review", not "restructuring, testing and review."  Different code
>> structures are clearest to different hackers.  Restructuring, on
>> average, adds bugs even more quickly than feature development adds
>> them.
>
> I can't agree with this. While I agree with not doing large
> restructuring for 9.5, I think we can't afford not to refactor for
> clarity, even if that introduces bugs. Noticeable parts of our code have
> to frequently be modified for new features and are badly structured at
> the same time. While restructuring may temporarily increase the
> number of bugs in the short term, it'll decrease the number of bugs long
> term while increasing the number of potential contributors and new
> features.  That's obviously not to say we should just refactor for the
> sake of it.
>

Our project has been continuing to increase momentum over the last few 
years and our adoption has increased at an amazing rate. It is important 
to remember that we have users. These users have needs that must be met 
else those users will move on to a different technology.

I agree that we need to postpone this release. I also agree that there 
is likely re-factoring to be done. I have also never met a programmer 
who doesn't think something needs to be re-factored. The majority of 
programmers I know all suffer from NIH and want to change how things are 
implemented.

If we are going to re-factor, it should not be considered global and 
should be attacked with specific goals in mind. If those goals are not 
specifically defined and agreed on, we will get very pretty code with 
very little use for our users. Then our users will leave because they 
are busy waiting on us to re-factor.

In short, we must balance this effort with the needs of the code versus 
the needs of our users.

Sincerely,

JD

-- 
The most kicking donkey PostgreSQL Infrastructure company in existence.
The oldest, the most experienced, the consulting company to the stars.
Command Prompt, Inc. http://www.commandprompt.com/ +1 -503-667-4564 -
24x7 - 365 - Proactive and Managed Professional Services!



Re: [CORE] Restore-reliability mode

From
Josh Berkus
Date:
On 06/03/2015 06:50 AM, Noah Misch wrote:
> Subject changed from "Re: [CORE] postpone next week's release".
> 
> On Sat, May 30, 2015 at 10:48:45PM -0400, Bruce Momjian wrote:
>> If we have to totally stop feature development until we are all happy
>> with the code we have, so be it.  If people feel they have to get into
>> cleanup mode or they will never get to add a feature to Postgres again,
>> so be it.  If people say, heh, I am not going to do anything and just
>> come back when cleanup is done (by someone else), then we will end up
>> with a smaller but more dedicated development team, and I am fine with
>> that too.  I am suggesting that until everyone is happy with the code we
>> have, we should not move forward.
> 
> I like the essence of this proposal.  Two suggestions.  We can't achieve or
> even robustly measure "everyone is happy with the code," so let's pick
> concrete exit criteria.  Given criteria framed like "Files A,B,C and patches
> X,Y,Z have a sign-off from a committer other than their original committer."
> anyone can monitor progress and find specific ways to contribute.  Second, I
> would define the subject matter as "bug fixes, testing and review", not
> "restructuring, testing and review."  Different code structures are clearest
> to different hackers.  Restructuring, on average, adds bugs even more quickly
> than feature development adds them.

So, historically, this is what the period between feature freeze and
beta1 was for; the "consolidation" phase was supposed to deal with this.
The problem over the last few years, by my observation, has been that
consolidation has been left to just a few people (usually just Bruce &
Tom or Tom & Robert) and our code base is now much too large for that.

The way other projects deal with this is having continuous testing as
stuff comes in, and *more* testing than just our regression tests (e.g.
acceptance tests, integration tests, performance tests, etc.). So our
other issue has been that our code complexity has been growing faster
than our test suite.  Part of that is that this community has never
placed much value in automated testing or testers, so people who are
interested in it find other projects to contribute to.

I would argue that if we delay 9.5 in order to do a 100% manual review
of code, without adding any new automated tests or other non-manual
tools for improving stability, then it's a waste of time; we might as
well just release the beta, and our users will find more issues than we
will.  I am concerned that if we declare a cleanup period, especially in
the middle of the summer, all that will happen is that the project will
go to sleep for an extra three months.

I will also point out that there is a major adoption cost to delaying
9.5.   Right now users are excited about UPSERT, big data, and extra
JSON features. If they have to wait another 7 months, they'll be a lot
less excited, and we'll lose more potential users to the new databases
and the MySQL forks.  It could also delay the BDR project (Simon/Craig
can speak to this) which would suck.

Reliability of having a release every year is important as well as
database reliability ... and for a lot of the new webdev generation,
PostgreSQL is already the most reliable piece of software infrastructure
they use.  So if we're going to have a cleanup delay, then let's please
make it an *intensive* cleanup delay, with specific goals, milestones,
and a schedule.  Otherwise, don't bother.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: [CORE] Restore-reliability mode

From
Andres Freund
Date:
On 2015-06-03 10:21:28 -0700, Josh Berkus wrote:
> So, historically, this is what the period between feature freeze and
> beta1 was for; the "consolidation" phase was supposed to deal with this.
>  The problem over the last few years, by my observation, has been that
> consolidation has been left to just a few people (usually just Bruce &
> Tom or Tom & Robert) and our code base is now much too large for that.
> 
> The way other projects deal with this is having continuous testing as
> stuff comes in, and *more* testing than just our regression tests (e.g.
> acceptance tests, integration tests, performance tests, etc.). So our
> other issue has been that our code complexity has been growing faster
> than our test suite.  Part of that is that this community has never
> placed much value in automated testing or testers, so people who are
> interested in it find other projects to contribute to.
> 
> I would argue that if we delay 9.5 in order to do a 100% manual review
> of code, without adding any new automated tests or other non-manual
> tools for improving stability, then it's a waste of time; we might as
> well just release the beta, and our users will find more issues than we
> will.  I am concerned that if we declare a cleanup period, especially in
> the middle of the summer, all that will happen is that the project will
> go to sleep for an extra three months.
> 
> I will also point out that there is a major adoption cost to delaying
> 9.5.   Right now users are excited about UPSERT, big data, and extra
> JSON features. If they have to wait another 7 months, they'll be a lot
> less excited, and we'll lose more potential users to the new databases
> and the MySQL forks.  It could also delay the BDR project (Simon/Craig
> can speak to this) which would suck.
> 
> Reliability of having a release every year is important as well as
> database reliability ... and for a lot of the new webdev generation,
> PostgreSQL is already the most reliable piece of software infrastructure
> they use.  So if we're going to have a cleanup delay, then let's please
> make it an *intensive* cleanup delay, with specific goals, milestones,
> and a schedule.  Otherwise, don't bother.

+very many



Re: [CORE] postpone next week's release

From
Stefan Kaltenbrunner
Date:
On 05/31/2015 03:51 AM, David Steele wrote:
> On 5/30/15 8:38 PM, Joshua D. Drake wrote:
>>
>> On 05/30/2015 03:48 PM, David Steele wrote:
>>> On 5/30/15 2:10 PM, Robert Haas wrote:
>>>> What, in this release, could break things badly?  RLS? Grouping sets?
>>>> Heikki's WAL format changes?  That last one sounds really scary to me;
>>>> it's painful if not impossible to fix the WAL format in a minor
>>>> release.
>>>
>>> I would argue Heikki's WAL stuff is a perfect case for releasing a
>>> public alpha/beta soon.  I'd love to test PgBackRest with an "official"
>>> 9.5dev build.  The PgBackRest test suite has lots of tests that run on
>>> versions 8.3+ and might well shake out any bugs that are lying around.
>>
>> You are right. Clone git, run it nightly automated and please, please
>> report anything you find. There is no reason for a tagged release for
>> that. Consider it a custom, purpose built, build-test farm.
> 
> Sure - I can write code to do that.  But then why release a beta at all?

FWIW: we also carry "official" snapshots on the download site
(https://ftp.postgresql.org/pub/snapshot/dev/) that you could use if you
don't want to use git directly - those even receive some form of QA (for a
snapshot to be posted it is required to pass a full buildfarm run on the
buildbox).



Stefan



Re: [CORE] postpone next week's release

From
Heikki Linnakangas
Date:
On 05/30/2015 11:47 PM, Andres Freund wrote:
> I don't think it's primarily a problem of lack of review; although that
> is a large problem.  I think the biggest systematic problem is that the
> compound complexity of postgres has increased dramatically over the
> years.  Features have added complexity little by little, each not
> incrementally not looking that bad.  But very little has been done to
> manage complexity. Since 8.0 the codesize has roughly doubled, but
> little has been done to manage the increased complexity. Few new
> abstractions have been introduced and the structure of the code is
> largely the same.
>
> As a somewhat extreme example, let's look at StartupXLOG(). In 8.0 it
> was ~500 LOC, in master it's ~1500.  The interactions in 8.0 were
> complex, they have gotten much more complex since.  It fulfills lots of
> different roles, all in one function:
>
> (roughly in the order things happen, but simplified)
> * Read the control file/determine whether we crashed
> * recovery.conf handling
> * backup label handling
> * tablespace map handling (huh, I missed that this was added directly to
>    StartupXLOG. What a bad idea)
> * Determine whether we're doing archive recovery, read the relevant
>    checkpoint if so
> * relcache init file removal
> * timeline switch handling
> * Loading the checkpoint we're starting from
> * Initialization of a lot of subsystems
> * crash recovery/replay
>    * Including pgstat, unlogged table, exported snapshot handling
>    * iff hot standby, some more subsystems are initialized here
>    * hot standby state handling
>    * replay process initialization
>    * crash replay itself, including
>      * progress tracking
>      * recovery pause handling
>      * nextxid tracking
>      * timeline increase handling
>      * hot standby state handling
>    * unlogged relations handling
>    * archive recovery handling
>    * creation/initialization of the end of recovery checkpoint
>    * timeline increment if failover
> * subsystem initialization iff !hot_standby
> * end of recovery actions
>
> Yes. that's one routine. And, to make things even funnier, half of that
> routine isn't exercised by our tests.
>
> You can argue that this is an outlier, but I don't think so. Heapam, the
> planner, etc. have similar cases.
>
> And I think this, to some degree, explains a lot of the multixact
> problems. While there were a few "simple bugs", most of them were
> interactions between the various subsystems that are rather intricate.

I think this explanation is wrong. I agree that there are many places 
that would be good to refactor - like StartupXLOG() - but the multixact 
code was not too bad in that regard. IIRC the patch included some 
refactoring, it added some new helper functions in heapam.c, for 
example. You can argue that it didn't do enough of it, but that was not 
the big issue.

The big issue was at the architecture level. Basically, we liked 
vacuuming of XIDs and clog so much that we decided that it'd be nice if 
you had to vacuum multixids too, in order to not lose data. Many of the 
bugs and issues were not new - we had multixids before - but we upped 
the ante and turned minor locking bugs into data loss. And that had 
nothing to do with the code structure - we'd have similar issues if we 
had rewritten everything in Java, with the same design.

So, I'm all for refactoring and adding abstractions where it makes 
sense, but it's not going to solve design problems.

- Heikki




Re: [CORE] postpone next week's release

From
Andres Freund
Date:
On 2015-06-04 11:51:44 +0300, Heikki Linnakangas wrote:
> I think this explanation is wrong. I agree that there are many places that
> would be good to refactor - like StartupXLOG() - but the multixact code was
> not too bad in that regard. IIRC the patch included some refactoring, it
> added some new helper functions in heapam.c, for example. You can argue that
> it didn't do enough of it, but that was not the big issue.

Yea, but the bugs were more around the interactions to other parts of
the system. Like e.g. crash recovery, which now is about bug 7 or
so. And those are the ones that are hard to understand.

> The big issue was at the architecture level. Basically, we liked vacuuming
> of XIDs and clog so much that we decided that it'd be nice if you had to
> vacuum multixids too, in order to not lose data. Many of the bugs and issues
> were not new - we had multixids before - but we upped the ante and turned
> minor locking bugs into data loss. And that had nothing to do with the code
> structure - we'd have similar issues if we had rewritten everything in Java,
> with the same design.

I think we're probably just using slightly different terms here - for me
one very good way of fixing some structurally bad things *is* improving
the design.

If you look at the bugs around multixacts: The first few were around
ctid-chaining, hard to find and fix because there's about 8-10 places
implementing it with slight differences.  The next bunch were around
vacuuming, some of them oversights, a good bunch of them more
fundamental. Crash recovery wasn't thought about (lack of
testing/review), and more generally the new code tripped over bad old
decisions (hey, wraparound is ok!).  Then there were a bunch of stupid
bugs in crash-recovery (testing mainly), and larger scale bugs (hey, let's
access stuff during recovery).  Then there's the whole row level locking
code - which is by now among the hardest to understand code in
postgres - and voila it contained a bunch of oversights that were hard
to spot.

So yes, I think nicer code to work with would have prevented us from
making a significant portion of these. It might have also made us
realize earlier how significant the increase in complexity was.

> So, I'm all for refactoring and adding abstractions where it makes sense,
> but it's not going to solve design problems.

I personally don't really see the multixact changes being that bad on
the overall design. It pretty much just extended an earlier design. Now
that wasn't great, but I don't think too many people had realized that
at that point.  The biggest problem was underestimating the complexity.

Greetings,

Andres Freund



Re: [CORE] postpone next week's release

From
Heikki Linnakangas
Date:
On 06/04/2015 12:17 PM, Andres Freund wrote:
> On 2015-06-04 11:51:44 +0300, Heikki Linnakangas wrote:
>> So, I'm all for refactoring and adding abstractions where it makes sense,
>> but it's not going to solve design problems.
>
> I personally don't really see the multixact changes being that bad on
> the overall design. It pretty much just extended an earlier design. Now
> that wasn't great, but I don't think too many people had realized that
> at that point.  The biggest problem was underestimating the complexity.

Yeah, many of the issues were pre-existing, and would've been good to 
fix anyway.

The multixact issues remind me of another similar thing we did: the 
visibility map. It too was non-critical when it was first introduced, 
but later we started using it for index-only-scans, and it suddenly 
became important that it's up-to-date and crash-safe. We did uncover 
some bugs in that area when index-only-scans were introduced, similar to 
the multixact bugs, only not as bad because it didn't lead to data loss. 
I don't have any point to make with that comparison, but it was similar 
in many ways.

- Heikki




Re: [CORE] postpone next week's release

From
Simon Riggs
Date:
On 30 May 2015 at 05:08, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > On Fri, May 29, 2015 at 6:33 PM, Andres Freund <andres@anarazel.de> wrote:
> >> Why? A large portion of the input required to go from beta towards a
> >> release is from actual users. To see when things break, what confuses
> >> them and such.
>
> > I have two concerns:
>
> > 1. I'm concerned that once we release beta, any idea about reverting a
> > feature or fixing something that is broken will get harder, because
> > people will say "well, we can't do that after we've released a beta".
> > I confess to particularly wanting a solution to the item listed as
> > "custom-join has no way to construct Plan nodes of child Path nodes",
> > the history of which I'll avoid recapitulating until I'm sure I can do
> > it while maintaining my blood pressure at safe levels.
>
> > 2. Also, if we're going to make significant multixact-related changes
> > to 9.5 to try to improve reliability, as you proposed on the other
> > thread, then it would be nice to do that before beta, so that it gets
> > tested.  Of course, someone is bound to point out that we could make
> > those changes in time for beta2, and people could test that.  But in
> > practice I think that'll just mean that stuff is only out there for
> > let's say 2 months before we put it in a major release, which ain't
> > much.
>
> I think your position is completely nuts.  The GROUPING SETS code is
> desperately in need of testing.  The custom-plan code is desperately
> in need of fixing and testing.  The multixact code is desperately
> in need of testing.  The open-items list has several other problems
> besides those.  All of those problems are independent.  If we insist
> on tackling them serially rather than in parallel, 9.5 might not come
> out till 2017.
>
> I agree that we are not in a position to promise features won't change.
> So let's call it an alpha not a beta --- but for heaven's sake let's
> try to move forward on all these issues, not just some of them.

I think releasing 9.5 in some form NOW will aid its software quality.

We've never linked Beta release date to final release date, so if the quality proves to be as poor as some people think then the list of bugs will show that and we release later. 

AFAIK the beta period is exactly the time when we are allowed to pull features from the release. I welcome the idea that we test it; if it's stable and it works, we release it. If it doesn't, we pull it.

Not releasing our software yet making a list of our fears doesn't work towards a solution. Our fears will make us shout at each other too, so I for one would rather skip that part and do some practical actions.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [CORE] Restore-reliability mode

From
Stephen Frost
Date:
Josh,

* Josh Berkus (josh@agliodbs.com) wrote:
> I would argue that if we delay 9.5 in order to do a 100% manual review
> of code, without adding any new automated tests or other non-manual
> tools for improving stability, then it's a waste of time; we might as
> well just release the beta, and our users will find more issues than we
> will.  I am concerned that if we declare a cleanup period, especially in
> the middle of the summer, all that will happen is that the project will
> go to sleep for an extra three months.

This is the exact same concern that I have.  A delay just to have a
delay is not useful.  I completely agree that we need more automated
testing, etc, though getting all of that set up and running could be
done at any time too- there's no reason to wait, nor do I believe
delaying 9.5 would make such automated testing appear.
Thanks!
    Stephen

Re: [CORE] Restore-reliability mode

From
Craig Ringer
Date:


On 4 June 2015 at 22:43, Stephen Frost <sfrost@snowman.net> wrote:
> Josh,
>
> * Josh Berkus (josh@agliodbs.com) wrote:
> > I would argue that if we delay 9.5 in order to do a 100% manual review
> > of code, without adding any new automated tests or other non-manual
> > tools for improving stability, then it's a waste of time; we might as
> > well just release the beta, and our users will find more issues than we
> > will.  I am concerned that if we declare a cleanup period, especially in
> > the middle of the summer, all that will happen is that the project will
> > go to sleep for an extra three months.
>
> This is the exact same concern that I have.  A delay just to have a
> delay is not useful.  I completely agree that we need more automated
> testing, etc, though getting all of that set up and running could be
> done at any time too- there's no reason to wait, nor do I believe
> delaying 9.5 would make such automated testing appear.


In terms of specific testing improvements, things I think we need to have covered and runnable on the buildfarm are:

* pg_dump and pg_restore testing (because it's scary we don't do this)
* WAL archiving based warm standby testing with promotion
* Two node streaming replication with promotion, both with a slot and with archive fallback
* Three node cascading streaming replication with middle node promotion then tail end node promotion
* Logical decoding streaming testing, comparing to expected decoded output
* DDL deparse test coverage for all operations
* pg_basebackup + start up from backup
* hard-kill the postmaster, start up from crashed datadir
* pg_start_backup, rsync, pg_stop_backup, start up in hot standby
* disk exhaustion tests both for pg_xlog and for the main datadir, showing we can recover OK when disk is filled then space is freed
* Tests of crash recovery during various DDL operations

Obviously some of these overlap, so one test can cover more than one item.

Implementing these requires stepping outside the comfortable zone of pg_regress and the isolationtester and having something that can manage multiple data directories. It's also hard to be sure you're testing the same thing each time - for example, when using streaming replication with archive fallback, it might be tricky to ensure that your replica falls behind and falls back to WAL archive each time. There's always SIGSTOP I guess.

While these are multi-node tests, at least in PostgreSQL we can just run on different ports, so there's no need to muck about with containers or VMs.
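
A minimal sketch of that "different ports" idea, not tied to any existing test framework: reserve one free loopback TCP port per node and give each node its own data directory. The `Node` class and the `free_ports` and `plan_cluster` helpers are hypothetical names used only for illustration; running initdb and starting each server is deliberately left out.

```python
import socket
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Node:
    """One planned server instance in a multi-node test (hypothetical model)."""
    name: str
    port: int
    datadir: Path


def free_ports(count):
    """Return `count` distinct TCP ports that were free on 127.0.0.1.

    All sockets are held open until every port has been chosen, so the
    kernel cannot hand out the same port twice.
    """
    socks = []
    try:
        for _ in range(count):
            s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            s.bind(("127.0.0.1", 0))  # port 0 asks the OS for any free port
            socks.append(s)
        return [s.getsockname()[1] for s in socks]
    finally:
        for s in socks:
            s.close()


def plan_cluster(names, base_dir):
    """Pair each node name with its own port and data directory."""
    ports = free_ports(len(names))
    return [Node(name, port, Path(base_dir) / name)
            for name, port in zip(names, ports)]


for node in plan_cluster(["primary", "standby1", "standby2"], "/tmp/pgtest"):
    print(node.name, node.port, node.datadir)
```

One caveat with this approach: a port can be taken by another process between the time it is released here and the time a server binds to it, so a real harness would want to retry on a bind failure.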

I already run some of these tests using Ansible for BDR, but I don't imagine that'd be acceptable in core. It's Python, and it's not especially well suited to use as a regression testing framework, it's just what I had to hand and already needed for other automation tasks.

Is pg_tap a reasonable starting point for this sort of testing?

Am I missing obvious and important tests?

How would a test that would've caught the multixact issues look?

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [CORE] Restore-reliability mode

From
Michael Paquier
Date:
On Fri, Jun 5, 2015 at 8:53 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
>
>
> On 4 June 2015 at 22:43, Stephen Frost <sfrost@snowman.net> wrote:
>>
>> Josh,
>>
>> * Josh Berkus (josh@agliodbs.com) wrote:
>> > I would argue that if we delay 9.5 in order to do a 100% manual review
>> > of code, without adding any new automated tests or other non-manual
>> > tools for improving stability, then it's a waste of time; we might as
>> > well just release the beta, and our users will find more issues than we
>> > will.  I am concerned that if we declare a cleanup period, especially in
>> > the middle of the summer, all that will happen is that the project will
>> > go to sleep for an extra three months.
>>
>> This is the exact same concern that I have.  A delay just to have a
>> delay is not useful.  I completely agree that we need more automated
>> testing, etc, though getting all of that set up and running could be
>> done at any time too- there's no reason to wait, nor do I believe
>> delaying 9.5 would make such automated testing appear.
>>
>
> In terms of specific testing improvements, things I think we need to have
> covered and runnable on the buildfarm are:
>
> * pg_dump and pg_restore testing (because it's scary we don't do this)

We do test it in some way with pg_upgrade, using the set of objects that
are not removed by the regression test suite. Extension dumps are
not covered yet, though.

> * WAL archiving based warm standby testing with promotion
> * Two node streaming replication with promotion, both with a slot and with
> archive fallback
> * Three node cascading streaming replication with middle node promotion then
> tail end node promotion
> * Logical decoding streaming testing, comparing to expected decoded output
> * hard-kill the postmaster, start up from crashed datadir
> * pg_basebackup + start up from backup
> * pg_start_backup, rsync, pg_stop_backup, start up in hot standby
> * Tests of crash recovery during various DDL operations

Well, steps in this direction are the point of this patch, the
replication test suite:
https://commitfest.postgresql.org/5/197/
And this one, addition of Windows support for TAP tests:
https://commitfest.postgresql.org/5/207/

> * DDL deparse test coverage for all operations

What do you have in mind except what is already in objectaddress.sql
and src/test/modules/test_ddl_deparse/?

> * disk exhaustion tests both for pg_xlog and for the main datadir, showing
> we can recover OK when disk is filled then space is freed

This may be tricky. How would you emulate that?

> Is pg_tap a reasonable starting point for this sort of testing?

IMO, using the TAP machinery would be a good base for that. What is
lacking is a basic set of perl routines that one can easily use to set
up test scenarios.

> How would a test that would've caught the multixact issues look?

I have not followed closely those discussions, not sure about that.

Regards,
-- 
Michael



Re: [CORE] Restore-reliability mode

From
Simon Riggs
Date:
On 3 June 2015 at 18:21, Josh Berkus <josh@agliodbs.com> wrote:
> I would argue that if we delay 9.5 in order to do a 100% manual review
> of code, without adding any new automated tests or other non-manual
> tools for improving stability, then it's a waste of time; we might as
> well just release the beta, and our users will find more issues than we
> will.  I am concerned that if we declare a cleanup period, especially in
> the middle of the summer, all that will happen is that the project will
> go to sleep for an extra three months.

Agreed. Cleanup can occur while we release code for public testing.

Many eyeballs of Beta beats anything we can throw at it thru manual inspection. The whole problem of bugs is that they are mostly found by people trying to use the software. 
 
> I will also point out that there is a major adoption cost to delaying
> 9.5.   Right now users are excited about UPSERT, big data, and extra
> JSON features. If they have to wait another 7 months, they'll be a lot
> less excited, and we'll lose more potential users to the new databases
> and the MySQL forks.
>
> Reliability of having a release every year is important as well as
> database reliability ... and for a lot of the new webdev generation,
> PostgreSQL is already the most reliable piece of software infrastructure
> they use.  So if we're going to have a cleanup delay, then let's please
> make it an *intensive* cleanup delay, with specific goals, milestones,
> and a schedule.  Otherwise, don't bother.

We've decided previously that having a fixed annual schedule was a good thing for the project. Getting the features that work into the hands of the people that want them is very important.

Discussing halting the development schedule publicly is very damaging. 

If there are features in doubt, let's do more work on them or just pull them now and return to the schedule. I don't really care which ones get canned as long as we return to the schedule.

Whatever we do must be exact and measurable. If it's not, it means we haven't assembled enough evidence for action that is sufficiently directed to achieve the desired goal.


On 3 June 2015 at 18:21, Josh Berkus <josh@agliodbs.com> wrote:
 
  It could also delay the BDR project (Simon/Craig
can speak to this) which would suck.

Nothing being discussed here can/will slow down the BDR project since it is already a different thread of development. More so, 2ndQuadrant has zero income tied to the release of 9.5 or the commit of any feature, so as far as that company is concerned, the release could wait for 10 years.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Restore-reliability mode

From
Simon Riggs
Date:
On 3 June 2015 at 14:50, Noah Misch <noah@leadboat.com> wrote:
Subject changed from "Re: [CORE] postpone next week's release".

On Sat, May 30, 2015 at 10:48:45PM -0400, Bruce Momjian wrote:
> Well, I think we stop what we are doing, focus on restructuring,
> testing, and reviewing areas that historically have had problems, and
> when we are done, we can look to go to 9.5 beta.  What we don't want to
> do is to push out more code and get back into a
> wack-a-bug-as-they-are-found mode, which obviously did not serve us well
> for multi-xact, and which is what releasing a beta will do, and of
> course, more commit-fests, and more features.
>
> If we have to totally stop feature development until we are all happy
> with the code we have, so be it.  If people feel they have to get into
> cleanup mode or they will never get to add a feature to Postgres again,
> so be it.  If people say, heh, I am not going to do anything and just
> come back when cleanup is done (by someone else), then we will end up
> with a smaller but more dedicated development team, and I am fine with
> that too.  I am suggesting that until everyone is happy with the code we
> have, we should not move forward.

I like the essence of this proposal.  Two suggestions.  We can't achieve or
even robustly measure "everyone is happy with the code," so let's pick
concrete exit criteria.  Given criteria framed like "Files A,B,C and patches
X,Y,Z have a sign-off from a committer other than their original committer."
anyone can monitor progress and find specific ways to contribute.

I don't like the proposal, nor do I like the follow on comments made.

This whole idea of "feature development" vs reliability is bogus. It implies people that work on features don't care about reliability. Given the fact that many of the features are actually about increasing database reliability in the event of crashes and corruptions it just makes no sense.

How will we participate in cleanup efforts? How do we know when something has been "cleaned up", how will we measure our success or failure? I think we should be clear that wasting N months on cleanup can *fail* to achieve a useful objective. Without a clear plan it almost certainly will do so. The flip side is that wasting N months will cause great amusement and dancing amongst those people who wish to pull ahead of our open source project and we should take care not to hand them a victory from an overreaction.

Lastly, the idea that we allow developers to drift away and we're OK with that is just plain mad. I've spent a decade trying to grow the pool of skilled developers who can assist the project. Acting against that, in deed or just word, is highly counter productive for the project.

Let's just take a breath and think about this.

It is normal for us to spend a month or so consolidating our work. It is also normal for people that see major problems to call them out, effectively using the "Stop The Line" technique.   https://leanbuilds.wordpress.com/tag/stop-the-line/

So let's do our normal things, not do a "total stop" for an indefinite period. If someone has specific things that in their opinion need to be addressed, list them and we can talk about doing them, together. I thought that was what the Open Items list was for. Let's use it.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [CORE] Restore-reliability mode

From
Robert Haas
Date:
On Fri, Jun 5, 2015 at 2:50 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> Agreed. Cleanup can occur while we release code for public testing.

The code is available for public testing right now.  Stamping it a
beta implies that we think it's something fairly stable that we'd be
pretty happy to release if things go well, which is a higher bar to
clear.

I can't help noticing for all the drumbeat of "let's release 9.5 beta
now", activity to clean up the items on this list seems quite
sluggish:

https://wiki.postgresql.org/wiki/PostgreSQL_9.5_Open_Items

I've seen Tom and a few other people doing some work that I would
describe as useful pre-beta stabilization, but I think there is a good
bit more that could be done, and that list is a good starting point.
I hope to have time to do some myself, but right now I am busy trying
to stabilize 9.3, along with Alvaro, Noah, Andres, and Thomas Munro,
and PGCon is coming up in just over a week.  I think we could afford
to give ourselves at least until a few weeks following PGCon to tidy
up.

I do agree that an indefinite development freeze with unclear
parameters for resuming development and unclear goals is a bad plan.
But I think giving ourselves a little more time to, say, turn the
buildfarm consistently green, and, say, fix the known but
currently-unfixed multixact bugs, and, say, fix the known bugs in 9.5
features is a good plan, and I hope you and others will support it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [CORE] Restore-reliability mode

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Jun 5, 2015 at 2:50 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> Agreed. Cleanup can occur while we release code for public testing.

> The code is available for public testing right now.

Only to people who have the time and ability to pull the code from git
and build from source.  I don't know exactly what fraction of interested
testers that excludes, but I bet it's significant.  The point of producing
packages would be to remove that barrier to testing.

> Stamping it a
> beta implies that we think it's something fairly stable that we'd be
> pretty happy to release if things go well, which is a higher bar to
> clear.

So let's call it an alpha, or some other way of setting expectations
appropriately.  But I think it's silly to maintain that the code is not in
a state where end-user testing is useful.  They just have to understand
that they can't trust it with production data.

> I can't help noticing for all the drumbeat of "let's release 9.5 beta
> now", activity to clean up the items on this list seems quite
> sluggish:
> https://wiki.postgresql.org/wiki/PostgreSQL_9.5_Open_Items

While we need to work on those items, I do not agree that getting that
list to empty has to happen before we release a test version.  I think
serializing effort in that way is simply bad project management.  And
it's not how we've operated in the past either: getting the open items
list to empty has always been understood as a prerequisite to RC versions,
not to betas.

To get to specifics instead of generalities: exactly which of the current
open items do you think is so bad that it precludes user testing?  I do
not see a beta-blocker in the lot.
        regards, tom lane



Re: [CORE] Restore-reliability mode

From
Bruce Momjian
Date:
On Fri, Jun  5, 2015 at 07:50:31AM +0100, Simon Riggs wrote:
> On 3 June 2015 at 18:21, Josh Berkus <josh@agliodbs.com> wrote:
>  
> 
>     I would argue that if we delay 9.5 in order to do a 100% manual review
>     of code, without adding any new automated tests or other non-manual
>     tools for improving stability, then it's a waste of time; we might as
>     well just release the beta, and our users will find more issues than we
>     will.  I am concerned that if we declare a cleanup period, especially in
>     the middle of the summer, all that will happen is that the project will
>     go to sleep for an extra three months.
> 
> 
> Agreed. Cleanup can occur while we release code for public testing.
> 
> Many eyeballs of Beta beats anything we can throw at it thru manual inspection.
> The whole problem of bugs is that they are mostly found by people trying to use
> the software. 

Please address some of the specific issues I mentioned.  The problem
with the multi-xact case is that we just kept fixing bugs as people
found them, and did not do a holistic review of the code.  I am saying
let's not _keep_ doing that and let's make sure we don't have any
systematic problems in our code where we just keep fixing things without
doing a thorough analysis.

To release 9.5 beta would be to get back into that cycle, and I am not
sure we are ready for that.  I think the fact we have multiple people
all reviewing the multi-xact code now (and not dealing with 9.5) is a
good sign.  If we were focused on 9.5 beta, I doubt this would have
happened.

I am saying let's make sure we are not deficient in other areas, then
let's move forward again.  I would love to think we can do multiple
things at once, but for multi-xact, serious review didn't happen for 18
months, so if slowing release development is what is required, I support
it.

> We've decided previously that having a fixed annual schedule was a good thing
> for the project. Getting the features that work into the hands of the people
> that want them is very important.

Yes, but let's not be a slave to the schedule if our reliability is
suffering, which it clearly has in the past 18 months.

> Discussing halting the development schedule publicly is very damaging. 

Agreed.

> If there are features in doubt, let's do more work on them or just pull them now
> and return to the schedule. I don't really care which ones get canned as long
> as we return to the schedule.

Again, please address my concerns above.  This is not about 9.5
features, but rather our overall focus on schedule vs. reliability, and
your arguments are reinforcing my idea that we do not have the proper
balance here.

> Whatever we do must be exact and measurable. If it's not, it means we haven't
> assembled enough evidence for action that is sufficiently directed to achieve
> the desired goal.

Sure.  I think everyone agrees the multi-xact work is all good, so I am
asking what else needs this kind of research.  If there is nothing else,
we can move forward again --- I am just saying we need to ask the
reliability question _first_.

Let me restate something that has appeared in many replies to my ideas
--- I am not asking for infinite or unbounded review, but I am asking
that we make sure reliability gets the proper focus in relation to our
time pressures.  Our balance was so off a month ago that I feel only a
full stop on time pressure would allow us to refocus because people are
not good at focusing on multiple things. It is sometimes necessary to
stop everything to get people's attention, and to help them remember
that without reliability, a database is useless.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +



Re: [CORE] Restore-reliability mode

From
Simon Riggs
Date:
On 5 June 2015 at 15:00, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Jun 5, 2015 at 2:50 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> Agreed. Cleanup can occur while we release code for public testing.

The code is available for public testing right now. 

People test when they get the signal from us, not before. While what you say is literally correct, that is not the point.
 
Stamping it a
beta implies that we think it's something fairly stable that we'd be
pretty happy to release if things go well, which is a higher bar to
clear.

We don't have a clear definition of what Beta means. For me, Beta has always meant "trial software, please test".

I don't think anybody will say anything bad about us if we release a beta and then later pull some of the features because we are not confident with them when AFTER testing the feature is shown to be below our normal standard; that will bring us credit, I feel. It is extremely common in software development to defer some of the features if their goals aren't met, or to change APIs and interfaces based upon user feedback.

Making decisions on what will definitely be in a release BEFORE testing and feedback seems foolhardy and certainly not scientific.

None of this means I disagree with assessments of the current state of the software, I'm saying that we should simply follow the normal process and stick to the schedule we have previously agreed, for all of the reasons cited when we agreed it.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [CORE] Restore-reliability mode

From
Robert Haas
Date:
On Fri, Jun 5, 2015 at 10:23 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Fri, Jun 5, 2015 at 2:50 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>> Agreed. Cleanup can occur while we release code for public testing.
>
>> The code is available for public testing right now.
>
> Only to people who have the time and ability to pull the code from git
> and build from source.  I don't know exactly what fraction of interested
> testers that excludes, but I bet it's significant.  The point of producing
> packages would be to remove that barrier to testing.

Sure, I agree with that.

>> Stamping it a
>> beta implies that we think it's something fairly stable that we'd be
>> pretty happy to release if things go well, which is a higher bar to
>> clear.
>
> So let's call it an alpha, or some other way of setting expectations
> appropriately.  But I think it's silly to maintain that the code is not in
> a state where end-user testing is useful.  They just have to understand
> that they can't trust it with production data.

I don't maintain that end-user testing is unuseful at this point.  I
do maintain that it would be better to (1) finish fixing the known
multixact bugs and (2) clean up some of the open items before we make
a big push in that direction.  For example, consider this item from
the open items list:

http://www.postgresql.org/message-id/CAHGQGwEqWD=yNQE+ZojbpoxyWT3xLK52-V_q9S+XOfCKJd5egA@mail.gmail.com

Now this is a fundamental definitional issue about how RLS is supposed
to work.  I'm not going to deny that we COULD ship a release without
deciding what the behavior should be there, but I don't think it's a
good idea.  I am fine with the possibility that one of our new
features may, say, dump core someplace due to a NULL pointer dereference
we haven't found yet.  Such bugs can always exist, but they are easily
fixed once found.  But if we're not clear on how a feature is supposed
to behave, which seems to be the case here, I favor trying to resolve
that issue before shipping anything.  Otherwise, we're saying "test
this, even though the final version will likely work differently".
That's not really helpful for us and will discourage testers from
doing anything at all.

Going through the open items, the other ones that seem to involve
definitional changes are:

1. FPW compression leaks information - The usefulness of the GUC may
depend on its PGC_*-ness.  We should decide what we want to do before
asking people to test it.

2. custom-join has no way to construct Plan nodes of child Path nodes
- The entire feature is a C API, and the API needs to be changed.  We
should finalize the API before asking people to test whether they can
use it for interesting things.

3. recovery_target_action = pause & hot_standby = off - Rumor has it
we replaced one surprising behavior with a different but
equally-surprising behavior.  We should decide what the right thing is
and make sure the code is doing that before calling it a release.

4. Arguable RLS security bug, EvalPlanQual() paranoia - This seems
like another question of what the expectations around RLS actually
are.

I would also argue that we really ought to make a decision about
"basebackups during ALTER DATABASE ... SET TABLESPACE ... not safe"
before we get too close to final release.  Maybe it's not a
beta-blocker, exactly, but it doesn't seem like the sort of change
that should be rushed in too close to the end, because it looks sorta
complicated and scary.  (Those are the technical terms.)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [CORE] Restore-reliability mode

From
Simon Riggs
Date:
On 5 June 2015 at 15:00, Robert Haas <robertmhaas@gmail.com> wrote:
 
I do agree that an indefinite development freeze with unclear
parameters for resuming development and unclear goals is a bad plan.
But I think giving ourselves a little more time to, say, turn the
buildfarm consistently green, and, say, fix the known but
currently-unfixed multixact bugs, and, say, fix the known bugs in 9.5
features is a good plan, and I hope you and others will support it.

Yes, it's a good plan and I support that. That's just normal process.

If you mean we should allow that to stall the release of Beta then I disagree. The presence of bugs clearly has nothing to do with the discovery of new ones and we should be looking to discover as many as possible as quickly as possible.

I can understand the argument to avoid releasing Beta because of Dev Meeting, so we should aim for June 25th Beta 1.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [CORE] Restore-reliability mode

From
Andres Freund
Date:
On 2015-06-05 11:05:14 -0400, Bruce Momjian wrote:
> To release 9.5 beta would be to get back into that cycle, and I am not
> sure we are ready for that.  I think the fact we have multiple people
> all reviewing the multi-xact code now (and not dealing with 9.5) is a
> good sign.  If we were focused on 9.5 beta, I doubt this would have
> happened.

At least for me, the fact that I'm working on multixacts right now has
nothing to do with whether to beta or not to beta.

And I don't understand why releasing an alpha or beta would detract from
that right now. We need more people doing crazy shit with our codebase,
not fewer.

None of the master-only issues is a blocker for an alpha, so besides
some release work within the next two weeks I don't see what'd detract
us that much?

> I am saying let's make sure we are not deficient in other areas, then
> let's move forward again.

I don't think we actually can do that. The problem of the multixact
stuff is precisely that it looked so innocent that a bunch of
experienced people just didn't see the problem. Omniscience is easy in
hindsight.

> I would love to think we can do multiple things at once, but for
> multi-xact, serious review didn't happen for 18 months, so if slowing
> release development is what is required, I support it.

FWIW, I can stomach a week or four of doing bugfix only stuff. After
that I'm simply not going to be efficient at that anymore. And I
seriously doubt that I'm the only one like that. Doing the same thing
for weeks makes you miss obvious stuff.


I don't think anything as localized as 'do nothing but bugfixes for a
while and then carry on' actually will solve the problem. We need to
find and reallocate resources to put more emphasis on review, robustness
and refactoring in the long term, not do panick-y stuff short term. This
isn't a problem that can be solved by focusing on bugfixing for a week
or four.

That means we have to convince employers to actually *pay* us (people
experienced with the codebase) to do work on these kind of things
instead of much-easier-to-market new features. A lot of
review/robustness work has been essentially done in our spare time,
after long days. Which means the employers need to get more people.

> Sure.  I think everyone agrees the multi-xact work is all good, so I am
> asking what else needs this kind of research.  If there is nothing else,
> we can move forward again --- I am just saying we need to ask the
> reliability question _first_.

I'm starting to get grumpy here. You've called for review in lots of
emails now. Let's get going then?

Greetings,

Andres Freund



Re: [CORE] Restore-reliability mode

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> I don't maintain that end-user testing is unuseful at this point.  I
> do maintain that it would be better to (1) finish fixing the known
> multixact bugs and (2) clean up some of the open items before we make
> a big push in that direction.  For example, consider this item from
> the open items list:

> http://www.postgresql.org/message-id/CAHGQGwEqWD=yNQE+ZojbpoxyWT3xLK52-V_q9S+XOfCKJd5egA@mail.gmail.com

> Now this is a fundamental definitional issue about how RLS is supposed
> to work.  I'm not going to deny that we COULD ship a release without
> deciding what the behavior should be there, but I don't think it's a
> good idea.  I am fine with the possibility that one of our new
> features may, say, dump core someplace due to a NULL pointer dereference
> we haven't found yet.  Such bugs can always exist, but they are easily
> fixed once found.  But if we're not clear on how a feature is supposed
> to behave, which seems to be the case here, I favor trying to resolve
> that issue before shipping anything.  Otherwise, we're saying "test
> this, even though the final version will likely work differently".
> That's not really helpful for us and will discourage testers from
> doing anything at all.

The other side of that coin is that we might get useful comments from
testers on how the feature ought to work.  I don't agree with the notion
that all feature details must be graven on stone tablets before we start
trying to get feedback from people outside the core development community.

The same point applies to the FDW C API questions, or to RLS, or to the
"expanded objects" work that I did.  (I'd really love it if the PostGIS
folk would try to use that sometime before it's too late to adjust the
definition...)  Now, you could argue that people likely to have useful
input on those issues are fully capable of working with git tip, and you'd
probably be right, but would they do so?  As Simon says nearby, publishing
an alpha/beta/whatever is our signal to the wider community that it's time
for them to start paying attention.  I do not think they will look at 9.5
until we do that; and I think it'll be our loss if they don't start
looking at these things soon.
        regards, tom lane



Re: [CORE] Restore-reliability mode

From
Alvaro Herrera
Date:
Michael Paquier wrote:
> On Fri, Jun 5, 2015 at 8:53 AM, Craig Ringer <craig@2ndquadrant.com> wrote:

> > In terms of specific testing improvements, things I think we need to have
> > covered and runnable on the buildfarm are:
> >
> > * pg_dump and pg_restore testing (because it's scary we don't do this)
> 
> We do test it in some way with pg_upgrade using set of objects that
> are not removed by the regression test suite. Extension dumps are
> uncovered yet though.

We could put more emphasis on having objects of all kinds remain in the
regression database, so that the pg_upgrade test covers more of this.

What happened with the extension tests patches you submitted?  They
seemed valuable to me, but I lost track.

> > * DDL deparse test coverage for all operations
> 
> What do you have in mind except what is already in objectaddress.sql
> and src/test/modules/test_ddl_deparse/?

The current SQL scripts in that test do not cover all possible object
types, so there's a lot of the decoding capabilities that are currently
not exercised.  So one way to attack this would be to add more object
types to those files.  However, a completely different way is to have
the test process serial_schedule from src/test/regress and run
everything in there under deparse.  That would be even more useful,
because whenever some future DDL is added, we will automatically get
coverage.

> > How would a test that would've caught the multixact issues look?
> 
> I have not followed closely those discussions, not sure about that.

One issue with these bugs is that unless you use things such as
pg_burn_multixact, producing large enough numbers of multixacts takes a
long time.  I've wondered if we could somehow make those easier to
reproduce by lowering the range, and thus doing thousands of
wraparounds, freezing and truncations in reasonable time.  (For example,
change the typedefs to uint16 rather than uint32).  But then the issue
becomes that the test code is not exactly equivalent to the production
code, which could cause its own bugs.
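
A toy illustration of why a narrower counter makes wraparound cheap to reach, written in Python and not taken from the PostgreSQL source. The names precedes and count_wraparounds are invented for this sketch. It assumes a plain modular counter and ignores any reserved or special values a real ID space would have.

```python
def precedes(a: int, b: int, bits: int = 32) -> bool:
    """True if counter value a is logically before b, modulo 2**bits.

    This mirrors the usual "signed difference is negative" comparison
    used for circular counters.
    """
    mask = (1 << bits) - 1
    half = 1 << (bits - 1)
    # (a - b) reduced modulo 2**bits; values in the upper half of the
    # range correspond to a negative signed difference.
    return ((a - b) & mask) >= half


def count_wraparounds(steps: int, bits: int) -> int:
    """Advance a counter `steps` times and count how often it wraps to 0."""
    mask = (1 << bits) - 1
    value = 0
    wraps = 0
    for _ in range(steps):
        nxt = (value + 1) & mask
        if nxt < value:  # the counter rolled over
            wraps += 1
        value = nxt
    return wraps


if __name__ == "__main__":
    steps = 200_000
    print("16-bit wraparounds:", count_wraparounds(steps, 16))
    print("32-bit wraparounds:", count_wraparounds(steps, 32))
```

The same number of steps that never comes near the limit of a 32-bit counter wraps a 16-bit one several times, which is the appeal of the idea. The caveat in the message still applies: a build with narrowed types is not the code that ships.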

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [CORE] Restore-reliability mode

From
Andres Freund
Date:
On 2015-06-05 11:20:52 -0400, Robert Haas wrote:
> I don't maintain that end-user testing is unuseful at this point.

Unless I misunderstand you, and you're not saying that user level
testing wouldn't be helpful right now, I'm utterly baffled. There's
loads of user-exposed features that desperately need exposure.

Looking at https://wiki.postgresql.org/wiki/What%27s_new_in_PostgreSQL I
don't see a single item that correlates with the ones on the open items
list. Sure, it's incomplete. But that's a lot of stuff to test
already. And the authors of those features can work on fixing the issues
coming up.  Lots of those features have barely got any testing at this
point.

> do maintain that it would be better to (1) finish fixing the known
> multixact bugs and (2) clean up some of the open items before we make
> a big push in that direction.

There's maybe 3-4 people that can actually do something about the
existing issues on that list. The community is far bigger than
that. Right now everyone is sitting on the sidelines and twiddling their
thumbs or developing new stuff. At least that's my impression.

> 2. custom-join has no way to construct Plan nodes of child Path nodes
> - The entire feature is a C API, and the API needs to be changed.  We
> should finalize the API before asking people to test whether they can
> use it for interesting things.

I think any real world exposure of that API will result in much larger
changes than that.

> 3. recovery_target_action = pause & hot_standby = off - Rumor has it
> we replaced one surprising behavior with a different but
> equally-surprising behavior.  We should decide what the right thing is
> and make sure the code is doing that before calling it a release.

Fujii pushed the bugfix, restoring the old behaviour afaics. It's imo
still crazy, but at this point it doesn't look like a 9.5 discussion.

> 4. Arguable RLS security bug, EvalPlanQual() paranoia - This seems
> like another question of what the expectations around RLS actually
> are.

In the end that's minor from the end user's perspective.

> I would also argue that we really ought to make a decision about
> "basebackups during ALTER DATABASE ... SET TABLESPACE ... not safe"
> before we get too close to final release.  Maybe it's not a
> beta-blocker, exactly, but it doesn't seem like the sort of change
> that should be rushed in too close to the end, because it looks sorta
> complicated and scary.  (Those are the technical terms.)

Yea, I'd really like to get that in at some point. I'll work on it as
soon as I've finished the multixact truncation thingy.


Greetings,

Andres Freund



Re: [CORE] Restore-reliability mode

From
Bruce Momjian
Date:
On Fri, Jun  5, 2015 at 05:36:41PM +0200, Andres Freund wrote:
> I don't think anything as localized as 'do nothing but bugfixes for a
> while and then carry on' actually will solve the problem. We need to
> find and reallocate resources to put more emphasis on review, robustness
> and refactoring in the long term, not do panick-y stuff short term. This
> isn't a problem that can be solved by focusing on bugfixing for a week
> or four.

Fine.  We just need that refocus, and people usually can't refocus while
they are worried about other pressures, e.g. time --- it's like trying to
adjust the GPS while driving --- not easy.

> That means we have to convince employers to actually *pay* us (people
> experienced with the codebase) to do work on these kind of things
> instead of much-easier-to-market new features. A lot of
> review/robustness work has been essentially done in our spare time,
> after long days. Which means the employers need to get more people.

Agreed --- that is a serious long-term need.

> > Sure.  I think everyone agrees the multi-xact work is all good, so I am
> > asking what else needs this kind of research.  If there is nothing else,
> > we can move forward again --- I am just saying we need to ask the
> > reliability question _first_.
> 
> I'm starting to get grumpy here. You've called for review in lots of
> emails now. Let's get going then?

I really don't know.  If people say we don't have anything like
multi-xact that we have avoided, then I have no further concerns.  I am
asking that such decisions be made independent of external time
pressures.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +



Re: [CORE] Restore-reliability mode

From
Simon Riggs
Date:
On 5 June 2015 at 16:05, Bruce Momjian <bruce@momjian.us> wrote:

Please address some of the specific issues I mentioned. 

I can discuss them but not because I am involved directly. I take responsibility as a committer and have an interest from that perspective.

In my role at 2ndQuadrant, I approved all of the time Alvaro and Andres spent on submitting, reviewing and fixing bugs - at this point that has cost something close to fifty thousand dollars just on this feature and subsequent actions. (I believe the feature was originally funded, but we never saw a penny of that, though others did.)
 
The problem
with the multi-xact case is that we just kept fixing bugs as people
found them, and did not do a holistic review of the code. 

I observed much discussion and review. The bugs we've had have all been fairly straightforwardly fixed. There haven't been any design-level oversights or head-palm moments. It's complex software that had complex behaviour that caused problems. The problem has been that anything on-disk causes more problems when errors occur. We should review carefully anything that alters the way on-disk structures work, like the WAL changes, UPSERT's new mechanism, etc.

From my side, it is only recently I got some clear answers to my questions about how it worked. I think it is very important that major features have extensive README type documentation with them so the underlying principles used in the development are clear. I would define the measure of a good feature as whether another committer can read the code comments and get a good feel. A bad feature is one where committers walk away from it, saying I don't really get it and I can't read an explanation of why it does that. Tom's most significant contribution is his long descriptive comments on what the problem is that needs to be solved, the options and the method chosen. Clarity of thought is what solves bugs.

Overall, I don't see the need to stop the normal release process and do a holistic review. But I do think we should check each feature to see whether it is fully documented or whether we are simply trusting one of us to be around to fix it.

> I am just saying we need to ask the
> reliability question _first_.

Agreed
 
> Let me restate something that has appeared in many replies to my ideas
> --- I am not asking for infinite or unbounded review, but I am asking
> that we make sure reliability gets the proper focus in relation to our
> time pressures.  Our balance was so off a month ago that I feel only a
> full stop on time pressure would allow us to refocus because people are
> not good at focusing on multiple things. It is sometimes necessary to
> stop everything to get people's attention, and to help them remember
> that without reliability, a database is useless.

Here, I think we are talking about different types of reliability. PostgreSQL software is well ahead of most industry measures of quality; these recent bugs have done nothing to damage that, other than a few people woke up and said "Wow! Postgres had a bug??!?!?". The presence of bugs is common and if we have grown unused to them, we should be wary of that, though not tolerant.

PostgreSQL is now reliable in the sense that we have many features that ensure availability even in the face of software problems and bug induced corruption. Those have helped us get out of the current situations, giving users a workaround while bugs are fixed. So the impact of database software bugs is not what it once was.

Reliable delivery of new versions of software is important too. New versions often contain new features that fix real world problems, just as much as bug fixes do, hence why I don't wish to divert from the normal process and schedule.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [CORE] Restore-reliability mode

From
Jim Nasby
Date:
On 6/4/15 11:28 PM, Michael Paquier wrote:
<list of things to test>
* More configuration variations; ./configure, initdb options, and *.conf
* More edge-case testing. (ie: what happens to each varlena as it 
approaches 1GB? 1B tables test. Etc.)
* More race-condition testing, like the tool Peter used heavily during 
ON CONFLICT development (written by Jeff Janes?)
* More non-SQL testing. For example, the logic in HeapTupleSatisfies* is 
quite complicated yet there's no tests dedicated to ensuring the logic 
is correct because it'd be extremely difficult (if not impossible) to 
construct those tests at a SQL level. Testing them with direct test 
calls to HeapTupleSatisfies* wouldn't be difficult, but we have no 
machinery to do C level testing.

>> Is pg_tap a reasonable starting point for this sort of testing?
> IMO, using the TAP machinery would be a good base for that. What lacks
> is a basic set of perl routines that one can easily use to set up test
> scenarios.

I think Stephen was referring specifically to pgTap (http://pgtap.org/).

Isn't our TAP framework just different output from pg_regress? Is there 
documentation on our TAP stuff?

>> >How would a test that would've caught the multixact issues look?
> I have not followed closely those discussions, not sure about that.

I've thought about this and unfortunately I think this may be a scenario 
that's just too complex to completely protect against with a test. What 
might help though is having better testing of edge cases (such as MXID 
wrap) and then combining that with other forms of testing, such as 
pg_upgrade and streaming rep. testing. Test things like "What happens if 
we pg_upgrade a cluster that's in danger of wraparound?"
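
As an illustration of the kind of edge-case test meant here, the following is a minimal sketch of exercising counter ordering around a 32-bit wrap. It assumes the counter is a 32-bit value compared modulo 2^32; that is an assumption of this sketch, and the names `precedes` and `advance` are hypothetical. Reserved values and the actual PostgreSQL routines are not modelled.

```python
# Sketch only: wraparound-aware ordering of a 32-bit counter, the kind of
# logic an edge-case test near MXID wrap would want to exercise.
# All names here are hypothetical; this is not PostgreSQL code.

MASK32 = 0xFFFFFFFF


def precedes(a: int, b: int) -> bool:
    """Return True if counter a logically precedes b, modulo 2**32."""
    diff = (a - b) & MASK32
    # Reinterpret the 32-bit difference as signed: values >= 2**31 are negative.
    return diff >= 0x80000000


def advance(value: int, steps: int) -> int:
    """Move the counter forward, wrapping at 2**32."""
    return (value + steps) & MASK32


# Start just below the wrap point and step across it.
near_wrap = 0xFFFFFFF0
after_wrap = advance(near_wrap, 0x20)

# A naive integer comparison gets the order backwards after the wrap,
# while the modular comparison keeps it.
naive_order_ok = near_wrap < after_wrap
modular_order_ok = precedes(near_wrap, after_wrap)
```

A real test would combine a check like this with the other forms of testing mentioned above, for example running it against a cluster that has just been through pg_upgrade.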
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: [CORE] Restore-reliability mode

From
Jim Nasby
Date:
On 6/5/15 10:39 AM, Tom Lane wrote:
> The other side of that coin is that we might get useful comments from
> testers on how the feature ought to work.  I don't agree with the notion
> that all feature details must be graven on stone tablets before we start
> trying to get feedback from people outside the core development community.

+1

> The same point applies to the FDW C API questions, or to RLS, or to the
> "expanded objects" work that I did.  (I'd really love it if the PostGIS
> folk would try to use that sometime before it's too late to adjust the
> definition...)  Now, you could argue that people likely to have useful
> input on those issues are fully capable of working with git tip, and you'd
> probably be right, but would they do so?  As Simon says nearby, publishing
> an alpha/beta/whatever is our signal to the wider community that it's time
> for them to start paying attention.  I do not think they will look at 9.5
> until we do that; and I think it'll be our loss if they don't start
> looking at these things soon.

+1, but I also think we should have a better mechanism for soliciting 
user input on these things while design discussions are happening. ISTM 
that there's a lot of hand-waving that happens around use cases that 
could probably be clarified with end user input.

FWIW, I don't think the blocker here is git or building from source. If 
someone has that amount of time to invest it's not much different than 
grabbing a tarball.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: [CORE] Restore-reliability mode

From
Alvaro Herrera
Date:
Simon Riggs wrote:
> On 5 June 2015 at 15:00, Robert Haas <robertmhaas@gmail.com> wrote:

> > Stamping it a beta implies that we think it's something fairly
> > stable that we'd be pretty happy to release if things go well, which
> > is a higher bar to clear.
> 
> We don't have a clear definition of what Beta means. For me, Beta has
> always meant "trial software, please test".

I think that definition *is* the problem, actually.  To me, "beta" means
"trial software, please test, but final product will be very similar to
what you see here".  What we need to convey at this point is what you
said, but I think a better word for that is "alpha".  There may be more
mobility in there than in a beta, in users' perception, which is the
right impression we want to convey.

Another point is that historically, once we've released a beta, we're
pretty reluctant to bump catversion.  We're not ready for that at this
stage, which is one criterion that suggests to me that we're not ready
for beta.

So I think the right thing to do at this point is to get an alpha out,
shortly after releasing upcoming minors.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [CORE] Restore-reliability mode

From
Robert Haas
Date:
On Fri, Jun 5, 2015 at 11:18 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> We don't have a clear definition of what Beta means. For me, Beta has always
> meant "trial software, please test".
>
> I don't think anybody will say anything bad about us if we release a beta
> and then later pull some of the features because we are not confident with
> them when AFTER testing the feature is shown to be below our normal
> standard; that will bring us credit, I feel. It is extremely common in
> software development to defer some of the features if their goals aren't
> met, or to change APIs and interfaces based upon user feedback.

Yeah, but we usually haven't.  Tom, for example, has previously not
wanted to even bump catversion after beta1, which rules out a huge
variety of possible fixes and interface changes.  If we want to make a
policy decision to change our approach, we should be up-front about
that.

> None of this means I disagree with assessments of the current state of the
> software, I'm saying that we should simply follow the normal process and
> stick to the schedule we have previously agreed, for all of the reasons
> cited when we agreed it.

Well, to my way of looking at it, our feature freeze was later this
year than it has been in the past, so our beta will be later, too.  If
we want to stick with the schedule, we have to do that throughout.
Our typical schedule has been a two-month final CommitFest starting on
January 15th.  This year we had a three month final CommitFest
starting on February 15th.  So we finished the last CommitFest two
months later than has been typical.

Typically our beta has been in early May, 1-2 months after the end of
the last CommitFest.  If you add the same two months to that, you get
early July, which sounds reasonable, rather than early June, which
sounds rushed, especially since we have an urgent need to get minor
releases out the door to fix critical stability bugs right now, and
then we have PGCon, during which nobody's going to be looking at
anything.

It sounds to me like the original plan was to put out a beta in early
June, which would have been fine if we'd stuck to the traditional
2-month final CommitFest.  But we didn't.
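
The date arithmetic above can be checked with a short sketch. The year used for the "typical" schedule and the exact day standing in for "early May" are assumptions made only for illustration, and `add_months` is a hypothetical helper.

```python
from datetime import date


def add_months(d: date, months: int) -> date:
    """Shift a date by whole months, keeping the day of month.

    The day must exist in the target month (true for the 1st and 15th).
    """
    total = d.month - 1 + months
    return date(d.year + total // 12, total % 12 + 1, d.day)


# Typical schedule: two-month final CommitFest starting January 15th.
typical_end = add_months(date(2015, 1, 15), 2)

# This year: three-month final CommitFest starting February 15th.
actual_end = add_months(date(2015, 2, 15), 3)

# How many months later than usual the final CommitFest ended.
slip_months = (actual_end.year - typical_end.year) * 12 + (
    actual_end.month - typical_end.month
)

# "Early May" is represented by May 1st; the exact day is an assumption.
typical_beta = date(2015, 5, 1)
shifted_beta = add_months(typical_beta, slip_months)

print(slip_months, shifted_beta.isoformat())
```

Under these assumptions the slip is two months and the shifted beta date lands at the start of July, matching the reasoning in the text.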

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [CORE] Restore-reliability mode

From
Josh Berkus
Date:
On 06/05/2015 07:23 AM, Tom Lane wrote:
> So let's call it an alpha, or some other way of setting expectations
> appropriately.  But I think it's silly to maintain that the code is not in
> a state where end-user testing is useful.  They just have to understand
> that they can't trust it with production data.

Yes ... that seems like a good compromise.

Frankly, I'm testing 9.5 already; having alpha packages would make that
testing easier for me, and maybe possible for others.

We'd need to take into account that our packagers are a bit overworked
this month due to update releases ...

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: [CORE] Restore-reliability mode

From
Peter Geoghegan
Date:
On Fri, Jun 5, 2015 at 8:51 AM, Andres Freund <andres@anarazel.de> wrote:
>> 4. Arguable RLS security bug, EvalPlanQual() paranoia - This seems
>> like another question of what the expectations around RLS actually
>> are.
>
> In the end that's minor from the end user's perspective.

I think that depends on what we ultimately decide to do about it,
which is something that I have yet to form an opinion on (although I
know we need to document the issue, at the very least). For example,
one idea that Stephen and I discussed privately was making security
barrier quals referencing other relations lock the referenced rows.
This was an informal throwing around of ideas, but it's possible that
something like that could end up happening.

-- 
Peter Geoghegan



Re: [CORE] Restore-reliability mode

From
Peter Geoghegan
Date:
On Fri, Jun 5, 2015 at 7:00 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> I do agree that an indefinite development freeze with unclear
> parameters for resuming development and unclear goals is a bad plan.
> But I think giving ourselves a little more time to, say, turn the
> buildfarm consistently green, and, say, fix the known but
> currently-unfixed multixact bugs, and, say, fix the known bugs in 9.5
> features is a good plan, and I hope you and others will support it.

FWIW, I have 3 pending bug fixes for UPSERT. While those are pretty
benign issues, I'd be annoyed if they didn't get into the first 9.5
beta (or alpha, even).

-- 
Peter Geoghegan



Re: [CORE] Restore-reliability mode

From
Bruce Momjian
Date:
On Fri, Jun  5, 2015 at 04:54:56PM +0100, Simon Riggs wrote:
> On 5 June 2015 at 16:05, Bruce Momjian <bruce@momjian.us> wrote:
> 
> 
>     Please address some of the specific issues I mentioned. 
> 
> 
> I can discuss them but not because I am involved directly. I take
> responsibility as a committer and have an interest from that perspective.
> 
> In my role at 2ndQuadrant, I approved all of the time Alvaro and Andres spent
> on submitting, reviewing and fixing bugs - at this point that has cost
> something close to fifty thousand dollars just on this feature and subsequent
> actions. (I believe the feature was originally funded, but we never saw a penny
> of that, though others did.)

Yes, the burden has fallen heavily on Alvaro.  I personally am concerned
that many people were focusing on 9.5 rather than helping him.  I think
that was a mistake on our part and we need to take reliability problems
more seriously.

What has also concerned me is that there are so many 9.3/9.4 bugs in
this area that few of us can even understand what was fixed when, and we
are then having problems figuring out what bugs were present when
analyzing bug reports.  pg_upgrade has made this worse by allowing
multi-xact bugs to propagate across major versions, and pg_upgrade had
some multi-xact bugs of its own in early 9.3 releases. :-(

>     The problem
>     with the multi-xact case is that we just kept fixing bugs as people
>     found them, and did not do a holistic review of the code. 
> 
> 
> I observed much discussion and review. The bugs we've had have all been fairly
> straightforwardly fixed. There haven't been any design-level oversights or
> head-palm moments. It's complex software that had complex behaviour that caused
> problems. The problem has been that anything on-disk causes more problems when
> errors occur. We should review carefully anything that alters the way on-disk
> structures work, like the WAL changes, UPSERT's new mechanism, etc.

Agreed.  However, I think a thorough review early on could have caught
many of these bugs before they were reported by users.  As proof, even
in the past few weeks, review is finding bugs before they are found by
users.

> From my side, it is only recently I got some clear answers to my questions
> about how it worked. I think it is very important that major features have
> extensive README type documentation with them so the underlying principles used
> in the development are clear. I would define the measure of a good feature as
> whether another committer can read the code comments and get a good feel. A bad
> feature is one where committers walk away from it, saying I don't really get it
> and I can't read an explanation of why it does that. Tom's most significant
> contribution is his long descriptive comments on what the problem is that needs
> to be solved, the options and the method chosen. Clarity of thought is what
> solves bugs.

Yes, I think we should have done that early-on for multi-xact, and I am
hopeful we will learn to do that more often when complex features are
implemented, or when we identify areas that are more complex than we
thought.

> Overall, I don't see the need to stop the normal release process and do a
> holistic review. But I do think we should check each feature to see whether it
> is fully documented or whether we are simply trusting one of us to be around to
> fix it.

Agreed.  We just need to be honest that we are doing what we need for
reliability and not allow schedule and feature pressure to cause us to
skimp in this area.

>     I am just saying we need to ask the
>     reliability question _first_.
> 
> 
> Agreed
>  
> 
>     Let me restate something that has appeared in many replies to my ideas
>     --- I am not asking for infinite or unbounded review, but I am asking
>     that we make sure reliability gets the proper focus in relation to our
>     time pressures.  Our balance was so off a month ago that I feel only a
>     full stop on time pressure would allow us to refocus because people are
>     not good at focusing on multiple things. It is sometimes necessary to
>     stop everything to get people's attention, and to help them remember
>     that without reliability, a database is useless.
> 
> 
> Here, I think we are talking about different types of reliability. PostgreSQL
> software is well ahead of most industry measures of quality; these recent bugs
> have done nothing to damage that, other than a few people woke up and said
> "Wow! Postgres had a bug??!?!?". The presence of bugs is common and if we have
> grown unused to them, we should be wary of that, though not tolerant.

In going over the 9.5 commits, I was struck by a high volume of cleanups
and fixes, which is good.

> PostgreSQL is now reliable in the sense that we have many features that ensure
> availability even in the face of software problems and bug induced corruption.
> Those have helped us get out of the current situations, giving users a
> workaround while bugs are fixed. So the impact of database software bugs is not
> what it once was.

Uh, yes, we have avoided the worst of the impact from these bugs.  In my
understanding, each bug has X% chance of being serious, and you might go
for a long time before a serious bug is created, but the more bugs we
have, the more likely that one will be serious.  The _volume_ of multi-xact
bugs should have triggered a review much sooner.
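
The "X% chance per bug" argument can be made concrete with a small sketch. It assumes each bug is independently serious with the same probability, which is a simplification, and the 5% figure below is purely hypothetical.

```python
def p_at_least_one_serious(n_bugs: int, p_serious: float) -> float:
    """Probability that at least one of n_bugs is serious.

    Assumes each bug is serious independently with probability p_serious.
    """
    return 1.0 - (1.0 - p_serious) ** n_bugs


# Hypothetical per-bug chance of being serious; not a measured value.
P_SERIOUS = 0.05

for n in (1, 10, 30):
    print(n, round(p_at_least_one_serious(n, P_SERIOUS), 3))
```

With these assumed numbers the chance of at least one serious bug grows from 5% for a single bug to about 40% for ten, which is the sense in which volume alone is a warning sign.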

People think I want to stop feature development to review.  What I am
saying is that we need to stop development so we can be honest about
whether we need review, and where.  It is hard to be honest when time
and feature pressure are on you.  It shouldn't take long to make that
decision as a group.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +



Re: [CORE] Restore-reliability mode

From
Michael Paquier
Date:
On Sat, Jun 6, 2015 at 12:05 AM, Alvaro Herrera wrote:
> Michael Paquier wrote:
> What happened with the extension tests patches you submitted?  They
> seemed valuable to me, but I lost track.

Those ones are registered in the queue of 9.6:
https://commitfest.postgresql.org/5/187/
And this is the latest patch:
http://www.postgresql.org/message-id/CAB7nPqSQr1UjZ1h8=be1wBq3mMdmM38nrjBKvBJuM--tTTY=EA@mail.gmail.com
This patch extends prove_check by giving the possibility for a given
utility using t/ to add extra modules in t/extra that will be
installed and usable for its regression tests. This becomes more
interesting considering as well that pg_upgrade could be switched to
use the TAP infrastructure, where we could have modules dedicated to
only the tests of pg_upgrade (supporting TAP tests on Windows is a
necessary condition though before switching pg_upgrade).
-- 
Michael



Re: [CORE] Restore-reliability mode

From
Simon Riggs
Date:
On 5 June 2015 at 17:20, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Simon Riggs wrote:
> > On 5 June 2015 at 15:00, Robert Haas <robertmhaas@gmail.com> wrote:
>
> > > Stamping it a beta implies that we think it's something fairly
> > > stable that we'd be pretty happy to release if things go well, which
> > > is a higher bar to clear.
> >
> > We don't have a clear definition of what Beta means. For me, Beta has
> > always meant "trial software, please test".
>
> I think that definition *is* the problem, actually.  To me, "beta" means
> "trial software, please test, but final product will be very similar to
> what you see here".  What we need to convey at this point is what you
> said, but I think a better word for that is "alpha".  There may be more
> mobility in there than in a beta, in users' perception, which is the
> right impression we want to convey.
>
> Another point is that historically, once we've released a beta, we're
> pretty reluctant to bump catversion.  We're not ready for that at this
> stage, which is one criterion that suggests to me that we're not ready
> for beta.
>
> So I think the right thing to do at this point is to get an alpha out,
> shortly after releasing upcoming minors.

OK, I can get behind that.

My only additional point is that it is a good idea to release an Alpha every time, not just this release.

And if it's called Alpha, let's release it immediately. We can allow Alpha1, Alpha2 as needed, plus we allow catversion and file format changes between Alpha versions.

Proposed definitions

Alpha: This is trial software; please actively test and report bugs. Your feedback is sought on usability and performance, which may result in changes to the features included here. Not all known issues have been resolved but work continues on resolving them. Multiple Alpha versions may be released before we move to Beta. We reserve the right to change internal API definitions, file formats and increment the catalog version between Alpha versions and Beta, so we do not guarantee an easy upgrade path from this version to later versions of this release.

Beta: This is trial software; please actively test and report bugs and performance issues. Multiple Beta versions may be released before we move to Release Candidate. We will attempt to maintain APIs, file formats and catversions.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [CORE] Restore-reliability mode

From
Gavin Flower
Date:
On 06/06/15 21:07, Simon Riggs wrote:
> On 5 June 2015 at 17:20, Alvaro Herrera <alvherre@2ndquadrant.com 
> <mailto:alvherre@2ndquadrant.com>> wrote:
>
>     Simon Riggs wrote:
>     > On 5 June 2015 at 15:00, Robert Haas <robertmhaas@gmail.com
>     <mailto:robertmhaas@gmail.com>> wrote:
>
>     > > Stamping it a beta implies that we think it's something fairly
>     > > stable that we'd be pretty happy to release if things go well,
>     which
>     > > is a higher bar to clear.
>     >
>     > We don't have a clear definition of what Beta means. For me,
>     Beta has
>     > always meant "trial software, please test".
>
>     I think that definition *is* the problem, actually.  To me, "beta"
>     means
>     "trial software, please test, but final product will be very
>     similar to
>     what you see here".  What we need to convey at this point is what you
>     said, but I think a better word for that is "alpha". There may be more
>     mobility in there than in a beta, in users's perception, which is the
>     right impression we want to convey.
>
>     Another point is that historically, once we've released a beta, we're
>     pretty reluctant to bump catversion.  We're not ready for that at this
>     stage, which is one criteria that suggests to me that we're not ready
>     for beta.
>
>     So I think the right thing to do at this point is to get an alpha out,
>     shortly after releasing upcoming minors.
>
>
> OK, I can get behind that.
>
> My only additional point is that it is a good idea to release an Alpha 
> every time, not just this release.
>
> And if it's called Alpha, let's release it immediately. We can allow 
> Alpha1, Alpha2 as needed, plus we allow catversion and file format 
> changes between Alpha versions.
>
> Proposed definitions
>
> Alpha: This is trial software please actively test and report bugs. 
> Your feedback is sought on usability and performance, which may result 
> in changes to the features included here. Not all known issues have 
> been resolved but work continues on resolving them. Multiple Alpha 
> versions may be released before we move to Beta. We reserve the right 
> to change internal API definitions, file formats and increment the 
> catalog version between Alpha versions and Beta, so we do not 
> guarantee an easy upgrade path from this version to later versions of 
> this release.
>
> Beta: This is trial software please actively test and report bugs and 
> performance issues. Multiple Beta versions may be released before we 
> move to Release Candidate. We will attempt to maintain APIs, file 
> formats and catversions.
>
> -- 
> Simon Riggs http://www.2ndQuadrant.com/ <http://www.2ndquadrant.com/>
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
As a 'user' I am very happy with the idea of having Alphas; it gives me a 
feeling that there will be less chance of problems being released in the 
final version.

Not only does it give more chances to test, but it might also encourage 
more people to get involved in contributing, either ideas for minor 
tweaks or simple patches etc. (the release being not quite finished, with 
an expectation that minor functional changes have a possibility of being 
accepted for the version, if there is sufficient merit).


Cheers,
Gavin



Re: [CORE] Restore-reliability mode

From
Magnus Hagander
Date:
On Sat, Jun 6, 2015 at 11:07 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 5 June 2015 at 17:20, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> > Simon Riggs wrote:
> > > On 5 June 2015 at 15:00, Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > > > Stamping it a beta implies that we think it's something fairly
> > > > stable that we'd be pretty happy to release if things go well, which
> > > > is a higher bar to clear.
> > >
> > > We don't have a clear definition of what Beta means. For me, Beta has
> > > always meant "trial software, please test".
> >
> > I think that definition *is* the problem, actually.  To me, "beta" means
> > "trial software, please test, but final product will be very similar to
> > what you see here".  What we need to convey at this point is what you
> > said, but I think a better word for that is "alpha".  There may be more
> > mobility in there than in a beta, in users' perception, which is the
> > right impression we want to convey.
> >
> > Another point is that historically, once we've released a beta, we're
> > pretty reluctant to bump catversion.  We're not ready for that at this
> > stage, which is one criterion that suggests to me that we're not ready
> > for beta.
> >
> > So I think the right thing to do at this point is to get an alpha out,
> > shortly after releasing upcoming minors.
>
> OK, I can get behind that.
>
> My only additional point is that it is a good idea to release an Alpha every time, not just this release.
>
> And if it's called Alpha, let's release it immediately. We can allow Alpha1, Alpha2 as needed, plus we allow catversion and file format changes between Alpha versions.


If I'm not mistaken, we (Simon and I) actually discussed something else along this line a while ago that might be worth considering. That is, maybe we should consider time-based alpha releases. That is, we can just decide "we wrap an alpha every other Monday until we think we are good to go with beta". The reason for that is to get much quicker iteration on bugfixes, which would encourage people to use and test these versions. Report a bug and, if it was easy enough to fix, you have a wrapped release with the fix in two weeks, tops.

This would require that we can (at least mostly) automate the wrapping of an alpha release, but I'm pretty sure we can solve that problem. We can also, I think, get away with doing the release notes for an alpha just as a wiki page and a lot less formal than others, meaning we don't need to hold up any process for that.

Package availability would depend on platform. For those platforms where package building is more or less entirely automatic already, this could probably also be easily automated. And for those that take a lot more work, such as the Windows installers, we could just go with wrapping every other or every third alpha. As this is not a production release, I don't see why we'd need to hold some back to cover for the rest.
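
The "every other Monday" cadence is simple to generate mechanically. The following is a minimal sketch, not an existing release tool; the function name `alpha_wrap_dates` and the chosen start date are assumptions made for illustration.

```python
from datetime import date, timedelta


def alpha_wrap_dates(start: date, count: int) -> list:
    """Return `count` wrap dates: the first Monday on or after `start`,
    then every 14 days after that."""
    days_to_monday = (0 - start.weekday()) % 7  # Monday is weekday 0
    first = start + timedelta(days=days_to_monday)
    return [first + timedelta(days=14 * i) for i in range(count)]


# Hypothetical starting point: the date of this message.
dates = alpha_wrap_dates(date(2015, 6, 6), 4)

for n, d in enumerate(dates, start=1):
    print(f"alpha{n}: {d.isoformat()}")
```

A platform that only packages every other or every third alpha, as suggested above for the Windows installers, could simply take every second or third entry from this list.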


 

> Proposed definitions
>
> Alpha: This is trial software; please actively test and report bugs. Your feedback is sought on usability and performance, which may result in changes to the features included here. Not all known issues have been resolved but work continues on resolving them. Multiple Alpha versions may be released before we move to Beta. We reserve the right to change internal API definitions, file formats and increment the catalog version between Alpha versions and Beta, so we do not guarantee an easy upgrade path from this version to later versions of this release.
>
> Beta: This is trial software; please actively test and report bugs and performance issues. Multiple Beta versions may be released before we move to Release Candidate. We will attempt to maintain APIs, file formats and catversions.


These sound like good definitions. Might add to the beta one something like "whilst we will try to avoid it, pg_upgrade may be required between betas and from beta to rc versions". 

--

Re: [CORE] Restore-reliability mode

From
Devrim GÜNDÜZ
Date:
Hi,

On Sat, 2015-06-06 at 12:15 +0200, Magnus Hagander wrote:
> If I'm not mistaken, we (Simon and me) actually discussed something
> else along this line a while ago that might be worth considering. That
> is, maybe we should consider time-based alpha releases. That is, we
> can just decide "we wrap an alpha every other Monday until we think we
> are good to go with beta". The reason for that is to get much quicker
> iteration on bugfixes, which would encourage people to use and test
> these versions. Report a bug and, if it was easy enough to fix, you
> have a wrapped release with the fix in two weeks, tops.

+1. 

> Package availability would depend on platform. For those platforms
> where package building is more or less entirely automatic already,
> this could probably also be easily automated.

When we used to release more alphas years ago, I was releasing Alpha
RPMs for many platforms. I'll do it again if we keep doing it.

Regards,

-- 
Devrim GÜNDÜZ
Principal Systems Engineer @ EnterpriseDB: http://www.enterprisedb.com
PostgreSQL Danışmanı/Consultant, Red Hat Certified Engineer
Twitter: @DevrimGunduz , @DevrimGunduzTR





Re: [CORE] Restore-reliability mode

From
Geoff Winkless
Date:

To play devil's advocate for a moment, is there anyone who would genuinely be prepared to download and install an alpha release who would not already have downloaded one of the nightlies? I only ask because I assume that releasing an alpha is not zero-developer-cost and I don't believe that there's a large number of people who would be happy to install something that's described as being buggy and subject to change but are put off by having to type "configure" and "make".

Further, it seems to me that the number of people who won't roll their own who are useful as bug-finders is even smaller.

I get the feeling that the argument appears to be "Bruce doesn't want to release a beta, Simon wants to release something. Let's release an alpha because it's sort-of half way in between" as a consensus compromise (I'm not deliberately picking on specific people, I'm aware you're not the only two involved and arguing for either side, but you do seem to be fairly polar opposite sides of the argument :) ); I don't really believe that releasing an alpha moves anything further forward from a testing point of view, and I'm fairly sure that it will have just as deleterious an effect on bugfixing as would a beta, with the added disadvantage of the extra developer cost.

Geoff



Re: [CORE] Restore-reliability mode

From
Sehrope Sarkuni
Date:
On Sat, Jun 6, 2015 at 6:47 AM, Geoff Winkless <pgsqladmin@geoff.dj> wrote:
> To play devil's advocate for a moment, is there anyone who would genuinely be prepared to download
> and install an alpha release who would not already have downloaded one of the nightlies? I only ask
> because I assume that  releasing an alpha is not zero-developer-cost and I don't believe  that
> there's a large number of people who would be happy to install something that's described as being
> buggy and subject to change but are put off by having to type "configure" and "make".

I fit into that category and I would guess there would be others as
well. Having system packages available via an "apt-get install ..."
lowers the bar significantly to try things out.

As an example, I installed the 9.4 beta as soon as it was available to
run a smoke test and try out some of the new jsonb features. I'll be
doing the same with a 9.5 alpha/beta (or whatever it's called), for
both similar testing and to try out UPSERT.

It's much easier to work into dev/test setups if there are system
packages as it's just a config change to an existing script. Building
from source would require a whole new workflow that I don't have time
to incorporate.

> Further, it seems to me that the number of people who won't roll their own who are useful as bug-finders is even smaller.

That's probably true but they definitely won't find any bugs if they
don't test at all.

If it's possible to have automated packaging, even for just a subset
of platforms, I think that'd be useful.

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/



Re: [CORE] Restore-reliability mode

From
Robert Haas
Date:
On Sat, Jun 6, 2015 at 6:47 AM, Geoff Winkless <pgsqladmin@geoff.dj> wrote:
> To play devil's advocate for a moment, is there anyone who would genuinely
> be prepared to download and install an alpha release who would not already
> have downloaded one of the nightlies? I only ask because I assume that
> releasing an alpha is not zero-developer-cost and I don't believe that
> there's a large number of people who would be happy to install something
> that's described as being buggy and subject to change but are put off by
> having to type "configure" and "make".

This is pretty much why Peter Eisentraut gave up on doing alphas after
the 9.1 cycle.

Admittedly, what is being proposed here is somewhat different.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [CORE] Restore-reliability mode

From
Geoff Winkless
Date:
On 6 June 2015 at 13:41, Sehrope Sarkuni <sehrope@jackdb.com> wrote:
> On Sat, Jun 6, 2015 at 6:47 AM, Geoff Winkless <pgsqladmin@geoff.dj> wrote:
> > To play devil's advocate for a moment, is there anyone who would genuinely be prepared to download
> > and install an alpha release who would not already have downloaded one of the nightlies? I only ask
> > because I assume that releasing an alpha is not zero-developer-cost and I don't believe that
> > there's a large number of people who would be happy to install something that's described as being
> > buggy and subject to change but are put off by having to type "configure" and "make".
>
> I fit into that category and I would guess there would be others as
> well. Having system packages available via an "apt-get install ..."
> lowers the bar significantly to try things out.

But it also lowers the bar to the extent that you get the people who won't read the todo list and end up complaining about the things that everyone already knows about.

> It's much easier to work into dev/test setups if there are system
> packages as it's just a config change to an existing script. Building
> from source would require a whole new workflow that I don't have time
> to incorporate.

Really? You genuinely don't have time to paste, say:

mkdir -p ~/src/pgdevel
cd ~/src/pgdevel
wget https://ftp.postgresql.org/pub/snapshot/dev/postgresql-snapshot.tar.bz2
tar xjf postgresql-snapshot.tar.bz2
mkdir bld
cd bld
../postgresql-9.5devel/configure $(pg_config --configure | sed -e 's/\(pg\|postgresql[-\/]\)\(doc-\)\?9\.[0-9]*\(dev\)\?/\1\29.5dev/g')
make world
make check
make world-install

and yet you think you have enough time to provide more than a "looks like it's working" report to the developers?

(NB the sed for the pg_config line will probably need work, it looks like it should work on the two types of system I have here but I have to admit I changed the config line manually when I built it)

> > Further, it seems to me that the number of people who won't roll their own who are useful as bug-finders is even smaller.
>
> That's probably true but they definitely won't find any bugs if they
> don't test at all.
>
> If it's possible to have automated packaging, even for just a subset
> of platforms, I think that'd be useful.

Well yes, automated packaging of the nightly build, that doesn't involve the developers having to stop what they're doing to write official alpha release docs or any of the other stuff that goes along with doing a release, would be zero-impact on development (assuming the developers didn't have to build or maintain the auto-packager) and therefore any return (however small) would make it worthwhile.

Fancy building (and maintaining) the auto-packaging system, and managing a mailing list for its users?

Geoff
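
The sed expression in the recipe above can be tried on a sample string before it is pointed at a live pg_config. This is only a minimal sketch: it assumes GNU sed (the `\|` and `\?` operators are GNU extensions), the sample text is an invented Debian-style guess at what `pg_config --configure` might print, and `rewrite_version` is a hypothetical helper name.

```shell
#!/bin/sh
# The substitution from the recipe, wrapped in a function so it can be
# exercised on sample text instead of real pg_config output.
rewrite_version() {
    sed -e 's/\(pg\|postgresql[-\/]\)\(doc-\)\?9\.[0-9]*\(dev\)\?/\1\29.5dev/g'
}

# Invented sample; real output depends on how the installed package was built.
sample="'--libdir=/usr/lib/postgresql/9.4/lib' '--docdir=/usr/share/doc/postgresql-doc-9.4'"

out=$(printf '%s\n' "$sample" | rewrite_version)
printf '%s\n' "$out"
```

Version-specific path components such as `postgresql/9.4` and `postgresql-doc-9.4` are rewritten to `9.5dev`; arguments that carry no version are left alone. The sketch says nothing about whether the rewritten arguments are quoted the way configure expects.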

Re: [CORE] Restore-reliability mode

From
Sehrope Sarkuni
Date:
On Sat, Jun 6, 2015 at 10:35 AM, Geoff Winkless <pgsqladmin@geoff.dj> wrote:
> Really? You genuinely don't have time to paste, say:
>
> mkdir -p ~/src/pgdevel
> cd ~/src/pgdevel
> wget https://ftp.postgresql.org/pub/snapshot/dev/postgresql-snapshot.tar.bz2
> tar xjf postgresql-snapshot.tar.bz2
> mkdir bld
> cd bld
> ../postgresql-9.5devel/configure $(pg_config --configure | sed -e 's/\(pg\|postgresql[-\/]\)\(doc-\)\?9\.[0-9]*\(dev\)\?/\1\29.5dev/g')
> make world
> make check
> make world-install
>
> and yet you think you have enough time to provide more than a "looks like it's working" report to the developers?

Adding steps to an existing process to fetch and build from source is
significantly more complicated than flipping a version number. And I'm
not trying to run PG's built in tests on my machine. I want to run the
tests for my applications, and ideally, my applications themselves.

If doing so leads me to find that something doesn't work then of
course I would research and report the cause. At that point it's
something that I know will directly affect me if it's not fixed!

> Well yes, automated packaging of the nightly build, that doesn't involve the developers having to stop what they're
> doing to write official alpha release docs or any of the other stuff that goes along with doing a release, would be
> zero-impact on development (assuming the developers didn't have to build or maintain the auto-packager) and therefore
> any return (however small) would make it worthwhile.
> Fancy building (and maintaining) the auto-packaging system, and managing a mailing list for its users?

I don't have much experience in setting things like this up so I'm not
one to estimate the work load involved. If it existed though, I'd use
it.

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/



Re: [CORE] Restore-reliability mode

From
Kevin Grittner
Date:
Robert Haas <robertmhaas@gmail.com> wrote:

> Tom, for example, has previously not wanted to even bump
> catversion after beta1, which rules out a huge variety of
> possible fixes and interface changes.  If we want to make a
> policy decision to change our approach, we should be up-front
> about that.

What?!?  There have been catversion bumps between the REL?_?_BETA1
tag and the REL?_?_0 tag for 8.2, 8.3, 9.0, 9.1, 9.3, and 9.4.
(That is, it has happened on 6 of the last 8 releases.)  I don't
think we're talking about any policy change here.  We try to avoid
a catversion bump after beta if we can; we're not that reluctant to
do so if needed.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
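
The claim about catversion bumps between the BETA1 tag and the .0 tag can be checked from a git checkout. The following is a minimal sketch, not a definitive tool: `catversion_of` and `catversion_bumped` are hypothetical helper names, and the sketch assumes the header defines `CATALOG_VERSION_NO` on a single `#define` line.

```shell
#!/bin/sh
# Print the CATALOG_VERSION_NO value found in a catversion.h read from stdin.
catversion_of() {
    sed -n 's/^#define[[:space:]][[:space:]]*CATALOG_VERSION_NO[[:space:]][[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Exit status 0 when the two values differ, i.e. a bump happened.
catversion_bumped() {
    [ "$1" != "$2" ]
}

# Intended use inside a PostgreSQL git checkout (tag names follow the
# REL?_?_BETA1 / REL?_?_0 pattern mentioned above):
#   beta=$(git show REL9_4_BETA1:src/include/catalog/catversion.h | catversion_of)
#   final=$(git show REL9_4_0:src/include/catalog/catversion.h | catversion_of)
#   catversion_bumped "$beta" "$final" && echo "bumped: $beta -> $final"
```

Comparing the two extracted numbers only shows whether at least one bump happened between the tags, not how many.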



Re: [CORE] Restore-reliability mode

From
"Joshua D. Drake"
Date:
On 06/05/2015 08:07 PM, Bruce Momjian wrote:

>>  From my side, it is only recently I got some clear answers to my questions
>> about how it worked. I think it is very important that major features have
>> extensive README type documentation with them so the underlying principles used
>> in the development are clear. I would define the measure of a good feature as
>> whether another committer can read the code comments and get a good feel. A bad
>> feature is one where committers walk away from it, saying I don't really get it
>> and I can't read an explanation of why it does that. Tom's most significant
>> contribution is his long descriptive comments on what the problem is that need
>> to be solved, the options and the method chosen. Clarity of thought is what
>> solves bugs.
>
> Yes, I think we should have done that early-on for multi-xact, and I am
> hopeful we will learn to do that more often when complex features are
> implemented, or when we identify areas that are more complex than we
> thought.
>

I see this idea of the README as very useful. There are far more people 
like me in this community than Simon or Alvaro. I can test, I can break 
things, I can script up a harness but I need to understand HOW and 
the README would help allow for that.

>
> People think I want to stop feature development to review.  What I am
> saying is that we need to stop development so we can be honest about
> whether we need review, and where.  It is hard to be honest when time
> and feature pressure are on you.  It shouldn't take long to make that
> decision as a group.
>

Right. This is all about taking a step back, a deep breath, an objective 
look and then digging in with a more productive and reliable manner.

Sincerely,

JD

-- 
Command Prompt, Inc. - http://www.commandprompt.com/  503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Announcing "I'm offended" is basically telling the world you can't
control your own emotions, so everyone else should do it for you.



Re: [CORE] Restore-reliability mode

From
"Joshua D. Drake"
Date:
On 06/06/2015 07:33 AM, Robert Haas wrote:
>
> On Sat, Jun 6, 2015 at 6:47 AM, Geoff Winkless <pgsqladmin@geoff.dj> wrote:
>> To play devil's advocate for a moment, is there anyone who would genuinely
>> be prepared to download and install an alpha release who would not already
>> have downloaded one of the nightlies? I only ask because I assume that
>> releasing an alpha is not zero-developer-cost and I don't believe that
>> there's a large number of people who would be happy to install something
>> that's described as being buggy and subject to change but are put off by
>> having to type "configure" and "make".

Yes, me and everyone like me in feature set.

Compiling takes time, time that does not need to be spent. If I can push 
an alpha into a container and start testing, I will do so. If I have to:

git pull; configure --prefix; make -j8 install

Then I will likely move on to other things because my time is not free
(nor is anyone else's on this list).

If you add into this a test harness that I can execute from the alpha 
release (or another package) that allows me to instantly report via 
buildfarm or just email a tarball to -hackers that is even better.

I know that I am not taking everything into account here but remember 
that most of our users are not -hackers. They are practitioners and a 
lot of them would love to help but just can't because a lot of the 
infrastructure has never been built and -hackers think like -hackers.


Sincerely,

JD


-- 
Command Prompt, Inc. - http://www.commandprompt.com/  503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Announcing "I'm offended" is basically telling the world you can't
control your own emotions, so everyone else should do it for you.



Re: Restore-reliability mode

From
Noah Misch
Date:
On Fri, Jun 05, 2015 at 08:25:34AM +0100, Simon Riggs wrote:
> This whole idea of "feature development" vs reliability is bogus. It
> implies people that work on features don't care about reliability. Given
> the fact that many of the features are actually about increasing database
> reliability in the event of crashes and corruptions it just makes no sense.

I'm contrasting work that helps to keep our existing promises ("reliability")
with work that makes new promises ("features").  In software development, we
invariably hazard old promises to make new promises; our success hinges on
electing neither too little nor too much risk.  Two years ago, PostgreSQL's
track record had placed it in a good position to invest in new, high-risk,
high-reward promises.  We did that, and we emerged solvent yet carrying an
elevated debt service ratio.  It's time to reduce risk somewhat.

You write about a different sense of "reliability."  (Had I anticipated this
misunderstanding, I might have written "Restore-probity mode.")  None of this
was about classifying people, most of whom allocate substantial time to each
kind of work.

> How will we participate in cleanup efforts? How do we know when something
> has been "cleaned up", how will we measure our success or failure? I think
> we should be clear that wasting N months on cleanup can *fail* to achieve a
> useful objective. Without a clear plan it almost certainly will do so. The
> flip side is that wasting N months will cause great amusement and dancing
> amongst those people who wish to pull ahead of our open source project and
> we should take care not to hand them a victory from an overreaction.

I agree with all that.  We should likewise take care not to become insolvent
from an underreaction.

> So lets do our normal things, not do a "total stop" for an indefinite
> period. If someone has specific things that in their opinion need to be
> addressed, list them and we can talk about doing them, together.

I recommend these four exit criteria:

1. Non-author committer review of foreign keys locks/multixact durability.
   Done when that committer certifies, as if he were committing the patch
   himself today, that the code will not eat data.

2. Non-author committer review of row-level security.  Done when that
   committer certifies that the code keeps its promises and that the
   documentation bounds those promises accurately.

3. Second committer review of the src/backend/access changes for INSERT ... ON
   CONFLICT DO NOTHING/UPDATE.  (Bugs affecting folks who don't use the new
   syntax are most likely to fall in that portion.)  Unlike the previous two
   criteria, a review without certification is sufficient.

4. Non-author committer certifying that the 9.5 WAL format changes will not
   eat your data.  The patch lists Andres and Alvaro as reviewers; if they
   already reviewed it enough to make that certification, this one is easy.

That ties up four people.  For everyone else:

- Fix bugs those reviews find.  This will start slow but will grow to keep
  everyone busy.  Committers won't certify code, and thus we can't declare
  victory, until these bugs are fixed.  The rest of this list, in contrast,
  calls out topics to sample from, not topics to exhaust.

- Turn current buildfarm members green.

- Write, review and commit more automated test machinery to PostgreSQL.  Test
  whatever excites you.  If you need ideas, Craig posted some good ones
  upthread.  Here are a few more:
  - Add a debug mode that calls sched_yield() in SpinLockRelease(); see
    6322.1406219591@sss.pgh.pa.us.
  - Improve TAP suite (src/test/perl/TestLib.pm) logging.  Currently, these
    suites redirect much output to /dev/null.  Instead, log that output and
    teach the buildfarm to capture the log.
  - Call VALGRIND_MAKE_MEM_NOACCESS() on a shared buffer when its local pin
    count falls to zero.  Under CLOBBER_FREED_MEMORY, wipe a shared buffer
    when its global pin count falls to zero.
  - With assertions enabled, or perhaps in a new debug mode, have
    pg_do_encoding_conversion() and pg_server_to_any() check the data for a
    no-op conversion instead of assuming the data is valid.

- Add buildfarm members.  This entails reporting any bugs that prevent an
  initial passing run.  Once you have a passing run, schedule regular runs.
  Examples of useful additions:
  - "./configure ac_cv_func_getopt_long=no, ac_cv_func_snprintf=no ..." to
    enable all the replacement code regardless of the current platform's need
    for it.  This helps distinguish "Windows bug" from "replacement code bug."
  - --disable-integer-datetimes, --disable-float8-byval,
    --disable-float4-byval, --disable-spinlocks, --disable-atomics,
    --disable-thread-safety, --disable-largefile, #define
    RANDOMIZE_ALLOCATED_MEMORY
  - Any OS or CPU architecture other than x86 GNU/Linux, even ones already
    represented.

- Write, review and commit fixes for the bugs that come to light by way of these new automated tests.

- Anything else targeted to make PostgreSQL keep the promises it has already made to our users.
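
The configure-based buildfarm additions listed above could be assembled into a command line along these lines. This is a hedged sketch only: `replacement_configure_cmd` is a hypothetical helper, combining every switch in a single build is an assumption (a real buildfarm member would more likely pick one or a few), and passing the `#define` through `CPPFLAGS` is one possible way to set it, not something the list prescribes. The function prints the command rather than running it.

```shell
#!/bin/sh
# Compose (but do not run) a configure command line that forces the
# replacement code paths and turns off the optional features named above.
replacement_configure_cmd() {
    srcdir=$1
    printf '%s' "$srcdir/configure"
    # Autoconf cache variables from the list: pretend libc lacks these.
    printf ' %s' ac_cv_func_getopt_long=no ac_cv_func_snprintf=no
    # Feature switches from the list.
    printf ' %s' \
        --disable-integer-datetimes --disable-float8-byval \
        --disable-float4-byval --disable-spinlocks --disable-atomics \
        --disable-thread-safety --disable-largefile
    # Assumption: the #define is supplied via CPPFLAGS.
    printf ' %s\n' 'CPPFLAGS=-DRANDOMIZE_ALLOCATED_MEMORY'
}

cmd=$(replacement_configure_cmd ../postgresql-9.5devel)
printf '%s\n' "$cmd"
```

Printing the command first makes it easy to review or trim the switches before handing the line to a shell from a build directory.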



Re: Restore-reliability mode

From
Michael Paquier
Date:
On Sun, Jun 7, 2015 at 4:58 AM, Noah Misch <noah@leadboat.com> wrote:
> - Write, review and commit more automated test machinery to PostgreSQL.  Test
>   whatever excites you.  If you need ideas, Craig posted some good ones
>   upthread.  Here are a few more:
>   - Improve TAP suite (src/test/perl/TestLib.pm) logging.  Currently, these
>     suites redirect much output to /dev/null.  Instead, log that output and
>     teach the buildfarm to capture the log.

We can capture the logs and redirect them by replacing
system_or_bail() with more calls to IPC::run. That would be a patch
simple enough. pg_rewind's tests should be switched to use that as
well.
-- 
Michael
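
The actual change proposed here would be made in Perl, in TestLib.pm. Purely as an illustration of the idea (keep the output in a log instead of sending it to /dev/null, and still stop on failure), here is a shell analogue. `run_logged_or_bail` is a hypothetical name, and nothing below is taken from the PostgreSQL test suite.

```shell
#!/bin/sh
# Run a command, appending its stdout and stderr to a log file instead of
# discarding them. On failure, print a message to stderr and return 1.
run_logged_or_bail() {
    log=$1
    shift
    printf '# running: %s\n' "$*" >>"$log"
    if ! "$@" >>"$log" 2>&1; then
        printf 'Bail out!  command failed: %s (see %s)\n' "$*" "$log" >&2
        return 1
    fi
}

log=$(mktemp)
run_logged_or_bail "$log" sh -c 'echo to-stdout; echo to-stderr >&2'
```

After the call, both streams of the child command are in the log file, which is the kind of artifact a buildfarm client could then collect.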



Re: [CORE] Restore-reliability mode

From
Robert Haas
Date:
On Sat, Jun 6, 2015 at 12:33 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
> Robert Haas <robertmhaas@gmail.com> wrote:
>> Tom, for example, has previously not wanted to even bump
>> catversion after beta1, which rules out a huge variety of
>> possible fixes and interface changes.  If we want to make a
>> policy decision to change our approach, we should be up-front
>> about that.
>
> What?!?  There have been catversion bumps between the REL?_?_BETA1
> tag and the REL?_?_0 tag for 8.2, 8.3, 9.0, 9.1, 9.3, and 9.4.
> (That is, it has happened on 6 of the last 8 releases.)  I don't
> think we're talking about any policy change here.  We try to avoid
> a catversion bump after beta if we can; we're not that reluctant to
> do so if needed.

Perhaps we're honoring this more in the breach than in the observance,
but I'm not making up what Tom has said about this:

http://www.postgresql.org/message-id/27310.1251410965@sss.pgh.pa.us
http://www.postgresql.org/message-id/19174.1299782543@sss.pgh.pa.us
http://www.postgresql.org/message-id/3413.1301154369@sss.pgh.pa.us
http://www.postgresql.org/message-id/3261.1401915832@sss.pgh.pa.us

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [CORE] Restore-reliability mode

From
Peter Geoghegan
Date:
On Sat, Jun 6, 2015 at 7:07 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Perhaps we're honoring this more in the breach than in the observance,
> but I'm not making up what Tom has said about this:
>
> http://www.postgresql.org/message-id/27310.1251410965@sss.pgh.pa.us
> http://www.postgresql.org/message-id/19174.1299782543@sss.pgh.pa.us
> http://www.postgresql.org/message-id/3413.1301154369@sss.pgh.pa.us
> http://www.postgresql.org/message-id/3261.1401915832@sss.pgh.pa.us

Of course, not doing a catversion bump after beta1 doesn't necessarily
have much value in and of itself. *Promising* to not do a catversion
bump, and then usually keeping that promise definitely has a certain
value, but clearly we are incapable of that.

-- 
Peter Geoghegan



Re: [CORE] Restore-reliability mode

From
"Joshua D. Drake"
Date:
On 06/06/2015 07:14 PM, Peter Geoghegan wrote:
>
> On Sat, Jun 6, 2015 at 7:07 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Perhaps we're honoring this more in the breach than in the observance,
>> but I'm not making up what Tom has said about this:
>>
>> http://www.postgresql.org/message-id/27310.1251410965@sss.pgh.pa.us
>> http://www.postgresql.org/message-id/19174.1299782543@sss.pgh.pa.us
>> http://www.postgresql.org/message-id/3413.1301154369@sss.pgh.pa.us
>> http://www.postgresql.org/message-id/3261.1401915832@sss.pgh.pa.us
>
> Of course, not doing a catversion bump after beta1 doesn't necessarily
> have much value in and of itself. *Promising* to not do a catversion
> bump, and then usually keeping that promise definitely has a certain
> value, but clearly we are incapable of that.
>

It seems to me that a cat bump during Alpha or Beta should be absolutely 
fine and reservedly fine respectively. Where we should absolutely not 
cat bump unless there is absolutely no other choice is during an RC.

JD

-- 
Command Prompt, Inc. - http://www.commandprompt.com/  503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Announcing "I'm offended" is basically telling the world you can't
control your own emotions, so everyone else should do it for you.



Re: [CORE] Restore-reliability mode

From
Kevin Grittner
Date:
Joshua D. Drake <jd@commandprompt.com> wrote:
> On 06/06/2015 07:14 PM, Peter Geoghegan wrote:
>> On Sat, Jun 6, 2015 at 7:07 PM, Robert Haas <robertmhaas@gmail.com> wrote:

>>> Perhaps we're honoring this more in the breach than in the
>>> observance, but I'm not making up what Tom has said about this:
>>>
>>> http://www.postgresql.org/message-id/27310.1251410965@sss.pgh.pa.us

That's 9.0 release discussion:

| I think that the traditional criterion is that we don't release beta1
| as long as there are any known issues that might force an initdb.
| We were successful in avoiding a post-beta initdb this time, although
| IIRC the majority of release cycles have had one --- so maybe you
| could argue that that's not so important.  It would certainly be
| less important if we had working pg_migrator functionality to ease
| the pain of going from beta to final.

>>> http://www.postgresql.org/message-id/19174.1299782543@sss.pgh.pa.us

That's 9.1 release discussion:

| Historically we've declared it beta once we think we are done with
| initdb-forcing problems.

| In any case, the existence of pg_upgrade means that "might we need
| another initdb?" is not as strong a consideration as it once was, so
| I'm not sure if we should still use that as a criterion.  I don't know
| quite what "ready for beta" should mean otherwise, though.

>>> http://www.postgresql.org/message-id/3413.1301154369@sss.pgh.pa.us

Also 9.1, it is listed as one criterion:

| * No open issues that are expected to result in a catversion bump.
| (With pg_upgrade, this is not as critical as it used to be, but
| I still think catalog stability is a good indicator of a release's
| maturity)

>>> http://www.postgresql.org/message-id/3261.1401915832@sss.pgh.pa.us

Here we jump to 9.4 discussion:

| > Agreed. Additionally I also agree with Stefan that the price of a initdb
| > during beta isn't that high these days.
|
| Yeah, if nothing else it gives testers another opportunity to exercise
| pg_upgrade ;-).  The policy about post-beta1 initdb is "avoid if
| practical", not "avoid at all costs".

So I think these examples show that the policy has shifted from a
pretty strong requirement to "it's probably nice if" status, with
some benefits seen in pg_upgrade testing to actually having a bump.

>> Of course, not doing a catversion bump after beta1 doesn't necessarily
>> have much value in and of itself. *Promising* to not do a catversion
>> bump, and then usually keeping that promise definitely has a certain
>> value, but clearly we are incapable of that.

As someone who was able to bring up a new production application on
8.2 because it was all redundant data and not yet mission-critical,
I appreciate that in very rare circumstances that combination could
have benefit.  But really, how often are people in that position?

> It seems to me that a cat bump during Alpha or Beta should be absolutely
> fine and reservedly fine respectively. Where we should absolutely not
> cat bump unless there is absolutely no other choice is during an RC.

+1 on all of that.  And for a while now we've been talking about an
alpha test release, right?

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [CORE] Restore-reliability mode

From
Jeff Janes
Date:
On Sat, Jun 6, 2015 at 7:35 AM, Geoff Winkless <pgsqladmin@geoff.dj> wrote:
> On 6 June 2015 at 13:41, Sehrope Sarkuni <sehrope@jackdb.com> wrote:
> > It's much easier to work into dev/test setups if there are system
> > packages as it's just a config change to an existing script. Building
> > from source would require a whole new workflow that I don't have time
> > to incorporate.
>
> Really? You genuinely don't have time to paste, say:
>
> mkdir -p ~/src/pgdevel
> cd ~/src/pgdevel
> wget https://ftp.postgresql.org/pub/snapshot/dev/postgresql-snapshot.tar.bz2
> tar xjf postgresql-snapshot.tar.bz2
> mkdir bld
> cd bld
> ../postgresql-9.5devel/configure $(pg_config --configure | sed -e 's/\(pg\|postgresql[-\/]\)\(doc-\)\?9\.[0-9]*\(dev\)\?/\1\29.5dev/g')
> make world
> make check
> make world-install

I think this is rather uncharitable.  You don't include yum, zypper, or apt-get anywhere in there, and I vaguely recall it took me quite a bit of time to install all the prereqs in order to get it to compile several years ago when I first started trying to compile it.  And then even more time to get it to compile with several of the config flags I wanted.  And then even more time to get the docs to compile.

And now after I got all of that, when I try your code, it still doesn't work.  $(pg_config ....) seems to not quote things the way that configure wants them quoted, or something.  And the package from which I was using pg_config uses more config options than I was set up for when compiling myself, so I either have to manually remove the flags, or find more dependencies (pam, xslt, ossp-uuid, tcl, tcl-dev, and counting).  This is not very fun, and I didn't even need to get bureaucratic approval to do any of this stuff, like a lot of people do.

And then when I try to install it, it tries to overwrite some of the files which were initially installed by the package manager in /usr/lib.  That doesn't seem good.

And how do I, as a hypothetical package manager user, start this puppy up?  Where is pg_ctlcluster?  How does one do pg_upgrade between a package-controlled data directory and this new binary?

And then when I find a bug, how do I know it is a bug and not me doing something wrong in the build process, and getting the wrong .so to load with the wrong binary or something like that?

> and yet you think you have enough time to provide more than a "looks like it's working" report to the developers?

If it isn't working, reports that it isn't working, with error messages, are pretty helpful and don't take a whole lot of time.  If it is working, reports of that probably aren't terribly helpful without putting a lot more work into it.  But people might be willing to put a lot of work into, say, performance regression testing, if that is their area of expertise, if they don't also have to put a lot of work into getting the software to compile in the first place, which is not their area.

Now I don't see a lot of evidence of beta testing from the public (i.e. unfamiliar names) on -hackers and -bugs lists.  But a lot of hackers report things that "a client" or "someone on IRC" reported to them, so I'm willing to believe that a lot of useful beta testing does go on, even though I don't directly see it, and if there were alpha releases, why wouldn't it extend to that?

> (NB the sed for the pg_config line will probably need work, it looks like it should work on the two types of system I have here but I have to admit I changed the config line manually when I built it)

Right, and are the people who use apt-get to install everything likely to know how to do that work?

Cheers,

Jeff
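
One plausible cause of the quoting problem described above is that `pg_config --configure` prints each argument wrapped in single quotes, and an unquoted `$(...)` only splits that text on whitespace without interpreting the quotes. The sketch below shows the difference on an invented sample string. Re-parsing with `eval` is offered as an assumption about what would help, not as a tested fix for the recipe, and `eval` should only ever be used on text you trust.

```shell
#!/bin/sh
# Invented sample of pg_config --configure style output: every original
# argument is wrapped in single quotes.
sample="'--prefix=/usr/lib/postgresql/9.4' '--with-perl' 'CFLAGS=-O2 -g'"

# Unquoted expansion: word splitting only, the quote characters stay literal
# and the CFLAGS argument is split in two.
set -- $sample
naive_count=$#
naive_first=$1

# eval re-parses the text as shell words, so the quotes are honoured.
eval "set -- $sample"
eval_count=$#
eval_third=$3

printf 'naive: %s args, first=%s\n' "$naive_count" "$naive_first"
printf 'eval:  %s args, third=%s\n' "$eval_count" "$eval_third"
```

With this sample the naive form yields four words, the first still carrying its literal quote characters, while the eval form yields the three intended arguments. In the recipe that would correspond to something like `eval "set -- $(pg_config --configure | sed ...)"` followed by `../postgresql-9.5devel/configure "$@"`.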

Re: [CORE] Restore-reliability mode

From
Alvaro Herrera
Date:
Joshua D. Drake wrote:
> 
> On 06/05/2015 08:07 PM, Bruce Momjian wrote:
> 
> >> From my side, it is only recently I got some clear answers to my questions
> >>about how it worked. I think it is very important that major features have
> >>extensive README type documentation with them so the underlying principles used
> >>in the development are clear. I would define the measure of a good feature as
> >>whether another committer can read the code comments and get a good feel. A bad
> >>feature is one where committers walk away from it, saying I don't really get it
> >>and I can't read an explanation of why it does that. Tom's most significant
> >>contribution is his long descriptive comments on what the problem is that need
> >>to be solved, the options and the method chosen. Clarity of thought is what
> >>solves bugs.
> >
> >Yes, I think we should have done that early-on for multi-xact, and I am
> >hopeful we will learn to do that more often when complex features are
> >implemented, or when we identify areas that are more complex than we
> >thought.
> 
> I see this idea of the README as very useful. There are far more people like
> me in this community than Simon or Alvaro. I can test, I can break things, I
> can script up a harness but I need to understand HOW and the README would
> help allow for that.

There is a src/backend/access/README.tuplock that attempts to describe
multixacts.  Is that not sufficient?

Now that I think about it, this file hasn't been updated with the latest
changes, so it's probably a bit outdated now.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Restore-reliability mode

From
Peter Geoghegan
Date:
On Sat, Jun 6, 2015 at 12:58 PM, Noah Misch <noah@leadboat.com> wrote:
>   - Call VALGRIND_MAKE_MEM_NOACCESS() on a shared buffer when its local pin
>     count falls to zero.  Under CLOBBER_FREED_MEMORY, wipe a shared buffer
>     when its global pin count falls to zero.

Did a patch for this ever materialize?


-- 
Peter Geoghegan



Re: [CORE] Restore-reliability mode

From
David Gould
Date:
I think Alphas are valuable and useful and even more so if they have release
notes. For example, some of my clients are capable of fetching sources and
building from scratch and filing bug reports and are often interested in
particular new features. They even have staging infrastructure that could
test new postgres releases with real applications. But they don't do it.
They also don't follow -hackers, they don't track git, and they don't have
any easy way to tell if the new feature they are interested in is
actually complete and ready to test at any particular time. A lot of
features are developed in multiple commits over a period of time and they
see no point in testing until at least most of the feature is complete and
expected to work. But it is not obvious from outside when that happens for
any given feature. For my clients the value of Alpha releases would
mainly be the release notes, or some other mark in the sand that says "As of
Alpha-3 feature X is included and expected to mostly work."

-dg

-- 
David Gould                                   daveg@sonic.net
If simplicity worked, the world would be overrun with insects.



Re: [CORE] Restore-reliability mode

From
Geoff Winkless
Date:
Among several others, on 8 June 2015 at 13:59, David Gould <daveg@sonic.net> wrote:
> I think Alphas are valuable and useful and even more so if they have release
> notes. For example, some of my clients are capable of fetching sources and
> building from scratch and filing bug reports and are often interested in
> particular new features. They even have staging infrastructure that could
> test new postgres releases with real applications. But they don't do it.
> They also don't follow -hackers, they don't track git, and they don't have
> any easy way to tell if the new feature they are interested in is
> actually complete and ready to test at any particular time. A lot of
> features are developed in multiple commits over a period of time and they
> see no point in testing until at least most of the feature is complete and
> expected to work. But it is not obvious from outside when that happens for
> any given feature. For my clients the value of Alpha releases would
> mainly be the release notes, or some other mark in the sand that says "As of
> Alpha-3 feature X is included and expected to mostly work."

Wow! I never knew there were all these people out there who would be rushing to help test if only the PG developers released alpha versions. It's funny how they never used to do it when those alphas were done.

I say again: in my experience you don't get useful test reports from people who aren't able or prepared to compile software; what you do get is lots of unrelated and/or unhelpful noise in the mailing list. That may be harsh or unfair or whatever, it's just my experience.

I guess the only thing we can do is see who's right. I'm simply trying to point out that it's not the zero-cost exercise that everyone appears to think that it is.

Geoff

Re: [CORE] Restore-reliability mode

From
Robert Haas
Date:
On Mon, Jun 8, 2015 at 9:21 AM, Geoff Winkless <pgsqladmin@geoff.dj> wrote:
> Wow! I never knew there were all these people out there who would be rushing
> to help test if only the PG developers released alpha versions. It's funny
> how they never used to do it when those alphas were done.

That's probably overplaying your hand a little bit (and it sounds a
bit catty, too).  Some testing got done and it had some value.  It
just wasn't enough to make Peter feel like it was worthwhile.  That
doesn't mean that no testing got done and that it had no value, or
that the same thing would happen this time.  I'm as skeptical about
this whole rush-out-an-alpha business as anyone, but I think that
skepticism has to yield to contrary evidence, and people saying "I
would test if..." is legitimate contrary evidence.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [CORE] Restore-reliability mode

From
"Joshua D. Drake"
Date:
On 06/08/2015 06:21 AM, Geoff Winkless wrote:

>
> Wow! I never knew there were all these people out there who would be
> rushing to help test if only the PG developers released alpha versions.
> It's funny how they never used to do it when those alphas were done.

The type of responses you are providing on this thread are not warranted.

JD

-- 
The most kicking donkey PostgreSQL Infrastructure company in existence.
The oldest, the most experienced, the consulting company to the stars.
Command Prompt, Inc. http://www.commandprompt.com/ +1 -503-667-4564 -
24x7 - 365 - Proactive and Managed Professional Services!



Re: [CORE] Restore-reliability mode

From
Petr Jelinek
Date:
On Mon, Jun 8, 2015 at 5:01 , Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Jun 8, 2015 at 9:21 AM, Geoff Winkless <pgsqladmin@geoff.dj> wrote:
>> Wow! I never knew there were all these people out there who would be rushing
>> to help test if only the PG developers released alpha versions. It's funny
>> how they never used to do it when those alphas were done.
> 
> That's probably overplaying your hand a little bit (and it sounds a
> bit catty, too).  Some testing got done and it had some value.  It
> just wasn't enough to make Peter feel like it was worthwhile.  That
> doesn't mean that no testing got done and that it had no value, or
> that the same thing would happen this time.  I'm as skeptical about
> this whole rush-out-an-alpha business as anyone, but I think that
> skepticism has to yield to contrary evidence, and people saying "I
> would test if..." is legitimate contrary evidence.


Agreed.

To get back to the point, I think the problem with original alphas was 
that they were after CF snapshots, not something that represented the 
final release.

I do think that proper alpha/beta release is signal for several 
companies (I do know some that do testing once beta gets out) to do 
testing as it does indeed say that we are releasing something that is 
close in functionality to the final release.

Also the packages are really important, there are enough companies that 
don't install development packages to servers at all so it's not just 
compile and run for them, they have to move it over to other machines, 
etc. We should be lowering the barrier to user based testing as much as 
possible and doing alpha with packages is exactly how we do that.

IMHO the only real discussion here is if current 9.5 is ready for user 
testing and FWIW I think it is.


-- 
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [CORE] Restore-reliability mode

From
Geoff Winkless
Date:
On 8 June 2015 at 16:01, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Jun 8, 2015 at 9:21 AM, Geoff Winkless <pgsqladmin@geoff.dj> wrote:
> > Wow! I never knew there were all these people out there who would be rushing
> > to help test if only the PG developers released alpha versions. It's funny
> > how they never used to do it when those alphas were done.
>
> That's probably overplaying your hand a little bit (and it sounds a
> bit catty, too).

I agree. The responses I had written yesterday but didn't send were much worse.

Mainly because I think it's quite an attitude to take that open-source developers should put extra time into building RPMs of development versions rather than testers waiting 5 minutes while their machines compile. Ohmygosh, you have to rpm install a bunch of -devel stuff? What a massive hardship.

On 8 June 2015 at 16:06, Joshua D. Drake <jd@commandprompt.com> wrote:
> The type of responses you are providing on this thread are not warranted.

I got people appearing completely insulted at my remarks and telling me that if only they could run the alpha they would provide testing, so I pointed out how easy it is to install the nightly from source and then they tell me that actually compiling is far too difficult and complicated, and that there are loads of clients who would run these nightlies if they had RPMS...

If I truly believed that such an RPM would produce useful testing, I would spend some of my own time building a setup to produce those RPMs myself and post here publicising them, at which point we would have a huge number of useful and productive test reports. Any one of the people telling me that I'm wrong could easily do the same, but so far none has.

I'm not harping on because I want to make people feel bad, I'm harping on because I don't want to see beta (and final) releases pushed back further because of a bad compromise, and I believe that that will happen. I apologise that I've clearly upset some people but they all have a very easy route to prove me wrong, and I'll be happy to admit my error.

Geoff

Re: [CORE] Restore-reliability mode

From
Claudio Freire
Date:
On Mon, Jun 8, 2015 at 12:22 PM, Geoff Winkless <pgsqladmin@geoff.dj> wrote:
> On 8 June 2015 at 16:01, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> On Mon, Jun 8, 2015 at 9:21 AM, Geoff Winkless <pgsqladmin@geoff.dj> wrote:
>> > Wow! I never knew there were all these people out there who would be rushing
>> > to help test if only the PG developers released alpha versions. It's funny
>> > how they never used to do it when those alphas were done.
>>
>> That's probably overplaying your hand a little bit (and it sounds a
>> bit catty, too).
>
>
> I agree. The responses I had written yesterday but didn't send were much
> worse.
>
> Mainly because I think it's quite an attitude to take that open-source
> developers should put extra time into building RPMs of development versions
> rather than testers waiting 5 minutes while their machines compile.
> Ohmygosh, you have to rpm install a bunch of -devel stuff? What a massive
> hardship.

It's not about the 5 minutes of compile time, it's about the signalling.

Just *when* is git ready for testing? You don't know from the outside.

I do lurk here a lot and still am unsure quite often.

Even simply releasing an alpha *tarball* would be useful enough. What
is needed is the signal to test, rather than a fully-built package.



Re: [CORE] Restore-reliability mode

From
Geoff Winkless
Date:
On 8 June 2015 at 17:03, Claudio Freire <klaussfreire@gmail.com> wrote:
> It's not about the 5 minutes of compile time, it's about the signalling.
>
> Just *when* is git ready for testing? You don't know from the outside.
>
> I do lurk here a lot and still am unsure quite often.
>
> Even simply releasing an alpha *tarball* would be useful enough. What
> is needed is the signal to test, rather than a fully-built package.

I can see that, and can absolutely get behind the idea of a nightly being flagged as an alpha, since it should involve next to no developer time.

I may be overestimating the amount of time that goes towards producing a release; the fact that the full-on alpha releases were stopped did imply to me that it's not insignificant.

Geoff

Re: [CORE] Restore-reliability mode

From
"David G. Johnston"
Date:
On Mon, Jun 8, 2015 at 12:03 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> Just *when* is git ready for testing? You don't know from the outside.
>
> I do lurk here a lot and still am unsure quite often.
>
> Even simply releasing an alpha *tarball* would be useful enough. What
> is needed is the signal to test, rather than a fully-built package.

IIUC the master branch is always ready for testing.

I do not think the project cares whether everyone is testing the exact same codebase; as long as test findings include the relevant commit hash the results will be informative.

David J.

Re: [CORE] Restore-reliability mode

From
"David G. Johnston"
Date:
On Mon, Jun 8, 2015 at 12:14 PM, Geoff Winkless <pgsqladmin@geoff.dj> wrote:
> On 8 June 2015 at 17:03, Claudio Freire <klaussfreire@gmail.com> wrote:
> > It's not about the 5 minutes of compile time, it's about the signalling.
> >
> > Just *when* is git ready for testing? You don't know from the outside.
> >
> > I do lurk here a lot and still am unsure quite often.
> >
> > Even simply releasing an alpha *tarball* would be useful enough. What
> > is needed is the signal to test, rather than a fully-built package.
>
> I can see that, and can absolutely get behind the idea of a nightly being
> flagged as an alpha, since it should involve next to no developer time.

Nightly where?  This is an international community.

The tip of the master branch is the current "alpha" - so the question is whether a tar bundle should be provided instead of asking people to simply keep their Git clone up-to-date.  These both have the flaw of excluding people who would test the application if it could simply be installed like any other package on their system.  But I'm not seeing where there would be a huge group of people who would test an automatically generated source tar-ball but would not be willing to use Git.  Or are we talking about a non-source tar-ball?

Maybe packagers could be convinced to bundle up the master branch on a monthly basis and simply call it Master-SNAPSHOT.  No alpha, no beta, no version number.  I've never packaged before so I don't know, but while the project should encourage this, as things currently stand the core project is doing its job by ensuring that the tip of master is always in a usable state.

Or, whenever a new patch release goes out packagers can also bundle up the current master at the same time.

David J.


Re: [CORE] Restore-reliability mode

From
Andres Freund
Date:
On 2015-06-08 12:16:34 -0400, David G. Johnston wrote:
> IIUC the master branch is always ready for testing.

I don't really think so. For one we often find bugs ourselves quite
quickly.

But more importantly, I'd much rather have users use their precious (and
thus limited!) time to test when the set of features (not every detail
of a feature) is mostly set in stone. There's not much point in doing
in-depth testing before that. Similarly it's not particularly worthwhile
to test while the buildfarm still shows failures on common platforms.

Andres



Re: [CORE] Restore-reliability mode

From
Stephen Frost
Date:
David,

* David G. Johnston (david.g.johnston@gmail.com) wrote:
> On Mon, Jun 8, 2015 at 12:03 PM, Claudio Freire <klaussfreire@gmail.com>
> wrote:
> > Just *when* is git ready for testing? You don't know from the outside.
> >
> > I do lurk here a lot and still am unsure quite often.
> >
> > Even simply releasing an alpha *tarball* would be useful enough. What
> > is needed is the signal to test, rather than a fully-built package.
> >
> >
> IIUC the master branch is always ready for testing.
>
> I do not think the project cares whether everyone is testing the exact
> same codebase; as long as test findings include the relevant commit hash
> the results will be informative.

For my 2c, I do believe it's useful for projects which are based on PG
or which work with PG to have a 'alpha1' tag to refer to.  Asking users
to test with git hash XYZABC isn't great.  Getting more users of
applications which use PG to do testing is, in my view at least, a great
way to improve our test coverage and I do think having an alpha will
help with that.

That said, I'm not pushing to have one released this week or before
PGCon or any such- let's get the back-branch releases dealt with and
then we can get an alpha out.

Thanks!
    Stephen

Re: [CORE] Restore-reliability mode

From
Alvaro Herrera
Date:
David G. Johnston wrote:
> On Mon, Jun 8, 2015 at 12:14 PM, Geoff Winkless <pgsqladmin@geoff.dj> wrote:

> > I can see that, and can absolutely get behind the idea of a nightly being
> > flagged as an alpha, since it should involve next to no developer time.
> >
> Nightly where?  This is an international community.

A "nightly" refers to our development snapshots, which are uploaded to
the ftp servers every "night" (according to some timezone).  You can
find them in pub/snapshot/ for each branch.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [CORE] Restore-reliability mode

From
Bruce Momjian
Date:
On Mon, Jun  8, 2015 at 12:32:45PM -0400, David G. Johnston wrote:
> On Mon, Jun 8, 2015 at 12:14 PM, Geoff Winkless <pgsqladmin@geoff.dj> wrote:
> 
>     On 8 June 2015 at 17:03, Claudio Freire <klaussfreire@gmail.com> wrote:
> 
>         It's not about the 5 minutes of compile time, it's about the
>         signalling.
> 
>         Just *when* is git ready for testing? You don't know from the outside.
> 
>         I do lurk here a lot and still am unsure quite often.
> 
>         Even simply releasing an alpha *tarball* would be useful enough. What
>         is needed is the signal to test, rather than a fully-built package.
>
>
>     I can see that, and can absolutely get behind the idea of a nightly being
>     flagged as an alpha, since it should involve next to no developer time.
>
>
>
> Nightly where?  This is an international community.

The daily snapshot tarballs are built in a way to minimize the number of
development tools required:
http://www.postgresql.org/ftp/snapshot/dev/

These would be easier to use than pulling from git.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +



Re: [CORE] Restore-reliability mode

From
Magnus Hagander
Date:
On Mon, Jun 8, 2015 at 7:01 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> David G. Johnston wrote:
> > On Mon, Jun 8, 2015 at 12:14 PM, Geoff Winkless <pgsqladmin@geoff.dj> wrote:
>
> > > I can see that, and can absolutely get behind the idea of a nightly being
> > > flagged as an alpha, since it should involve next to no developer time.
> > >
> > Nightly where?  This is an international community.
>
> A "nightly" refers to our development snapshots, which are uploaded to
> the ftp servers every "night" (according to some timezone).  You can
> find them in pub/snapshot/ for each branch.

Snapshots are actually not nightly anymore, and haven't been for some time. They are currently run every 4 hours, and are uploaded to the ftp server once a buildfarm run (on debian x64) finishes.

--

Re: Restore-reliability mode

From
Bruce Momjian
Date:
On Sat, Jun  6, 2015 at 03:58:05PM -0400, Noah Misch wrote:
> On Fri, Jun 05, 2015 at 08:25:34AM +0100, Simon Riggs wrote:
> > This whole idea of "feature development" vs reliability is bogus. It
> > implies people that work on features don't care about reliability. Given
> > the fact that many of the features are actually about increasing database
> > reliability in the event of crashes and corruptions it just makes no sense.
> 
> I'm contrasting work that helps to keep our existing promises ("reliability")
> with work that makes new promises ("features").  In software development, we
> invariably hazard old promises to make new promises; our success hinges on
> electing neither too little nor too much risk.  Two years ago, PostgreSQL's
> track record had placed it in a good position to invest in new, high-risk,
> high-reward promises.  We did that, and we emerged solvent yet carrying an
> elevated debt service ratio.  It's time to reduce risk somewhat.
> 
> You write about a different sense of "reliability."  (Had I anticipated this
> misunderstanding, I might have written "Restore-probity mode.")  None of this
> was about classifying people, most of whom allocate substantial time to each
> kind of work.
> 
> > How will we participate in cleanup efforts? How do we know when something
> > has been "cleaned up", how will we measure our success or failure? I think
> > we should be clear that wasting N months on cleanup can *fail* to achieve a
> > useful objective. Without a clear plan it almost certainly will do so. The
> > flip side is that wasting N months will cause great amusement and dancing
> > amongst those people who wish to pull ahead of our open source project and
> > we should take care not to hand them a victory from an overreaction.
> 
> I agree with all that.  We should likewise take care not to become insolvent
> from an underreaction.

I understand the overreaction/underreaction debate.  Here were my goals
in this discussion:

1.  stop worrying about the 9.5 timeline so we could honestly assess our
    software - *done*
2.  seriously address multi-xact issues without 9.5/commit-fest pressure -
    *in process*
3.  identify any other areas in need of serious work

While I like the list you provided, I don't think we can be effective in
an environment where we assume every big new feature will have problems
like multi-xact.  For example, we have not seen destabilization from any
major 9.4 features, that I can remember anyway.

Unless there is consensus about new areas for #3, I am thinking we will
continue looking at multi-xact until we are happy, then move ahead with
9.5 items in the way we have before.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +



Re: Restore-reliability mode

From
Andres Freund
Date:
On 2015-06-08 13:44:05 -0400, Bruce Momjian wrote:
> I understand the overreaction/underreaction debate.  Here were my goals
> in this discussion:
> 
> 1.  stop worry about the 9.5 timeline so we could honestly assess our
>     software - *done*
> 2.  seriously address multi-xact issues without 9.5/commit-fest pressure -
>     *in process*
> 3.  identify any other areas in need of serious work
> 
> While I like the list you provided, I don't think we can be effective in
> an environment where we assume every big new features will have problems
> like multi-xact.  For example, we have not seen destabilization from any
> major 9.4 features, that I can remember anyway.
> 
> Unless there is consensus about new areas for #3, I am thinking we will
> continue looking at multi-xact until we are happy, then move ahead with
> 9.5 items in the way we have before.

I think one important part is that we (continue to?) regularly tell our
employers that work on pre-commit, post-commit review, and refactoring
are critical for their long term business prospects.  My impression so
far is that the employer side hasn't widely realized that fact, and
that many contributors do the review etc. part in their spare time.

Andres



Re: Restore-reliability mode

From
Bruce Momjian
Date:
On Mon, Jun  8, 2015 at 07:48:36PM +0200, Andres Freund wrote:
> On 2015-06-08 13:44:05 -0400, Bruce Momjian wrote:
> > I understand the overreaction/underreaction debate.  Here were my goals
> > in this discussion:
> > 
> > 1.  stop worry about the 9.5 timeline so we could honestly assess our
> >     software - *done*
> > 2.  seriously address multi-xact issues without 9.5/commit-fest pressure -
> >     *in process*
> > 3.  identify any other areas in need of serious work
> > 
> > While I like the list you provided, I don't think we can be effective in
> > an environment where we assume every big new features will have problems
> > like multi-xact.  For example, we have not seen destabilization from any
> > major 9.4 features, that I can remember anyway.
> > 
> > Unless there is consensus about new areas for #3, I am thinking we will
> > continue looking at multi-xact until we are happy, then move ahead with
> > 9.5 items in the way we have before.
> 
> I think one important part is that we (continue to?) regularly tell our
> employers that work on pre-commit, post-commit review, and refactoring
> are critical for their long term business prospects.  My impression so
> far is that that the employer side hasn't widely realized that fact, and
> that many contributors do the review etc. part in their spare time.

Agreed.  My bet is that more employers realize it now than they did a
few months ago --- kind of hard to miss all those minor releases and
customer complaints.  :-(

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +



Re: [CORE] Restore-reliability mode

From
Gavin Flower
Date:
On 09/06/15 00:59, David Gould wrote:
> I think Alphas are valuable and useful and even more so if they have release
> notes. For example, some of my clients are capable of fetching sources and
> building from scratch and filing bug reports and are often interested in
> particular new features. They even have staging infrastructure that could
> test new postgres releases with real applications. But they don't do it.
> They also don't follow -hackers, they don't track git, and they don't have
> any easy way to tell if the new feature they are interested in is
> actually complete and ready to test at any particular time. A lot of
> features are developed in multiple commits over a period of time and they
> see no point in testing until at least most of the feature is complete and
> expected to work. But it is not obvious from outside when that happens for
> any given feature. For my clients the value of Alpha releases would
> mainly be the release notes, or some other mark in the sand that says "As of
> Alpha-3 feature X is included and expected to mostly work."
>
> -dg
>

RELEASE NOTES

I think that having:
1. release notes
2. an Alpha people can simply install without having to compile

Would encourage more people to get involved.  Such people would be 
unlikely to have the time and inclination to use 'nightlies', even if 
compiling was not required.

I have read other posts in this thread, that support the above.

Surely, it would be good for pg to have some more people checking 
quality at an earlier stage?  So reducing barriers to do so is a good thing?


Cheers,
Gavin



Re: [CORE] Restore-reliability mode

From
David Gould
Date:
On Mon, 8 Jun 2015 13:03:56 -0300
Claudio Freire <klaussfreire@gmail.com> wrote:

> > Ohmygosh, you have to rpm install a bunch of -devel stuff? What a massive
> > hardship.
> 
> It's not about the 5 minutes of compile time, it's about the signalling.
> 
> Just *when* is git ready for testing? You don't know from the outside.
> 
> I do lurk here a lot and still am unsure quite often.
> 
> Even simply releasing an alpha *tarball* would be useful enough. What
> is needed is the signal to test, rather than a fully-built package.

This. The clients I referred to earlier don't even use the rpm packages,
they build from sources. They need to know when it is worthwhile to take a
new set of sources and test. Some sort of labeling about what the contents
are would enable them to do this.

I don't think a monthly snapshot would work as well as the requirement is
knowing that "grouping sets are in" not that "it is July now".

-dg

-- 
David Gould                                    daveg@sonic.net
If simplicity worked, the world would be overrun with insects.



Re: Restore-reliability mode

From
Noah Misch
Date:
On Wed, Jun 03, 2015 at 04:18:37PM +0200, Andres Freund wrote:
> On 2015-06-03 09:50:49 -0400, Noah Misch wrote:
> > Second, I would define the subject matter as "bug fixes, testing and
> > review", not "restructuring, testing and review."  Different code
> > structures are clearest to different hackers.  Restructuring, on
> > average, adds bugs even more quickly than feature development adds
> > them.
> 
> I can't agree with this. While I agree with not doing large
> restructuring for 9.5, I think we can't afford not to refactor for
> clarity, even if that introduces bugs. Noticeable parts of our code have
> to frequently be modified for new features and are badly structured at
> the same time. While restructuring may temporarily increase the
> number of bugs in the short term, it'll decrease the number of bugs long
> term while increasing the number of potential contributors and new
> features.  That's obviously not to say we should just refactor for the
> sake of it.

I think I agree with everything after your first sentence.  I liked your
specific proposal to split StartupXLOG(), but making broad-appeal
restructuring proposals is hard.  I doubt we would get good results by casting
a wide net for restructuring ideas.  Automated testing has a lower barrier to
entry and is far less liable to make things worse instead of better.  I can
hope for good results from a TestSuiteFest, but not from a RestructureFest.
That said, if folks initiate compelling restructure proposals, we should be
willing to risk bugs from them like we risk bugs to acquire new features.



Re: Restore-reliability mode

From
Andres Freund
Date:
On 2015-06-10 01:57:22 -0400, Noah Misch wrote:
> I think I agree with everything after your first sentence.  I liked your
> specific proposal to split StartupXLOG(), but making broad-appeal
> restructuring proposals is hard.  I doubt we would get good results by casting
> a wide net for restructuring ideas.

I'm not meaning that we should actively strive to find as many things to
refactor as possible (yes, over-emphasized a bit). But that we shouldn't
skip refactoring if we notice something structurally bad, just because
it's been that way and we don't want to touch something "working". That
argument has e.g. been made repeatedly for xlog.c contents.

My feeling is that we're reaching the stage where a significant number
of bugs are added because code is structured "needlessly" complicated
and/or repetitive. And better testing can only catch so much - often
enough somebody has to think of all the possible corner cases.

> Automated testing has a lower barrier to
> entry and is far less liable to make things worse instead of better.  I can
> hope for good results from a TestSuiteFest, but not from a RestructureFest.
> That said, if folks initiate compelling restructure proposals, we should be
> willing to risk bugs from them like we risk bugs to acquire new
> features.

Sure, increasing testing and reviews are good independently. And
especially testing actually makes refactoring much more realistic.

Greetings,

Andres Freund



Re: Restore-reliability mode

From
Alvaro Herrera
Date:
Peter Geoghegan wrote:
> On Sat, Jun 6, 2015 at 12:58 PM, Noah Misch <noah@leadboat.com> wrote:
> >   - Call VALGRIND_MAKE_MEM_NOACCESS() on a shared buffer when its local pin
> >     count falls to zero.  Under CLOBBER_FREED_MEMORY, wipe a shared buffer
> >     when its global pin count falls to zero.
>
> Did a patch for this ever materialize?

I think the first part would be something like the attached.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment
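
The attached patch is not reproduced in this archive. As a rough illustration of the idea only, and not of the patch itself, the following is a minimal sketch of marking a buffer inaccessible to Valgrind once the local pin count falls to zero. ToyBuffer, toy_pin, toy_unpin, TOY_BLCKSZ and the USE_VALGRIND switch are hypothetical names invented for this sketch; only VALGRIND_MAKE_MEM_NOACCESS and VALGRIND_MAKE_MEM_DEFINED are real Memcheck client requests, and they are stubbed out as no-ops when the Valgrind headers are not in use.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Assumption: compile with -DUSE_VALGRIND only when the Valgrind headers
 * are installed.  Otherwise the client requests become no-ops, so the
 * sketch still builds and runs (without any checking).
 */
#ifdef USE_VALGRIND
#include <valgrind/memcheck.h>
#else
#define VALGRIND_MAKE_MEM_NOACCESS(addr, len) ((void) 0)
#define VALGRIND_MAKE_MEM_DEFINED(addr, len)  ((void) 0)
#endif

#define TOY_BLCKSZ 8192			/* hypothetical page size */

/* Hypothetical stand-in for one buffer plus this process's pin count. */
typedef struct ToyBuffer
{
	int			local_pin_count;	/* pins held by this process only */
	char		page[TOY_BLCKSZ];
} ToyBuffer;

/* Pin: on the first local pin, tell Valgrind the page is accessible again. */
static char *
toy_pin(ToyBuffer *buf)
{
	if (buf->local_pin_count++ == 0)
		VALGRIND_MAKE_MEM_DEFINED(buf->page, TOY_BLCKSZ);
	return buf->page;
}

/* Unpin: when the last local pin is dropped, mark the page off-limits. */
static void
toy_unpin(ToyBuffer *buf)
{
	assert(buf->local_pin_count > 0);
	if (--buf->local_pin_count == 0)
		VALGRIND_MAKE_MEM_NOACCESS(buf->page, TOY_BLCKSZ);
}
```

Run under Valgrind with the real headers, any read or write of page after the final toy_unpin would be reported as an invalid access, which is the class of bug the proposal aims to catch. Without Valgrind the calls compile away and behaviour is unchanged.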

Re: Restore-reliability mode

From
Alvaro Herrera
Date:
Noah Misch wrote:

> - Add buildfarm members.  This entails reporting any bugs that prevent an
>   initial passing run.  Once you have a passing run, schedule regular runs.
>   Examples of useful additions:
>   - "./configure ac_cv_func_getopt_long=no, ac_cv_func_snprintf=no ..." to
>     enable all the replacement code regardless of the current platform's need
>     for it.  This helps distinguish "Windows bug" from "replacement code bug."
>   - --disable-integer-datetimes, --disable-float8-byval, --disable-float4-byval,
>     --disable-spinlocks, --disable-atomics, --disable-thread-safety,
>     --disable-largefile, #define RANDOMIZE_ALLOCATED_MEMORY
     #define RELCACHE_FORCE_RELEASE + #define CLOBBER_FREED_MEMORY

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Restore-reliability mode

From
Noah Misch
Date:
On Thu, Jul 23, 2015 at 04:53:49PM -0300, Alvaro Herrera wrote:
> Peter Geoghegan wrote:
> > On Sat, Jun 6, 2015 at 12:58 PM, Noah Misch <noah@leadboat.com> wrote:
> > >   - Call VALGRIND_MAKE_MEM_NOACCESS() on a shared buffer when its local pin
> > >     count falls to zero.  Under CLOBBER_FREED_MEMORY, wipe a shared buffer
> > >     when its global pin count falls to zero.
> > 
> > Did a patch for this ever materialize?
> 
> I think the first part would be something like the attached.

Neat.  Does it produce any new complaints during "make installcheck"?



Re: Restore-reliability mode

From
Alvaro Herrera
Date:
Noah Misch wrote:
> On Thu, Jul 23, 2015 at 04:53:49PM -0300, Alvaro Herrera wrote:
> > Peter Geoghegan wrote:
> > > On Sat, Jun 6, 2015 at 12:58 PM, Noah Misch <noah@leadboat.com> wrote:
> > > >   - Call VALGRIND_MAKE_MEM_NOACCESS() on a shared buffer when its local pin
> > > >     count falls to zero.  Under CLOBBER_FREED_MEMORY, wipe a shared buffer
> > > >     when its global pin count falls to zero.
> > > 
> > > Did a patch for this ever materialize?
> > 
> > I think the first part would be something like the attached.
> 
> Neat.  Does it produce any new complaints during "make installcheck"?

I only tried a few tests, for lack of time, and it didn't produce any.
(To verify that the whole thing was working properly, I reduced the
range of memory made available during PinBuffer and that resulted in a
crash immediately).  I am not really familiar with valgrind TBH and just
copied a recipe to run postmaster under it, so if someone with more
valgrind-fu could verify this, it would be great.


This part:

> > > >     Under CLOBBER_FREED_MEMORY, wipe a shared buffer when its
> > > >     global pin count falls to zero.

can be done without any valgrind, I think.  Any takers?

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
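
The second part of the proposal, wiping a buffer once nobody holds a pin on it, could look roughly like the sketch below. It is an illustration under stated assumptions, not PostgreSQL's buffer manager: ToySharedBuffer, toy_shared_pin, toy_shared_unpin, TOY_BLCKSZ and TOY_CLOBBER_BYTE are hypothetical, the pin count is a plain int with no locking, and the fill byte is an arbitrary recognizable value. CLOBBER_FREED_MEMORY is the symbol named in the thread; it is defined here by default only so that the sketch exercises the wipe.

```c
#include <assert.h>
#include <string.h>

/* Normally supplied by the build; defined here so the sketch wipes by default. */
#ifndef CLOBBER_FREED_MEMORY
#define CLOBBER_FREED_MEMORY
#endif

#define TOY_BLCKSZ       8192	/* hypothetical page size */
#define TOY_CLOBBER_BYTE 0x7F	/* arbitrary, easy-to-spot fill byte */

/* Hypothetical stand-in for a shared buffer and its global pin count. */
typedef struct ToySharedBuffer
{
	int			global_pin_count;	/* pins across all processes (toy: no locking) */
	unsigned char page[TOY_BLCKSZ];
} ToySharedBuffer;

static void
toy_shared_pin(ToySharedBuffer *buf)
{
	buf->global_pin_count++;
}

/* Unpin: once the global pin count reaches zero, overwrite the page. */
static void
toy_shared_unpin(ToySharedBuffer *buf)
{
	assert(buf->global_pin_count > 0);
	if (--buf->global_pin_count == 0)
	{
#ifdef CLOBBER_FREED_MEMORY
		memset(buf->page, TOY_CLOBBER_BYTE, TOY_BLCKSZ);
#endif
	}
}
```

Code that keeps using a stale pointer into the page after the last unpin then reads obvious garbage instead of plausible old data, so the bug tends to show up in ordinary test runs without Valgrind. A real implementation would presumably also have to preserve or write out valid page contents first, which this toy ignores.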