Thread: back-branch multixact fixes & 9.5 alpha/beta: schedule
Hi,

I think we have consensus that we should proceed with releasing fixes for the known multixact bugs in two stages:

- One set of minor releases with the fixes that we have now, to undo the damage caused by 9.4.2 and still present in 9.4.3. These changes will force immediate anti-wraparound vacuums for some users in order to repair bogus relminmxid, datminmxid, and control-file oldestMultiXid values. They will also fix failure-to-start problems confirmed to exist in 9.4.2 and suspected problems with crash recovery and recovery of an online backup.

- Another set of minor releases with the changes that Andres is working on to add WAL-logging for multixact truncation. I suppose this will require the usual dance of upgrading the standby first and then the master afterwards; there are details I'm not clear on yet here. This will fix other problems with recovery that are not new in 9.4.2 but go all the way back to 9.3.0.

In addition, it seems abundantly clear that everyone is very eager to get some sort of 9.5 version out the door - if not a beta, then an alpha. I am still a bit concerned that's premature, but, on the other hand, time is passing and at least some issues are getting dealt with in the meantime, so... that's something.

So, when shall we do all of this releasing? It seems like we could do stage-one of the multixact fixing this week, and then figure out how to do the other stuff after PGCon. Alternatively, we can let the latest round of changes that are already in the tree settle until after PGCon, and plan a release for stage-one of the multixact fixing then. Whichever we pick, we then need to figure out the timetable for the rest of it.

Thanks,

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > So, when shall we do all of this releasing? It seems like we could do > stage-one of the multixact fixing this week, and then figure out how > to do the other stuff after PGCon. Alternatively, we can let the > latest round of changes that are already in the tree settle until > after PGCon, and plan a release for stage-one of the multixact fixing > then. Whichever we pick, we then need to figure out the timetable for > the rest of it. I think we've already basically missed the window for releases this week. Not that we couldn't physically do it, but that we normally give the packagers more than one day's notice. (There's also the fact that we've already asked them to do two releases in the past three weeks.) I propose that we plan for back-branch releases the week after PGCon (wrap on Monday June 22), and 9.5alpha1 the week after that (wrap on Monday June 29). If there's a need for an additional round of back-branch releases shortly thereafter, we'll deal with that as needed. regards, tom lane
On Mon, Jun 8, 2015 at 11:40:43AM -0400, Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > So, when shall we do all of this releasing? It seems like we could do > > stage-one of the multixact fixing this week, and then figure out how > > to do the other stuff after PGCon. Alternatively, we can let the > > latest round of changes that are already in the tree settle until > > after PGCon, and plan a release for stage-one of the multixact fixing > > then. Whichever we pick, we then need to figure out the timetable for > > the rest of it. > > I think we've already basically missed the window for releases this week. > Not that we couldn't physically do it, but that we normally give the > packagers more than one day's notice. (There's also the fact that we've > already asked them to do two releases in the past three weeks.) Yeah, I think if we needed this out in an emergency, we would do it, but based on the volume of recent releases, it would be hard. Are we seeing user reports of failures even on the newest released versions, or are these preventive fixes? > I propose that we plan for back-branch releases the week after PGCon > (wrap on Monday June 22), and 9.5alpha1 the week after that (wrap on > Monday June 29). If there's a need for an additional round of back-branch > releases shortly thereafter, we'll deal with that as needed. I am working on the 9.5 release notes so will be done long before that date. I will finish sooner so we can do the week-long release notes feedback session. :-) -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
On Mon, Jun 8, 2015 at 12:36 PM, Bruce Momjian <bruce@momjian.us> wrote: > Yeah, I think if we needed this out in an emergency, we would do it, but > based on the volume of recent releases, it would be hard. Are we seeing > user reports of failures even on the newest released versions, or are > these preventive fixes? User reports of failures. See the thread about upgrading from 9.4.1 to 9.4.2 and having the server *fail to start*. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Bruce Momjian wrote: > On Mon, Jun 8, 2015 at 11:40:43AM -0400, Tom Lane wrote: > > Robert Haas <robertmhaas@gmail.com> writes: > > > So, when shall we do all of this releasing? It seems like we could do > > > stage-one of the multixact fixing this week, and then figure out how > > > to do the other stuff after PGCon. Alternatively, we can let the > > > latest round of changes that are already in the tree settle until > > > after PGCon, and plan a release for stage-one of the multixact fixing > > > then. Whichever we pick, we then need to figure out the timetable for > > > the rest of it. > > > > I think we've already basically missed the window for releases this week. > > Not that we couldn't physically do it, but that we normally give the > > packagers more than one day's notice. (There's also the fact that we've > > already asked them to do two releases in the past three weeks.) > > Yeah, I think if we needed this out in an emergency, we would do it, but > based on the volume of recent releases, it would be hard. Are we seeing > user reports of failures even on the newest released versions, or are > these preventive fixes? * people with the wrong oldestMulti setting in pg_control (which would be due to a buggy pg_upgrade being used long ago) will be unable to start if they upgrade to 9.3.7 or 9.3.8. A solution for them would be to downgrade to 9.3.6. We had reports of this problem starting just a couple of days after we released 9.4.2, I think. * We had a customer unable to refresh their base backups once they upgraded to 9.3.7; taking a new base backup would fail with a very similar error to those above (except no buggy pg_upgrade was involved). They seem to have gotten from under that problem by removing from crontab a script that ran whole-table vacuuming more frequently than with default settings. 
Their data is 3 TB in size, so the base backup takes long enough that multixact truncations occurred while the base backups were running, every time, so they were unrestorable. (Actually I just checked and it seems they haven't verified that they can take a new base backup -- the new one is still running.) Anyway my point is that for some guys these bugs are pretty critical. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Jun 8, 2015 at 12:39:24PM -0400, Robert Haas wrote: > On Mon, Jun 8, 2015 at 12:36 PM, Bruce Momjian <bruce@momjian.us> wrote: > > Yeah, I think if we needed this out in an emergency, we would do it, but > > based on the volume of recent releases, it would be hard. Are we seeing > > user reports of failures even on the newest released versions, or are > > these preventive fixes? > > User reports of failures. See the thread about upgrading from 9.4.1 > to 9.4.2 and having the server *fail to start*. OK, are these fixed in 9.4.2 or would the same failure happen in 9.4.3? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
On Mon, Jun 8, 2015 at 01:53:42PM -0300, Alvaro Herrera wrote: > * people with the wrong oldestMulti setting in pg_control (which would > be due to a buggy pg_upgrade being used long ago) will be unable to > start if they upgrade to 9.3.7 or 9.3.8. A solution for them would be > to downgrade to 9.3.6. We had reports of this problem starting just a > couple of days after we released 9.4.2, I think. > > * We had a customer unable to refresh their base backups once they > upgraded to 9.3.7; taking a new base backup would fail with a very > similar error to those above (except no buggy pg_upgrade was involved). > They seem to have gotten from under that problem by removing from > crontab a script that ran whole-table vacuuming more frequently than > with default settings. Their data is 3 TB in size, so the basebackup > takes long enough that multixact truncations occured while the base > backups were running, every time, so they were unrestorable. > > (Actually I just checked and it seems they haven't verified that they > can take a new base backup -- the new one is still running.) > > Anyway my point is that for some guys these bugs are pretty critical. OK, thanks for the summary. I assume they would still have problems with 9.4.3. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
Bruce Momjian wrote: > On Mon, Jun 8, 2015 at 12:39:24PM -0400, Robert Haas wrote: > > On Mon, Jun 8, 2015 at 12:36 PM, Bruce Momjian <bruce@momjian.us> wrote: > > > Yeah, I think if we needed this out in an emergency, we would do it, but > > > based on the volume of recent releases, it would be hard. Are we seeing > > > user reports of failures even on the newest released versions, or are > > > these preventive fixes? > > > > User reports of failures. See the thread about upgrading from 9.4.1 > > to 9.4.2 and having the server *fail to start*. > > OK, are these fixed in 9.4.2 or would the same failure happen in 9.4.3? The fixes are not yet in any released branch, hence the rush to get these out. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Jun 8, 2015 at 02:01:52PM -0300, Alvaro Herrera wrote: > Bruce Momjian wrote: > > On Mon, Jun 8, 2015 at 12:39:24PM -0400, Robert Haas wrote: > > > On Mon, Jun 8, 2015 at 12:36 PM, Bruce Momjian <bruce@momjian.us> wrote: > > > > Yeah, I think if we needed this out in an emergency, we would do it, but > > > > based on the volume of recent releases, it would be hard. Are we seeing > > > > user reports of failures even on the newest released versions, or are > > > > these preventive fixes? > > > > > > User reports of failures. See the thread about upgrading from 9.4.1 > > > to 9.4.2 and having the server *fail to start*. > > > > OK, are these fixed in 9.4.2 or would the same failure happen in 9.4.3? > > The fixes are not yet in any released branch, hence the rush to get > these out. OK, now I understand. :-O We have known failures that are not patched, hence the desire for a release. I am a little concerned we are getting into a case where community members dedicated to this issue are asking for a release, and it is going into the core black hole, meaning there is no visibility on what actions core is taking to make a decision. (This is not a criticism, but rather an observation of how it looks from a non-core-member perspective.) -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
On Mon, Jun 8, 2015 at 1:08 PM, Bruce Momjian <bruce@momjian.us> wrote: > OK, now I understand. :-O We have known failures that are not patched, > hence the desire for a release. > > I am a little concerned we are getting into a case where community > members dedicated to this issue are asking for a release, and it is > going into the core black hole, meaning there is no visibility on what > actions core is taking to make a decision. (This is not a criticism, > but rather an observation of how it looks from a non-core-member > perspective.) It's not exactly going into a black hole, but there was some communication between Tom and Andres on Friday that left Andres with the impression that if he spent the weekend testing the new code for problems and things went well, we'd be able to get a release this week. So he spent his weekend on that, rather than, say, doing something fun, and now Tom wants to wait two weeks. I'm not accusing anybody of anything, but if Andres felt like beating his head against a nearby wall at this point, I'd sympathize. Obviously, we need to do what is best for the project overall, not what is best for any individual developer's cranial integrity. But the decision-making process here is not entirely clear, and it's not entirely obvious that we're making the right ones. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Bruce Momjian wrote: > On Mon, Jun 8, 2015 at 02:01:52PM -0300, Alvaro Herrera wrote: > > > OK, are these fixed in 9.4.2 or would the same failure happen in 9.4.3? > > > > The fixes are not yet in any released branch, hence the rush to get > > these out. > > OK, now I understand. :-O We have known failures that are not patched, > hence the desire for a release. Part of the problem is that they are regressions: these systems did not have any trouble with 9.4.1/9.3.6 (other than being at risk of members overrun, of course.) > I am a little concerned we are getting into a case where community > members dedicated to this issue are asking for a release, and it is > going into the core black hole, meaning there is no visibility on what > actions core is taking to make a decision. At 2ndQuadrant, and I imagine EDB is in the same position, we have enough packaging stuff going on that we can ship patched releases to customers in case of trouble. I worry about users not having that privilege. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2015-06-08 13:16:00 -0400, Robert Haas wrote: > It's not exactly going into a black hole, but there was some > communication between Tom and Andres on Friday that left Andres with > the impression that if he spent the weekend testing the new code for > problems and things went well, we'd be able to get a release this > week. More precisely I felt rather unsure whether we'd release on Monday, Tuesday, or not at all. And I'd rather have a tested release out there than an untested one. Andres
Robert Haas <robertmhaas@gmail.com> writes: > It's not exactly going into a black hole, but there was some > communication between Tom and Andres on Friday that left Andres with > the impression that if he spent the weekend testing the new code for > problems and things went well, we'd be able to get a release this > week. So he spent his weekend on that, rather than, saying, doing > something fun, and now Tom wants to wait two weeks. I'm not accusing > anybody of anything, but if Andres felt like beating his head against > a nearby wall at this point, I'd sympathize. As I saw it, on Friday it was not clear whether we would be able to do a release this week. Now it's Monday, and we still have a rather long list of issues, and apparently Andres isn't all that happy even with the fixes that have gone in, because he still wants more time for testing. Are we really benefiting anyone if we force out a rushed release right now? What are the odds that it will make things worse? > Obviously, we need to do what is best for the project overall, not > what is best for any individual developer's cranial integrity. But > the decision-making process here is not entirely clear, and it's not > entirely obvious that we're making the right ones. AFAICT, you've been in on every single email thread discussing schedule over the past couple of weeks, and so has Andres. If you think it's unclear, it's not because somebody is hiding something from you, it's because it *is* unclear what we ought to do. regards, tom lane
On 2015-06-08 14:18:22 -0400, Tom Lane wrote: > As I saw it, on Friday it was not clear whether we would be able to do a > release this week. Now it's Monday, and we still have a rather long list > of issues Well, these issues aren't regressions, they're "just" general problems we need to fix. And some of them are going to require somewhat invasive changes. Both you and Robert have argued that the regressions should be fixed first. And by now you've convinced me. >, and apparently Andres isn't all that happy even with the fixes > that have gone in, because he still wants more time for testing. I'm now satisfied that the current HEAD is better than what was released last time round. Greetings, Andres Freund
On Mon, Jun 8, 2015 at 2:18 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > As I saw it, on Friday it was not clear whether we would be able to do a > release this week. Now it's Monday, and we still have a rather long list > of issues, and apparently Andres isn't all that happy even with the fixes > that have gone in, because he still wants more time for testing. My reading is that Andres is convinced that we've fixed the regressions in 9.4.2. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Andres Freund <andres@anarazel.de> writes: > On 2015-06-08 14:18:22 -0400, Tom Lane wrote: >> As I saw it, on Friday it was not clear whether we would be able to do a >> release this week. Now it's Monday, and we still have a rather long list >> of issues > Well, these issues aren't regressions, they're "just" general problems > we need to fix. And some of them are going to require somewhat invasive > changes. Both you and Robert have argued that the regressions should be > fixed first. And by now you've convinced me. >> , and apparently Andres isn't all that happy even with the fixes >> that have gone in, because he still wants more time for testing. > I'm now satisfied that the current HEAD is better than what was released > last time round. If there's general agreement that there are no regressions from 9.4.anything, then perhaps we should put out a release this week. Wrap today seems out of the question but we could still do it tomorrow for Friday release. Given the lack of notice, I doubt that all the packagers would be on board promptly; but as long as it's not a security release there's no urgent reason that they all have to be ready at the same time. regards, tom lane
On 06/08/2015 12:21 PM, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: >> On 2015-06-08 14:18:22 -0400, Tom Lane wrote: >>> As I saw it, on Friday it was not clear whether we would be able to do a >>> release this week. Now it's Monday, and we still have a rather long list >>> of issues > >> Well, these issues aren't regressions, they're "just" general problems >> we need to fix. And some of them are going to require somewhat invasive >> changes. Both you and Robert have argued that the regressions should be >> fixed first. And by now you've convinced me. > >>> , and apparently Andres isn't all that happy even with the fixes >>> that have gone in, because he still wants more time for testing. > >> I'm now satisfied that the current HEAD is better than what was released >> last time round. > > If there's general agreement that there are no regressions from > 9.4.anything, then perhaps we should put out a release this week. > Wrap today seems out of the question but we could still do it tomorrow > for Friday release. If we release on Friday, that is the 12th; PGCon starts the 16th and there is a weekend in between. If there is an unknown regression or new bug that is severe, are we going to have the resources to resolve it? Sincerely, JD -- Command Prompt, Inc. - http://www.commandprompt.com/ 503-667-4564 PostgreSQL Centered full stack support, consulting and development. Announcing "I'm offended" is basically telling the world you can't control your own emotions, so everyone else should do it for you.
Joshua D. Drake wrote: > If we release on Friday that is the 12th, PgCon is starts the 16th and there > is a weekend in between. If there is an unknown regression or new bug that > is severe, are we going to have the resources to resolve it? ISTM if that happens, we're no worse off than currently. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 06/08/2015 12:31 PM, Alvaro Herrera wrote: > > Joshua D. Drake wrote: > >> If we release on Friday that is the 12th, PgCon is starts the 16th and there >> is a weekend in between. If there is an unknown regression or new bug that >> is severe, are we going to have the resources to resolve it? > > ISTM if that happens, we're no worse off than currently. Technically sure, reputation? If we really want to do this, let's do it. I am not trying to throw a wrench into things. I just really don't want to have to go back, yet again. Sincerely, JD -- Command Prompt, Inc. - http://www.commandprompt.com/ 503-667-4564 PostgreSQL Centered full stack support, consulting and development. Announcing "I'm offended" is basically telling the world you can't control your own emotions, so everyone else should do it for you.
Joshua D. Drake wrote: > > On 06/08/2015 12:31 PM, Alvaro Herrera wrote: > > > >Joshua D. Drake wrote: > > > >>If we release on Friday that is the 12th, PgCon is starts the 16th and there > >>is a weekend in between. If there is an unknown regression or new bug that > >>is severe, are we going to have the resources to resolve it? > > > >ISTM if that happens, we're no worse off than currently. > > Technically sure, reputation? Well, reputation-wise we're already losing every time somebody's server crashes on 9.4.2 and finds it won't start, where it did start fine with 9.4.1. Maybe they simply wanted to change shared_buffers and the server won't start anymore. Some people even update binaries with the server running; if they don't restart immediately, it could be several days before it fails to start. It's pretty scary. > If we really want to do this, let's do it. I am not trying to throw a wrench > into things. I just really don't want to have to go back, yet again. Sure, me neither. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 06/08/2015 12:48 PM, Alvaro Herrera wrote: > Well, reputation-wise we're already losing every time somebody's server > crashes on 9.4.2 and finds it won't start, where it did start fine with > 9.4.1. Maybe they simply wanted to change shared_buffers and the server > won't start anymore. Some people even update binaries with the server > running; if they don't restart immediately, it could be several days > before it fails to start. It's pretty scary. Yeah no doubt there. JD -- Command Prompt, Inc. - http://www.commandprompt.com/ 503-667-4564 PostgreSQL Centered full stack support, consulting and development. Announcing "I'm offended" is basically telling the world you can't control your own emotions, so everyone else should do it for you.
On 06/08/2015 12:48 PM, Alvaro Herrera wrote: > Well, reputation-wise we're already losing every time somebody's server > crashes on 9.4.2 and finds it won't start, where it did start fine with > 9.4.1. Maybe they simply wanted to change shared_buffers and the server > won't start anymore. Some people even update binaries with the server > running; if they don't restart immediately, it could be several days > before it fails to start. It's pretty scary. I'm confused by this discussion. 9.4.3 is released. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Josh Berkus wrote:
> On 06/08/2015 12:48 PM, Alvaro Herrera wrote:
> > Well, reputation-wise we're already losing every time somebody's server
> > crashes on 9.4.2 and finds it won't start, where it did start fine with
> > 9.4.1. Maybe they simply wanted to change shared_buffers and the server
> > won't start anymore. Some people even update binaries with the server
> > running; if they don't restart immediately, it could be several days
> > before it fails to start. It's pretty scary.
>
> I'm confused by this discussion. 9.4.3 is released.

The bug was not fixed by 9.4.3. It was fixed by this commit:

Author: Robert Haas <rhaas@postgresql.org>
Branch: master [068cfadf9] 2015-06-05 09:31:57 -0400
Branch: REL9_4_STABLE [b6a3444fa] 2015-06-05 09:33:52 -0400
Branch: REL9_3_STABLE [2a9b01928] 2015-06-05 09:34:15 -0400

    Cope with possible failure of the oldest MultiXact to exist.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
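For readers trying to keep the version numbers straight, here is a minimal Python sketch of the picture the messages above paint. It assumes the only releases carrying the failure-to-start regression are the ones named in this thread (9.3.7, 9.3.8, 9.4.2 and 9.4.3), and the helper names are hypothetical; it is not derived from the release notes.

```python
# Releases this thread names as carrying the failure-to-start regression.
# Assumption: the set is taken only from the messages in this thread.
AFFECTED = {(9, 3, 7), (9, 3, 8), (9, 4, 2), (9, 4, 3)}


def parse_version(text):
    """Turn a dotted version string such as '9.4.2' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def has_startup_regression(version_text):
    """Report whether the thread lists this release as affected."""
    return parse_version(version_text) in AFFECTED


if __name__ == "__main__":
    for version in ("9.3.6", "9.3.8", "9.4.1", "9.4.3"):
        print(version, has_startup_regression(version))
```

The set is spelled out explicitly rather than computed from a range, since the thread only vouches for these four releases.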
On Mon, 8 Jun 2015 13:53:42 -0300 Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > * people with the wrong oldestMulti setting in pg_control (which would > be due to a buggy pg_upgrade being used long ago) will be unable to > start if they upgrade to 9.3.7 or 9.3.8. A solution for them would be > to downgrade to 9.3.6. We had reports of this problem starting just a > couple of days after we released 9.4.2, I think. Does this mean that people with wrong oldestMulti settings in pg_control, due to a buggy pg_upgrade being used long ago, can fix this by updating to 9.3.9 when it is released? Asking for a friend... -dg -- David Gould 510 282 0869 daveg@sonic.net If simplicity worked, the world would be overrun with insects.
On Tue, Jun 9, 2015 at 3:04 AM, David Gould <daveg@sonic.net> wrote: > On Mon, 8 Jun 2015 13:53:42 -0300 > Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > >> * people with the wrong oldestMulti setting in pg_control (which would >> be due to a buggy pg_upgrade being used long ago) will be unable to >> start if they upgrade to 9.3.7 or 9.3.8. A solution for them would be >> to downgrade to 9.3.6. We had reports of this problem starting just a >> couple of days after we released 9.4.2, I think. > > Does this mean that for people with wrong oldestMulti settings in pg_control > due to a buggy pg_upgrade being used long ago can fix this by updating to > 9.3.9 when it is released? Asking for a friend... If the value is buggy because it is 1 when it should have some larger value, yes, this should fix it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
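As a rough way to see whether a cluster might be in the situation described above, the following Python sketch reads the text printed by pg_controldata and pulls out the "Latest checkpoint's oldestMultiXid" field. The function names and the sample text are hypothetical, and a value of 1 is only a hint: it is also a perfectly normal value on a cluster that has never truncated multixacts, so this cannot tell a bogus 1 from a legitimate one.

```python
import re


def oldest_multixid(controldata_text):
    """Return oldestMultiXid from pg_controldata output, or None if absent."""
    match = re.search(
        r"^Latest checkpoint's oldestMultiXid:\s*(\d+)\s*$",
        controldata_text,
        re.MULTILINE,
    )
    return int(match.group(1)) if match else None


def worth_a_closer_look(controldata_text):
    """Flag the value 1, the case discussed in the thread. A hint, not a verdict."""
    return oldest_multixid(controldata_text) == 1


# Hypothetical excerpt of pg_controldata output, for illustration only.
SAMPLE = """\
Latest checkpoint's NextMultiXactId:  45001
Latest checkpoint's oldestMultiXid:   1
"""

if __name__ == "__main__":
    print(oldest_multixid(SAMPLE), worth_a_closer_look(SAMPLE))
```

In practice the text would come from running pg_controldata against the data directory; the sketch takes a string so it can be tried without a server.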