Thread: 9a57858f1103b89a5674f0d50c5fe1f756411df6
On the pgsql-packagers list, there has been some (OT for that list)
discussion of whether commit 9a57858f1103b89a5674f0d50c5fe1f756411df6
is sufficiently serious to justify yet another immediate minor release
of 9.3.x. The relevant questions seem to be:

1. Is it really bad?

2. Does it affect a lot of people or only a few?

3. Are there more, equally bad bugs that are unfixed, or perhaps even
unreported, yet?

Obviously, we don't want to leave serious bugs unpatched. On the
other hand, as Tom pointed out in that discussion, releases are a lot
of work, and we can't do them for every commit.

Discuss.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> Discuss.

This thread badly needs a more informative Subject line.

But, yeah: do people think the referenced commit fixes a bug bad enough
to deserve a quick update release? If so, why? Multiple reports of
problems in the field would be a good reason, but I've not seen such.

regards, tom lane
Bug: Fix Wal replay of locking an updated tuple (WAS: Re: 9a57858f1103b89a5674f0d50c5fe1f756411df6)
From: "Joshua D. Drake"
On 03/12/2014 06:15 PM, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> Discuss.
>
> This thread badly needs a more informative Subject line.
>

No kidding. Or at least a link for goodness sake. Although the
pgsql-packers list wasn't all that helpful either.

What I know is that we have a known in the wild version of PostgreSQL
that eats data. That is bad. It is unfortunate that we just released
9.3.3 but we can't knowingly allow people to get their data eaten. We
look bad.

It appears that this is the specific bug:

http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=9a57858f1103b89a5674f0d50c5fe1f756411df6

JD

--
Command Prompt, Inc. - http://www.commandprompt.com/ 509-416-6579
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc
For my dreams of your image that blossoms
a rose in the deeps of my heart. - W.B. Yeats
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> This thread badly needs a more informative Subject line.

Agreed.

> But, yeah: do people think the referenced commit fixes a bug bad enough
> to deserve a quick update release? If so, why? Multiple reports of
> problems in the field would be a good reason, but I've not seen such.

Uh, isn't what brought this to light two independent complaints from
Peter and Greg Stark of seeing corruption in the field due to this?

Peter's initial email also indicated it was two different systems which
had gotten bit by this and Greg explicitly stated that he was working on
an independent database from what Peter was reporting on, so that's at
least 2 (one each), or 3 (if you count databases, as Peter had 2).
Sure, they're all from Heroku, but I find it highly unlikely no one else
has run into this issue. More likely, they simply haven't realized it's
happened to them (which is another reason this is a particularly nasty
bug..).

I understand that another release makes work for everyone, and that
stinks, and it's also no fun in the press to have *another* release that
is fixing corruption issues, but sitting on a fix which is actively
causing corruption in the field isn't any good either.

So, my +1 is for a "quick update release"- and if there's a way I can
help offload some of the work (or at least learn the steps to help with
offloading in the future), I'm happy to do so- just let me know.

Thanks,

Stephen
Re: Bug: Fix Wal replay of locking an updated tuple (WAS: Re: 9a57858f1103b89a5674f0d50c5fe1f756411df6)
From: David Johnston
Joshua D. Drake wrote
> On 03/12/2014 06:15 PM, Tom Lane wrote:
>> Robert Haas writes:
>>> Discuss.
>>
>> This thread badly needs a more informative Subject line.
>>
>
> No kidding. Or at least a link for goodness sake. Although the
> pgsql-packers list wasn't all that helpful either.

A link would be nice though if -packers is a security list then that may
not be a good thing since -hackers is public...

A quick search of Nabble and the "Mailing Lists" section of the homepage
do not indicate pgsql-packers exists - at least not in any publicly (even
if read-only) accessible way.

David J.
On 13 Mar 2014 01:36, "Stephen Frost" <sfrost@snowman.net> wrote:
>
> * Tom Lane (tgl@sss.pgh.pa.us) wrote:
> > This thread badly needs a more informative Subject line.
>
> Agreed.
>
> > But, yeah: do people think the referenced commit fixes a bug bad enough
> > to deserve a quick update release? If so, why? Multiple reports of
> > problems in the field would be a good reason, but I've not seen such.
>
> Uh, isn't what brought this to light two independent complaints from
> Peter and Greg Stark of seeing corruption in the field due to this?
>
> Peter's initial email also indicated it was two different systems which
> had gotten bit by this and Greg explicitly stated that he was working on
> an independent database from what Peter was reporting on, so that's at
> least 2 (one each), or 3 (if you count databases, as Peter had 2).
> Sure, they're all from Heroku, but I find it highly unlikely no one else
> has run into this issue. More likely, they simply haven't realized it's
> happened to them (which is another reason this is a particularly nasty
> bug..).

We have the two databases where we're sure this was the problem. On the
one I worked on the customer complained that it happened repeatedly.

The key I demonstrated here wasn't even the one the customer was
complaining about. It seems their usage pattern made it extremely easy
to trigger and that usage pattern arose naturally from using a rails
module called counter_cache which maintains a cache of the count of a
child table in the parent table.

We also have a few other customers complaining about duplicate keys.
It's hard to be sure but these may have been standbys where the problem
occurred ages ago and they only now activated their standby and ran
into the problem.

That's what worries me most about this bug. You'll only detect it if
you're routinely querying your standby. If you have a standby for HA
purposes it might be corrupt for a long time without you realising it.
We may be fielding corruption complaints for a long time without being
able to conclusively prove whether it's due to this bug or not.
On 2014-03-12 20:09:23 -0400, Robert Haas wrote:
> On the pgsql-packagers list, there has been some (OT for that list)
> discussion of whether commit 9a57858f1103b89a5674f0d50c5fe1f756411df6
> is sufficiently serious to justify yet another immediate minor release
> of 9.3.x. The relevant questions seem to be:
>
> 1. Is it really bad?

It breaks the ctid of concurrently updated/locked tuples during WAL
replay. Which can lead to all sorts of nastiness like indexes not
finding any rows. Since that kind of locking/updating is pretty common
with foreign keys, it's not an unlikely scenario.
Unfortunately FPIs won't save the day in all that many scenarios because
there'll normally be an XLOG_HEAP2_LOCK_UPDATED before the XLOG_HEAP_LOCK
record which is replayed badly.

Now, one could argue that it only affects replicas or servers that
crashed at some point, but I think that's not much comfort.

> 2. Does it affect a lot of people or only a few?

It's been reported twice (Peter Geoghegan, Greg Stark) by Heroku and one
person on IRC could reproduce it repeatedly. The latter was what made me
look into it again and find the bug. Greg has confirmed that it fixes
the bug when replaying the WAL again.

> 3. Are there more, equally bad bugs that are unfixed, or perhaps even
> unreported, yet?

Uh. I have no idea. I don't know of any reports that can't be attributed
to any of these, but as you're also including unreported bugs in that
question...

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, 2014-03-13 at 12:00 +0100, Andres Freund wrote:
> On 2014-03-12 20:09:23 -0400, Robert Haas wrote:
> > On the pgsql-packagers list, there has been some (OT for that list)
> > discussion of whether commit 9a57858f1103b89a5674f0d50c5fe1f756411df6
> > is sufficiently serious to justify yet another immediate minor release
> > of 9.3.x. The relevant questions seem to be:
> >
> > 1. Is it really bad?
>
> It breaks the ctid of concurrently updated/locked tuples during WAL
> replay. Which can lead to all sorts of nastiness like indexes not
> finding any rows. Since that kind of locking/updating is pretty common
> with foreign keys, it's not an unlikely scenario.
> Unfortunately FPIs won't save the day in all that many scenarios because
> there'll normally be an XLOG_HEAP2_LOCK_UPDATED before the XLOG_HEAP_LOCK
> record which is replayed badly.
>
> Now, one could argue that it only affects replicas or servers that
> crashed at some point, but I think that's not much comfort.
>
> > 2. Does it affect a lot of people or only a few?
>
> It's been reported twice (Peter Geoghegan, Greg Stark) by Heroku and one
> person on IRC could reproduce it repeatedly. The latter was what made me
> look into it again and find the bug. Greg has confirmed that it fixes
> the bug when replaying the WAL again.
>
> > 3. Are there more, equally bad bugs that are unfixed, or perhaps even
> > unreported, yet?
>
> Uh. I have no idea. I don't know of any reports that can't be attributed
> to any of these, but as you're also including unreported bugs in that
> question...
>

Does this also affect other branches? 9.2?

regards,

--
Jozef Mlich <jmlich@redhat.com>
Associate Software Engineer - EMEA ENG Developer Experience
Mobile: +420 604 217 719
http://cz.redhat.com/
Red Hat, Inc.
On 2014-03-13 13:06:00 +0100, Jozef Mlich wrote:
> Does this affect also other branches? 9.2 ?

Nope, it's 9.3 only.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
All,

First, I'll note that one of the reasons we haven't had a bunch of
reports from the field about this is that a lot of our users have yet to
apply 9.3.3, so if they have corruption issues they probably attribute
them to the issues which are fixed in 9.3.3. I know that's the case
with our customer base.

As much as I hate extra releases, it might be better to push this one
out; if we can get it out in the next 2 weeks, folks can skip the
downtime for 9.3.3 and go straight to 9.3.4.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On Thu, Mar 13, 2014 at 5:10 PM, Josh Berkus <josh@agliodbs.com> wrote:
> First, I'll note that one of the reasons we haven't had a bunch of
> reports from the field about this is that a lot of our users have yet to
> apply 9.3.3, so if they have corruption issues they probably attribute
> them to the issues which are fixed in 9.3.3. I know that's the case
> with our customer base.

I was speculating that the reason we saw a sudden bunch after 9.3.3
might be that there might be a number of people who wait N releases
before upgrading and the number of people for whom the value of N is 3
might be significant.

Or it could be a coincidence. Users will only notice if they fail over
to their standby or run queries on their standby.

--
greg
Alvaro, All:

Can someone help me with what we should tell users about this issue?

1. What users are especially likely to encounter it? All replication
users, or do they have to do something else?

2. What error messages will affected users get? A link to the reports
of this issue on pgsql lists would tell me this, but I'm not sure
exactly which error reports are associated.

3. If users have already encountered corruption due to the fixed issue,
what do they need to do after updating? re-basebackup?

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
Josh Berkus wrote:
> Alvaro, All:
>
> Can someone help me with what we should tell users about this issue?
>
> 1. What users are especially likely to encounter it? All replication
> users, or do they have to do something else?

Replication users are more likely to get it on replicas, of course,
because that's running the recovery code continuously; however, anyone
that suffers a crash of a standalone system might also be affected.
(And it'd be worse, even, because that corrupts your main source of
data, not just a replicated copy of it.) Obviously, if you have a
corrupted replica and fail over to it, you're similarly screwed.

Basically you might be affected if you have tables that are referenced
in primary keys and to which you also apply UPDATEs that are
HOT-enabled.

> 2. What error messages will affected users get? A link to the reports
> of this issue on pgsql lists would tell me this, but I'm not sure
> exactly which error reports are associated.

Not sure about error messages. Essentially some rows would be visible
to seqscans but not to index scans. These are the threads:

http://www.postgresql.org/message-id/CAM3SWZTMQiCi5PV5OWHb+bYkUcnCk=O67w0cSswPvV7XfUcU5g@mail.gmail.com
http://www.postgresql.org/message-id/CAM-w4HPTOeMT4KP0OJK+mGgzgcTOtLRTvFZyvD0O4aH-7dxo3Q@mail.gmail.com

> 3. If users have already encountered corruption due to the fixed issue,
> what do they need to do after updating? re-basebackup?

Replicas can be fixed by recloning, yeah. I haven't stopped to think
how to fix the masters. Greg, Peter, any clues there?

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
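
The symptom described above (rows visible to seqscans but not to index scans, plus the duplicate-key complaints mentioned earlier in the thread) suggests one rough way to look for damage: read the same key column once through a sequential scan and once through an index scan, then compare the two result sets. The following is only a minimal sketch under stated assumptions, not a procedure anyone in this thread proposed. The names compare_scans, FORCE_SEQSCAN and FORCE_INDEXSCAN are made up for illustration. The enable_* settings are ordinary PostgreSQL planner settings, but they only discourage a plan type, so the plan actually used should be confirmed with EXPLAIN. Running the queries is left to the caller.

```python
from collections import Counter

# Planner settings that steer a session toward one scan type.
# They discourage rather than forbid a plan, so check with EXPLAIN.
FORCE_SEQSCAN = [
    "SET enable_indexscan = off",
    "SET enable_indexonlyscan = off",
    "SET enable_bitmapscan = off",
]
FORCE_INDEXSCAN = ["SET enable_seqscan = off"]


def compare_scans(seqscan_keys, indexscan_keys):
    """Compare key values fetched via a seqscan and via an index scan.

    Both arguments are plain lists of key values that the caller has
    already fetched (hypothetical helper, not part of PostgreSQL).
    """
    seq = Counter(seqscan_keys)
    idx = Counter(indexscan_keys)
    return {
        # In the heap but not reachable through the index.
        "missing_from_index": sorted(k for k in seq if k not in idx),
        # Returned by the index scan but not by the seqscan.
        "missing_from_heap": sorted(k for k in idx if k not in seq),
        # Values of a supposedly unique key seen more than once.
        "duplicates": sorted(k for k, n in seq.items() if n > 1),
    }


if __name__ == "__main__":
    # Toy data: key 2 appears twice in the heap and is absent from the index.
    print(compare_scans([1, 2, 2, 3], [1, 3]))
```

A report with all three lists empty does not prove a table is healthy; it only means this particular comparison found no mismatch. As noted above, replicas with damage can be fixed by recloning.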