Thread: 9a57858f1103b89a5674f0d50c5fe1f756411df6

9a57858f1103b89a5674f0d50c5fe1f756411df6

From

Robert Haas

Date:

13 March 2014, 03:09:28

On the pgsql-packagers list, there has been some (OT for that list)
discussion of whether commit 9a57858f1103b89a5674f0d50c5fe1f756411df6
is sufficiently serious to justify yet another immediate minor release
of 9.3.x.  The relevant questions seem to be:

1. Is it really bad?

2. Does it affect a lot of people or only a few?

3. Are there more, equally bad bugs that are unfixed, or perhaps even
unreported, yet?

Obviously, we don't want to leave serious bugs unpatched.  On the
other hand, as Tom pointed out in that discussion, releases are a lot
of work, and we can't do them for every commit.

Discuss.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: 9a57858f1103b89a5674f0d50c5fe1f756411df6

From

Tom Lane

Date:

13 March 2014, 04:15:28

Robert Haas <robertmhaas@gmail.com> writes:
> Discuss.

This thread badly needs a more informative Subject line.

But, yeah: do people think the referenced commit fixes a bug bad enough
to deserve a quick update release?  If so, why?  Multiple reports of
problems in the field would be a good reason, but I've not seen such.
        regards, tom lane

Bug: Fix Wal replay of locking an updated tuple (WAS: Re: 9a57858f1103b89a5674f0d50c5fe1f756411df6)

From

"Joshua D. Drake"

Date:

13 March 2014, 04:23:05

On 03/12/2014 06:15 PM, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> Discuss.
>
> This thread badly needs a more informative Subject line.
>

No kidding. Or at least a link for goodness sake. Although the 
pgsql-packers list wasn't all that helpful either.

What I know is that we have a known in the wild version of PostgreSQL 
that eats data. That is bad. It is unfortunate that we just released 
9.3.3 but we can't knowingly allow people to get their data eaten. We 
look bad.

It appears that this is the specific bug:

http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=9a57858f1103b89a5674f0d50c5fe1f756411df6

JD

-- 
Command Prompt, Inc. - http://www.commandprompt.com/  509-416-6579
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc
For my dreams of your image that blossoms   a rose in the deeps of my heart. - W.B. Yeats

Re: 9a57858f1103b89a5674f0d50c5fe1f756411df6

From

Stephen Frost

Date:

13 March 2014, 04:36:04

* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> This thread badly needs a more informative Subject line.

Agreed.

> But, yeah: do people think the referenced commit fixes a bug bad enough
> to deserve a quick update release?  If so, why?  Multiple reports of
> problems in the field would be a good reason, but I've not seen such.

Uh, isn't what brought this to light two independent complaints from
Peter and Greg Stark of seeing corruption in the field due to this?

Peter's initial email also indicated it was two different systems which
had gotten bit by this and Greg explicitly stated that he was working on
an independent database from what Peter was reporting on, so that's at
least 2 (one each), or 3 (if you count databases, as Peter had 2).
Sure, they're all from Heroku, but I find it highly unlikely no one else
has run into this issue.  More likely, they simply haven't realized it's
happened to them (which is another reason this is a particularly nasty
bug..).

I understand that another release makes work for everyone, and that
stinks, and it's also no fun in the press to have *another* release that
is fixing corruption issues, but sitting on a fix which is actively
causing corruption in the field isn't any good either.

So, my +1 is for a "quick update release"- and if there's a way I can
help offload some of the work (or at least learn the steps to help with
offloading in the future), I'm happy to do so- just let me know.
Thanks,
    Stephen

Re: Bug: Fix Wal replay of locking an updated tuple (WAS: Re: 9a57858f1103b89a5674f0d50c5fe1f756411df6)

From

David Johnston

Date:

13 March 2014, 04:37:41

Joshua D. Drake wrote
> On 03/12/2014 06:15 PM, Tom Lane wrote:
>> Robert Haas <robertmhaas@> writes:
>>> Discuss.
>>
>> This thread badly needs a more informative Subject line.
>>
> 
> No kidding. Or at least a link for goodness sake. Although the 
> pgsql-packers list wasn't all that helpful either.

A link would be nice, though if -packers is a security list then that may not
be a good thing, since -hackers is public...

A quick search of Nabble and the "Mailing Lists" section of the homepage does
not indicate that pgsql-packers exists - at least not in any publicly (even if
read-only) accessible way.

David J.

--
View this message in context:
http://postgresql.1045698.n5.nabble.com/9a57858f1103b89a5674f0d50c5fe1f756411df6-tp5795816p5795827.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.

Re: 9a57858f1103b89a5674f0d50c5fe1f756411df6

From

Greg Stark

Date:

13 March 2014, 07:13:40

On 13 Mar 2014 01:36, "Stephen Frost" <sfrost@snowman.net> wrote:
>
> * Tom Lane (tgl@sss.pgh.pa.us) wrote:
> > This thread badly needs a more informative Subject line.
>
> Agreed.
>
> > But, yeah: do people think the referenced commit fixes a bug bad enough
> > to deserve a quick update release?  If so, why?  Multiple reports of
> > problems in the field would be a good reason, but I've not seen such.
>
> Uh, isn't what brought this to light two independent complaints from
> Peter and Greg Stark of seeing corruption in the field due to this?
>
> Peter's initial email also indicated it was two different systems which
> had gotten bit by this and Greg explicitly stated that he was working on
> an independent database from what Peter was reporting on, so that's at
> least 2 (one each), or 3 (if you count databases, as Peter had 2).
> Sure, they're all from Heroku, but I find it highly unlikely no one else
> has run into this issue.  More likely, they simply haven't realized it's
> happened to them (which is another reason this is a particularly nasty
> bug..).

We have the two databases where we're sure this was the problem. On the
one I worked on the customer complained that it happened repeatedly.

The key I demonstrated here wasn't even the one the customer was
complaining about. It seems their usage pattern made it extremely easy
to trigger and that usage pattern arose naturally from using a rails
module called counter_cache which maintains a cache of the count of a
child table in the parent table.

We also have a few other customers complaining about duplicate keys.
It's hard to be sure but these may have been standbys where the problem
occurred ages ago and they only now activated their standby and ran
into the problem.

That's what worries me most about this bug. You'll only detect it if
you're routinely querying your standby. If you have a standby for HA
purposes it might be corrupt for a long time without you realising it.
We may be fielding corruption complaints for a long time without being
able to conclusively prove whether it's due to this bug or not.

Re: 9a57858f1103b89a5674f0d50c5fe1f756411df6

From

Andres Freund

Date:

13 March 2014, 14:01:07

On 2014-03-12 20:09:23 -0400, Robert Haas wrote:
> On the pgsql-packagers list, there has been some (OT for that list)
> discussion of whether commit 9a57858f1103b89a5674f0d50c5fe1f756411df6
> is sufficiently serious to justify yet another immediate minor release
> of 9.3.x.  The relevant questions seem to be:
> 
> 1. Is it really bad?

It breaks the ctid of concurrently updated/locked tuples during WAL
replay. Which can lead to all sorts of nastiness like indexes not
finding any rows. Since that kind of locking/updating is pretty common
with foreign keys, it's not an unlikely scenario.
Unfortunately FPIs won't save the day in all that many scenarios because
there'll normally be a XLOG_HEAP2_LOCK_UPDATED before the XLOG_HEAP_LOCK
record which is replayed badly.

Now, one could argue that it only affects replicas or servers that
crashed at some point, but I think that's not much comfort.

> 2. Does it affect a lot of people or only a few?

It's been reported twice (Peter Geoghegan, Greg Stark) by Heroku and one
person on IRC could reproduce it repeatedly. The latter was what made me
look into it again and find the bug. Greg has confirmed that it fixes
the bug when replaying the WAL again.

> 3. Are there more, equally bad bugs that are unfixed, or perhaps even
> unreported, yet?

Uh. I have no idea. I don't know of any reports that can't be attributed
to any of these, but as you're also including unreported bugs in that
question...

Greetings,

Andres Freund

-- 
Andres Freund                 http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: 9a57858f1103b89a5674f0d50c5fe1f756411df6

From

Jozef Mlich

Date:

13 March 2014, 15:06:12

On Thu, 2014-03-13 at 12:00 +0100, Andres Freund wrote:
> On 2014-03-12 20:09:23 -0400, Robert Haas wrote:
> > On the pgsql-packagers list, there has been some (OT for that list)
> > discussion of whether commit 9a57858f1103b89a5674f0d50c5fe1f756411df6
> > is sufficiently serious to justify yet another immediate minor release
> > of 9.3.x.  The relevant questions seem to be:
> > 
> > 1. Is it really bad?
> 
> It breaks the ctid of concurrently updated/locked tuples during WAL
> replay. Which can lead to all sorts of nastiness like indexes not
> finding any rows. Since that kind of locking/updating is pretty common
> with foreign keys, it's not an unlikely scenario.
> Unfortunately FPIs won't save the day in all that many scenarios because
> there'll normally be a XLOG_HEAP2_LOCK_UPDATED before the XLOG_HEAP_LOCK
> record which is replayed badly.
> 
> Now, one could argue that it only affects replicas or servers that
> crashed at some point, but I think that's not much comfort.
> 
> > 2. Does it affect a lot of people or only a few?
> 
> It's been reported twice (Peter Geoghegan, Greg Stark) by Heroku and one
> person on IRC could reproduce it repeatedly. The latter was what made me
> look into it again and find the bug. Greg has confirmed that it fixes
> the bug when replaying the WAL again.
> 
> > 3. Are there more, equally bad bugs that are unfixed, or perhaps even
> > unreported, yet?
> 
> Uh. I have no idea. I don't know of any reports that can't be attributed
> to any of these, but as you're also including unreported bugs in that
> question...
> 

Does this affect also other branches? 9.2 ?

regards,
-- 
Jozef Mlich <jmlich@redhat.com>
Associate Software Engineer - EMEA ENG Developer Experience
Mobile: +420 604 217 719
http://cz.redhat.com/
Red Hat, Inc.

Re: 9a57858f1103b89a5674f0d50c5fe1f756411df6

From

Andres Freund

Date:

13 March 2014, 15:12:24

On 2014-03-13 13:06:00 +0100, Jozef Mlich wrote:
> Does this affect also other branches? 9.2 ?

Nope, it's 9.3 only.

Greetings,

Andres Freund

-- 
Andres Freund                 http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: 9a57858f1103b89a5674f0d50c5fe1f756411df6

From

Josh Berkus

Date:

13 March 2014, 20:10:09

All,

First, I'll note that one of the reasons we haven't had a bunch of
reports from the field about this is that a lot of our users have yet to
apply 9.3.3, so if they have corruption issues they probably attribute
them to the issues which are fixed in 9.3.3.  I know that's the case
with our customer base.

As much as I hate extra releases, it might be better to push this one
out; if we can get it out in the next 2 weeks, folks can skip the
downtime for 9.3.3 and go straight to 9.3.4.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

Re: 9a57858f1103b89a5674f0d50c5fe1f756411df6

From

Greg Stark

Date:

13 March 2014, 20:24:15

On Thu, Mar 13, 2014 at 5:10 PM, Josh Berkus <josh@agliodbs.com> wrote:
> First, I'll note that one of the reasons we haven't had a bunch of
> reports from the field about this is that a lot of our users have yet to
> apply 9.3.3, so if they have corruption issues they probably attribute
> them to the issues which are fixed in 9.3.3.  I know that's the case
> with our customer base.

I was speculating that the reason we saw a sudden bunch after 9.3.3
might be that there might be a number of people who wait N releases
before upgrading and the number of people for whom the value of N is 3
might be significant.

Or it could be a coincidence. Users will only notice if they fail over
to their standby or run queries on their standby.

-- 
greg

Re: 9a57858f1103b89a5674f0d50c5fe1f756411df6

From

Josh Berkus

Date:

14 March 2014, 22:51:21

Alvaro, All:

Can someone help me with what we should tell users about this issue?

1. What users are especially likely to encounter it?  All replication
users, or do they have to do something else?

2. What error messages will affected users get?  A link to the reports
of this issue on pgsql lists would tell me this, but I'm not sure
exactly which error reports are associated.

3. If users have already encountered corruption due to the fixed issue,
what do they need to do after updating?  re-basebackup?


-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

Re: 9a57858f1103b89a5674f0d50c5fe1f756411df6

From

Alvaro Herrera

Date:

15 March 2014, 00:19:16

Josh Berkus wrote:
> Alvaro, All:
> 
> Can someone help me with what we should tell users about this issue?
> 
> 1. What users are especially likely to encounter it?  All replication
> users, or do they have to do something else?

Replication users are more likely to get it on replicas, of course,
because that's running the recovery code continuously; however, anyone
that suffers a crash of a standalone system might also be affected.
(And it'd be worse, even, because that corrupts your main source of
data, not just a replicated copy of it.)  Obviously, if you have a
corrupted replica and fail over to it, you're similarly screwed.

Basically you might be affected if you have tables that are referenced
in primary keys and to which you also apply UPDATEs that are
HOT-enabled.

> 2. What error messages will affected users get?  A link to the reports
> of this issue on pgsql lists would tell me this, but I'm not sure
> exactly which error reports are associated.

Not sure about error messages.  Essentially some rows would be visible
to seqscans but not to index scans.
These are the threads:
http://www.postgresql.org/message-id/CAM3SWZTMQiCi5PV5OWHb+bYkUcnCk=O67w0cSswPvV7XfUcU5g@mail.gmail.com
http://www.postgresql.org/message-id/CAM-w4HPTOeMT4KP0OJK+mGgzgcTOtLRTvFZyvD0O4aH-7dxo3Q@mail.gmail.com

> 3. If users have already encountered corruption due to the fixed issue,
> what do they need to do after updating?  re-basebackup?

Replicas can be fixed by recloning, yeah.  I haven't stopped to think
how to fix the masters.  Greg, Peter, any clues there?

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
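
[Archive note, not part of any message above.] The symptom described in
this thread (rows visible to sequential scans but not to index scans, and
duplicate keys after failing over to a standby) suggests one possible way
to look for damage: read the same key column once through a sequential
scan and once through an index scan, then compare the results. What
follows is only a rough sketch under stated assumptions, not a procedure
anyone in the thread proposed. The function names and the table/column
arguments are hypothetical; the enable_* settings are real PostgreSQL
planner parameters, but they only discourage a plan type, so the plan
actually chosen should be confirmed with EXPLAIN.

```python
from collections import Counter
from typing import Hashable, Iterable, List, Tuple

# Real PostgreSQL planner settings (all present in 9.3). They discourage a
# plan type rather than forbid it, so verify the chosen plan with EXPLAIN.
FORCE_SEQSCAN = [
    "SET enable_indexscan = off",
    "SET enable_indexonlyscan = off",
    "SET enable_bitmapscan = off",
]
FORCE_INDEXSCAN = [
    "SET enable_seqscan = off",
    "SET enable_indexonlyscan = off",
    "SET enable_bitmapscan = off",
]


def scan_check_statements(table: str, key_column: str) -> Tuple[List[str], List[str]]:
    """Build two statement batches reading the same keys via different paths.

    Hypothetical helper: `table` and `key_column` are assumed to be trusted
    identifiers, and `key_column` is assumed to have a btree index. Run each
    batch in its own session so the SET commands do not leak.
    """
    query = f"SELECT {key_column} FROM {table} ORDER BY {key_column}"
    return FORCE_SEQSCAN + [query], FORCE_INDEXSCAN + [query]


def keys_missing_from_index(
    seq_keys: Iterable[Hashable], index_keys: Iterable[Hashable]
) -> List[Hashable]:
    """Keys the sequential scan returned that the index scan did not."""
    found = set(index_keys)
    return [k for k in seq_keys if k not in found]


def duplicated_keys(keys: Iterable[Hashable]) -> List[Hashable]:
    """Keys that appear more than once in a column expected to be unique."""
    counts = Counter(keys)
    return [k for k, n in counts.items() if n > 1]
```

The two key lists would come from executing each batch with whatever
client or driver is at hand. A non-empty result from either comparison
function is a reason to look closer, not proof that this particular bug
was the cause.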