Thread: BUG #7648: Momentary index corruption while in hot standby

BUG #7648: Momentary index corruption while in hot standby

From
daniel@heroku.com
Date:
The following bug has been logged on the website:

Bug reference:      7648
Logged by:          Daniel Farina
Email address:      daniel@heroku.com
PostgreSQL version: 9.0.9
Operating system:   Ubuntu 10.04
Description:        =


At more or less one point in time exactly (the same second, at minimum), a
couple of errors were raised on a hot standby while performing query access
that would nominally suggest corruption were raised:

PGError: ERROR: could not read block 556642 in file "base/16385/2904143.4":
read only 0 of 8192 =


Immediately thereafter, no occurrences of the error resurfaced and any
attempts to reproduce the issue with an identical query were met with
failure.

On investigation, this relfile is a fairly common beast: an integer btree
index being used for row identification (e.g. serial).

This is reading from a 9.0.8 Postgres.

Re: BUG #7648: Momentary index corruption while in hot standby

From
Jeff Davis
Date:
On Fri, 2012-11-09 at 22:49 +0000, daniel@heroku.com wrote:
> PGError: ERROR: could not read block 556642 in file "base/16385/2904143.4":
> read only 0 of 8192

Does that seem wildly off to you, or a little off? Do you think that
block may have existed in the past, but was truncated by a VACUUM or
something?

Just to clarify, that file belongs to the index in question, right? It's
not the heap pointed to by the index?

> Immediately thereafter, no occurrences of the error resurfaced and any
> attempts to reproduce the issue with an identical query were met with
> failure.

Or a lack of failure, I presume.

> This is reading from a 9.0.8 Postgres.

Any indication whether it's present on other versions or does it appear
to be isolated to 9.0.X?

Regards,
    Jeff Davis

Re: BUG #7648: Momentary index corruption while in hot standby

From
Daniel Farina
Date:
On Fri, Nov 9, 2012 at 3:35 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> On Fri, 2012-11-09 at 22:49 +0000, daniel@heroku.com wrote:
>> PGError: ERROR: could not read block 556642 in file "base/16385/2904143.4":
>> read only 0 of 8192
>
> Does that seem wildly off to you, or a little off? Do you think that
> block may have existed in the past, but was truncated by a VACUUM or
> something?

A little off.  I think that Tom Lane posted a pretty good mechanism in
-hackers: http://archives.postgresql.org/pgsql-hackers/2012-11/msg00409.php

> Just to clarify, that file belongs to the index in question, right? It's
> not the heap pointed to by the index?

Yes.

>> Immediately thereafter, no occurrences of the error resurfaced and any
>> attempts to reproduce the issue with an identical query were met with
>> failure.
>
> Or a lack of failure, I presume.

Yup.

>> This is reading from a 9.0.8 Postgres.
>
> Any indication whether it's present on other versions or does it appear
> to be isolated to 9.0.X?

This is the only known occurrence to me, ever, but given it's
incredibly ephemeral nature probably glossed over most of the time, I
can't say "it's the only time it's ever happened".

--
fdr

Re: BUG #7648: Momentary index corruption while in hot standby

From
Andres Freund
Date:
On 2012-11-09 15:57:06 -0800, Daniel Farina wrote:
> This is the only known occurrence to me, ever, but given it's
> incredibly ephemeral nature probably glossed over most of the time, I
> can't say "it's the only time it's ever happened".

I wish we had some way to easily discern such "something is wrong" error
from other ERRORs that are expected during normal operation..

Greetings,

Andres Freund

Re: BUG #7648: Momentary index corruption while in hot standby

From
Daniel Farina
Date:
On Fri, Nov 9, 2012 at 4:16 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2012-11-09 15:57:06 -0800, Daniel Farina wrote:
>> This is the only known occurrence to me, ever, but given it's
>> incredibly ephemeral nature probably glossed over most of the time, I
>> can't say "it's the only time it's ever happened".
>
> I wish we had some way to easily discern such "something is wrong" error
> from other ERRORs that are expected during normal operation..

Me too. Database clients finding these unambiguously platform-level
problems and being relied upon to report them to receive treatment is
a long-standing embarrassment to me.  However, I've been way too
swamped to even start thinking of how one would disentangle error
reporting suitable for physical issues from logical issues.  Clearly
it is *possible* (most error sites are fairly clear being in one
category or the other), but the carefully considered upheaval to the
logging system is beyond my mind's reach for now.

It's a weakness we've kicked around the office here a few times with
much regret that nobody has the necessary remaining attention to
attack it.

--
fdr

Re: BUG #7648: Momentary index corruption while in hot standby

From
Peter Geoghegan
Date:
On 10 November 2012 00:29, Daniel Farina <daniel@heroku.com> wrote:
> Me too. Database clients finding these unambiguously platform-level
> problems and being relied upon to report them to receive treatment is
> a long-standing embarrassment to me.  However, I've been way too
> swamped to even start thinking of how one would disentangle error
> reporting suitable for physical issues from logical issues.

I complained about this a few months ago (and a few months before
that), and the upshot was that we kicked around a few ideas and were
able to outline a useful API [1]. The idea here was to derive what I
called magnitude from SQLSTATE. In other words, we'd represent how
routine or non-routine a particular error message was (the "wake me up
in the middle of the night" factor). Severity levels don't and cannot
capture this, since for example a FATAL error occurs in the event of
failed authentication, whereas ERRORs (technically a lesser severity)
may occur in far more serious situations that a Postgres DBA can
reasonably hope to never see, with problems that indicate data
corruption, for example.

[1] http://archives.postgresql.org/message-id/CAEYLb_XdtyJE6WtUy4TGdjUQ6eUtjJp0cTfLaDP9qwp8gOtTdg@mail.gmail.com

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services