Thread: BUG #7648: Momentary index corruption while in hot standby
The following bug has been logged on the website: Bug reference: 7648 Logged by: Daniel Farina Email address: daniel@heroku.com PostgreSQL version: 9.0.9 Operating system: Ubuntu 10.04 Description: = At more or less one point in time exactly (the same second, at minimum), a couple of errors were raised on a hot standby while performing query access that would nominally suggest corruption were raised: PGError: ERROR: could not read block 556642 in file "base/16385/2904143.4": read only 0 of 8192 = Immediately thereafter, no occurrences of the error resurfaced and any attempts to reproduce the issue with an identical query were met with failure. On investigation, this relfile is a fairly common beast: an integer btree index being used for row identification (e.g. serial). This is reading from a 9.0.8 Postgres.
On Fri, 2012-11-09 at 22:49 +0000, daniel@heroku.com wrote: > PGError: ERROR: could not read block 556642 in file "base/16385/2904143.4": > read only 0 of 8192 Does that seem wildly off to you, or a little off? Do you think that block may have existed in the past, but was truncated by a VACUUM or something? Just to clarify, that file belongs to the index in question, right? It's not the heap pointed to by the index? > Immediately thereafter, no occurrences of the error resurfaced and any > attempts to reproduce the issue with an identical query were met with > failure. Or a lack of failure, I presume. > This is reading from a 9.0.8 Postgres. Any indication whether it's present on other versions or does it appear to be isolated to 9.0.X? Regards, Jeff Davis
On Fri, Nov 9, 2012 at 3:35 PM, Jeff Davis <pgsql@j-davis.com> wrote: > On Fri, 2012-11-09 at 22:49 +0000, daniel@heroku.com wrote: >> PGError: ERROR: could not read block 556642 in file "base/16385/2904143.4": >> read only 0 of 8192 > > Does that seem wildly off to you, or a little off? Do you think that > block may have existed in the past, but was truncated by a VACUUM or > something? A little off. I think that Tom Lane posted a pretty good mechanism in -hackers: http://archives.postgresql.org/pgsql-hackers/2012-11/msg00409.php > Just to clarify, that file belongs to the index in question, right? It's > not the heap pointed to by the index? Yes. >> Immediately thereafter, no occurrences of the error resurfaced and any >> attempts to reproduce the issue with an identical query were met with >> failure. > > Or a lack of failure, I presume. Yup. >> This is reading from a 9.0.8 Postgres. > > Any indication whether it's present on other versions or does it appear > to be isolated to 9.0.X? This is the only known occurrence to me, ever, but given it's incredibly ephemeral nature probably glossed over most of the time, I can't say "it's the only time it's ever happened". -- fdr
On 2012-11-09 15:57:06 -0800, Daniel Farina wrote: > This is the only known occurrence to me, ever, but given it's > incredibly ephemeral nature probably glossed over most of the time, I > can't say "it's the only time it's ever happened". I wish we had some way to easily discern such "something is wrong" error from other ERRORs that are expected during normal operation.. Greetings, Andres Freund
On Fri, Nov 9, 2012 at 4:16 PM, Andres Freund <andres@anarazel.de> wrote: > On 2012-11-09 15:57:06 -0800, Daniel Farina wrote: >> This is the only known occurrence to me, ever, but given it's >> incredibly ephemeral nature probably glossed over most of the time, I >> can't say "it's the only time it's ever happened". > > I wish we had some way to easily discern such "something is wrong" error > from other ERRORs that are expected during normal operation.. Me too. Database clients finding these unambiguously platform-level problems and being relied upon to report them to receive treatment is a long-standing embarrassment to me. However, I've been way too swamped to even start thinking of how one would disentangle error reporting suitable for physical issues from logical issues. Clearly it is *possible* (most error sites are fairly clear being in one category or the other), but the carefully considered upheaval to the logging system is beyond my mind's reach for now. It's a weakness we've kicked around the office here a few times with much regret that nobody has the necessary remaining attention to attack it. -- fdr
On 10 November 2012 00:29, Daniel Farina <daniel@heroku.com> wrote: > Me too. Database clients finding these unambiguously platform-level > problems and being relied upon to report them to receive treatment is > a long-standing embarrassment to me. However, I've been way too > swamped to even start thinking of how one would disentangle error > reporting suitable for physical issues from logical issues. I complained about this a few months ago (and a few months before that), and the upshot was that we kicked around a few ideas and were able to outline a useful API [1]. The idea here was to derive what I called magnitude from SQLSTATE. In other words, we'd represent how routine or non-routine a particular error message was (the "wake me up in the middle of the night" factor). Severity levels don't and cannot capture this, since for example a FATAL error occurs in the event of failed authentication, whereas ERRORs (technically a lesser severity) may occur in far more serious situations that a Postgres DBA can reasonably hope to never see, with problems that indicate data corruption, for example. [1] http://archives.postgresql.org/message-id/CAEYLb_XdtyJE6WtUy4TGdjUQ6eUtjJp0cTfLaDP9qwp8gOtTdg@mail.gmail.com -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services