Thread: ERROR: could not read block

ERROR: could not read block

From

"Kevin Grittner"

Date:

14 November 2005, 19:21:00

A programmer ran a query to fix some data against two "identical"
databases -- one on Linux and one on Windows.  They are both 8.1.0,
running on dual hyperthreaded Xeons, with data on RAID5.  The Linux
update went fine.  The Windows attempt give this:

dtr=> UPDATE
dtr->     "DbTranRepository"
dtr-> SET "userId" = UPPER("userId")
dtr-> WHERE (
dtr(>     ("userId" <> UPPER("userId")) AND
dtr(>     ("timestampValue" BETWEEN '2005-10-28' AND '2005-11-15'));
ERROR:  could not read block 649847 of relation 1663/16385/16483:
Invalid argument

            Table "public.DbTranRepository"
      Column      |          Type          | Modifiers
------------------+------------------------+-----------
 countyNo         | "CountyNoT"            | not null
 tranImageSeqNo   | "TranImageSeqNoT"      | not null
 timestampValue   | "TimestampT"           | not null
 transactionImage | "ImageT"               |
 status           | character(1)           | not null
 queryName        | "QueryNameT"           |
 runDuration      | numeric(15,0)          |
 userId           | "UserIdT"              |
 functionalArea   | character varying(50)  |
 sourceRef        | character varying(255) |
 url              | "URLT"                 |
 tranImageSize    | numeric(15,0)          |
Indexes:
    "DbTranRepositoryPK" PRIMARY KEY, btree ("countyNo",
"tranImageSeqNo") CLUSTER
    "DbTranRepository_timestamp" btree ("countyNo", "timestampValue")
    "DbTranRepository_userId" btree ("countyNo", "userId",
"timestampValue")


fsync is on.  We have not had any crashes that I know of (one key player
here is currently unavailable, so I may get a surprise when I can talk
to him).  The table being updated has about 23.3 million rows, each of
which has a bytea column which should normally be toasted.  The database
was in use during this fix query.  This table is normally insert-only.
The fix query was running for about an hour, during which time about
45,000 rows were inserted.

This looks to me like a database corruption.  We can recover from the
Linux machine, but we're interested in the cause.  (Management is
divided on using Linux versus Windows as our db server platform.)  Any
suggestions on causes or what to check?

-Kevin

Re: ERROR: could not read block

From

Scott Marlowe

Date:

14 November 2005, 19:43:48

On Mon, 2005-11-14 at 17:20, Kevin Grittner wrote:
> A programmer ran a query to fix some data against two "identical"
> databases -- one on Linux and one on Windows.  They are both 8.1.0,
> running on dual hyperthreaded Xeons, with data on RAID5.  The Linux
> update went fine.  The Windows attempt give this:
>
> dtr=> UPDATE
> dtr->     "DbTranRepository"
> dtr-> SET "userId" = UPPER("userId")
> dtr-> WHERE (
> dtr(>     ("userId" <> UPPER("userId")) AND
> dtr(>     ("timestampValue" BETWEEN '2005-10-28' AND '2005-11-15'));
> ERROR:  could not read block 649847 of relation 1663/16385/16483:
> Invalid argument

When a block is unreadable, this means that the OS is experiencing a
read error from the hard drive.

This is not necessarily related to the OS you are running so much as the
hardware you are running on top of.

If you were running on top of a RAID 1+0 or RAID 5 array, such an error
would likely never have happened, since it would have been detected by
the controller, and either the bad block would be mapped out or the
drive would be kicked out of the array and you'd get a system alert
telling you you had a bad drive in your array.

All Databases expect to run on "perfect" hardware.  If the hardware gets
a read error, the database is NOT going to know what to do to fix it.

So, right now, it's not Linux 1, Windows 0, it's The box you're running
Linux on 1, the box you're running Windows on 0.

You've got a drive with a bad block.

Are you running on quality hardware (ECC memory, Server class SCSI
drives, battery backed cache hardware RAID array, etc...) or just
whatever was laying about unused.  If you're not running your database
on quality hardware, you can't expect reliable results from your
testing, because what you're going to see are the hardware issues
showing up, not the database issues.

That said.  PostgreSQL has been running on unix or some flavor of it
since its inception.  It's been running on Windows in native mode for
about a year now.  I'd say Liunx is going to be a better choice for
production for a while.

Re: ERROR: could not read block

From

Joshua Marsh

Date:

14 November 2005, 20:02:27

On 11/14/05, Scott Marlowe <smarlowe@g2switchworks.com> wrote:

If you were running on top of a RAID 1+0 or RAID 5 array, such an error
would likely never have happened, since it would have been detected by
the controller, and either the bad block would be mapped out or the
drive would be kicked out of the array and you'd get a system alert
telling you you had a bad drive in your array.

Except that he jus said he was running on a RAID5. True this problem would be unlikely on a hardware controlled RAID5 configuration, but not impossible.

Does the log file tell you anymore information?

Have you tried to rerun the query? If so, did you get similar results?

-Josh

Re: ERROR: could not read block

From

"Kevin Grittner"

Date:

14 November 2005, 20:19:35

Both machines are IBM xSeries 346 model 884042U with 6 drives
in a RAID 5 array through an IBM battery backed controller.  We
had a couple of these lying around after replacing them with better,
but they have been pretty stable workhorses for us.

I'm checking on whether the RAM is ECC -- the techs available at
the moment aren't sure.  The current machines are "transitional",
and it may not be too late to set the permanent servers up with ECC
memory.  Is it something I should fight for?

For specs on the base machines, before we dressed them up:

http://www-132.ibm.com/webapp/wcs/stores/servlet/ProductDisplay?productId=8741193&langId=-1

-Kevin

>>> Scott Marlowe <smarlowe@g2switchworks.com>  >>>
On Mon, 2005-11-14 at 17:20, Kevin Grittner wrote:

> running on dual hyperthreaded Xeons, with data on RAID5.

> ERROR:  could not read block 649847 of relation 1663/16385/16483:
> Invalid argument

If you were running on top of a RAID 1+0 or RAID 5 array, such an error
would likely never have happened, since it would have been detected by
the controller, and either the bad block would be mapped out or the
drive would be kicked out of the array and you'd get a system alert
telling you you had a bad drive in your array.

Are you running on quality hardware (ECC memory, Server class SCSI
drives, battery backed cache hardware RAID array, etc...) or just
whatever was laying about unused.

Re: ERROR: could not read block

From

Tom Lane

Date:

14 November 2005, 20:43:58

Scott Marlowe <smarlowe@g2switchworks.com> writes:
> On Mon, 2005-11-14 at 17:20, Kevin Grittner wrote:
>> ERROR:  could not read block 649847 of relation 1663/16385/16483:
>> Invalid argument

> When a block is unreadable, this means that the OS is experiencing a
> read error from the hard drive.

I'd believe that explanation if the errno were EIO (I/O error), but
EINVAL isn't really a very sane error number to return for a hardware
failure.  It'd be useful to find out what Windows error code is actually
being returned by the operating system.  I seem to recall that we've
noted some poorly chosen Windows->errno translations before, so maybe
this is another one ... but all we can really say at the moment is
"that shouldn't be happening".

            regards, tom lane

Re: ERROR: could not read block

From

Andrew Sullivan

Date:

15 November 2005, 09:25:05

On Mon, Nov 14, 2005 at 06:19:16PM -0600, Kevin Grittner wrote:
> the moment aren't sure.  The current machines are "transitional",
> and it may not be too late to set the permanent servers up with ECC
> memory.  Is it something I should fight for?

Yes.  Always.

A

--
Andrew Sullivan  | ajs@crankycanuck.ca
It is above all style through which power defers to reason.
        --J. Robert Oppenheimer