Re: What could cause CREATE TEMP... "could not read block" error? - Mailing list pgsql-general

From Albe Laurenz
Subject Re: What could cause CREATE TEMP... "could not read block" error?
Date
Msg-id A737B7A37273E048B164557ADEF4A58B50FEA9E2@ntex2010i.host.magwien.gv.at
In response to Re: What could cause CREATE TEMP... "could not read block" error?  (Chris Richards <chris@infinite.io>)
List pgsql-general
Chris Richards wrote:
[had corruption in pg_depend indexes]
> I've observed the problem on other than the "pg_depend" table. A normal table (called "blocks") had
> the problem. A "REINDEX TABLE blocks" did not fix the problem, however a "VACUUM FULL" did "fix" it.

Did you dump/remove cluster/initdb/restore?
That's the only reliable way to get rid of corruption.
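As a sketch, that full cycle might look like the following (the data directory, backup path, and superuser name are placeholders I am assuming, not details from the thread):

```shell
# Sketch: dump, re-create the cluster with initdb, restore.
# Paths, user, and stop mode are assumptions -- adjust for your installation.
pg_dumpall -U postgres -f /backup/full_dump.sql        # logical dump of all databases
pg_ctl -D /var/lib/postgresql/9.3/main stop -m fast    # clean shutdown
mv /var/lib/postgresql/9.3/main /var/lib/postgresql/9.3/main.corrupt  # keep the old cluster aside
initdb -D /var/lib/postgresql/9.3/main                 # brand-new cluster
pg_ctl -D /var/lib/postgresql/9.3/main start
psql -U postgres -f /backup/full_dump.sql postgres     # restore everything
```

A logical dump and restore rebuilds every table and index from scratch, which is why it clears corruption that REINDEX or VACUUM may not.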

>> Are you running the latest minor release for your PostgreSQL version?
> 
> No. 9.3.9, whereas the latest is 9.3.10

I think no data corruption bugs have been discovered since 9.3.9.

>> Did you make sure that you have a reliable storage system?
> 
> RAID-6 SSDs on ext4

Did you test whether it is reliable?
Did you pull the plug on the server during a busy checkpoint?
Or read http://brad.livejournal.com/2116715.html
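As a crude local heuristic in the spirit of that article's diskchecker test (this script and its interpretation are my own sketch, not a substitute for an actual pull-the-plug test):

```python
import os
import time

def timed_fsync(path, n=20, block=b"x" * 512):
    """Write and fsync n small blocks; return mean seconds per fsync.

    If the mean latency is far below what a platter rotation or an SSD
    cache flush could plausibly take, the storage stack may be
    acknowledging writes before they are durable.  Heuristic only --
    a real test needs a second machine and a power cut, as in
    diskchecker.pl from the linked article.
    """
    with open(path, "wb") as f:
        start = time.monotonic()
        for _ in range(n):
            f.write(block)
            f.flush()
            os.fsync(f.fileno())
        elapsed = time.monotonic() - start
    return elapsed / n

if __name__ == "__main__":
    mean = timed_fsync("fsync_probe.bin")
    print(f"mean fsync latency: {mean * 1e6:.0f} us")
```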

But that shouldn't apply to your case if there was no crash.

>> Were there any crashes recently?
> 
> No[*]. See comments below about LVM snapshots.

> We had thought we had figured out what might have caused the problem. We were taking LVM snapshots while
> the database was running, and at some later time we reverted the snapshot (while postgres was
> stopped). Our theory was that since postgres was running, we captured its backing store
> in an indeterminate state. We changed our snapshot process to shut down postgres, take the LVM
> snapshot, and then start postgres.

If the snapshot is truly atomic, it should work - that would be just like restarting
PostgreSQL after a server crash.  PostgreSQL is designed to handle such a case well,
provided that the storage does what it promises, i.e. makes sure that data are really on disk
when an fsync() is issued.

Is the whole database, including pg_xlog, on one logical volume?
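For reference, the cold-snapshot procedure described above might look like this (volume group, LV name, snapshot size, and data directory are all assumptions on my part):

```shell
# Sketch: cold LVM snapshot of the PostgreSQL data volume.
# VG/LV names, sizes, and the data directory are placeholders.
pg_ctl -D /var/lib/postgresql/9.3/main stop -m fast    # clean shutdown first
lvcreate --size 10G --snapshot --name pgdata_snap /dev/vg0/pgdata
pg_ctl -D /var/lib/postgresql/9.3/main start
# Reverting later (again with PostgreSQL stopped):
#   lvconvert --merge /dev/vg0/pgdata_snap
```

With PostgreSQL stopped, atomicity of the snapshot no longer matters; with it running, everything (including pg_xlog) must be on the single volume being snapshotted for crash recovery to see a consistent state.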

> Unfortunately, the problem cropped up again.
> 
> We had an empty database except for our created tables, took the snapshot, and then populated the
> tables using normal INSERT/UPDATE/DELETEs while the system ran. Then we rebooted and reverted the LVM
> snapshot--back to the empty database. We then populated the tables using a series of "COPY <table>
> FROM <file>" commands. The only special things about this are that we dropped one foreign key
> constraint and that we may issue back-to-back COPYs to the same table (each COPY pulling in
> different data, as we pre-process the data before COPYing it). The amount of data from our
> COPYs is relatively small:
> 
> dirtable - 1 row
> dirpath - 1 row
> cloud - 940 rows
> blocks - 176 rows (before it died)
> 
> Indexes were not disabled during the bulk import.

It "died" because of data corruption?
And you had PostgreSQL shut down when you took the snapshot?
Was this a new database cluster (not just a new database) created
with initdb just before the test?

Now that should work even if your storage lies to you -
unless something is quite wrong with your storage, although RAID checksums
should prevent the effects of a failing disk from affecting you.

Yours,
Laurenz Albe
