Thread: pgsql 9.0.1 table corruption

pgsql 9.0.1 table corruption

From

Dan Biagini

Date:

14 April 2011, 12:11:09

I have a 9.0.1 database with two corrupted tables (one table has 20
rows, the other 140).  The tables *seem* fine for read/select
operations, but updating certain rows in the table produce error
messages:

update media set updated_at = now() at time zone 'UTC';
ERROR:  could not read block 2 in file "base/16384/16485": read only 0
of 8192 bytes

update media_status set updated_at = now() at time zone 'UTC';
2011-04-14 00:15:15 UTC ERROR:  could not read block 3 in file
"base/16384/16543": read only 0 of 8192 bytes
2011-04-14 00:15:15 UTC STATEMENT:  update media_status set updated_at
= now() at time zone 'UTC';

Examining the corrupted files in the filesystem (linux), they aren't
zero bytes:

ll base/16384/16485
-rwx------ 1 postgres postgres 16384 2011-04-07 09:43 base/16384/16485

I ran a "vacuum(FULL, VERBOSE) <table>" command and the corruption (or
at least the errors on update) has disappeared.  Is it expected that a
"vacuum(FULL)" command would/could fix table corruption?  Does that
providing any clues as to what may have happened?

Is there any way to determine how/when this corruption may have occurred?

I suspect that it may have occurred during a filesystem level backup
(ie pg_start_backup(), tar -czf..., pg_stop_backup()), as I performed
a backup and moved the database to a different system.  After
restoring the files and starting postgres I began getting these
errors.  I have tried restoring multiple times with the same tar
archive with the same results (on different systems).

Thanks,
Dan

Re: pgsql 9.0.1 table corruption

From

Alan Hodgson

Date:

14 April 2011, 13:08:43

On April 14, 2011 08:10:47 am Dan Biagini wrote:

> I suspect that it may have occurred during a filesystem level backup
> (ie pg_start_backup(), tar -czf..., pg_stop_backup()), as I performed
> a backup and moved the database to a different system.  After
> restoring the files and starting postgres I began getting these
> errors.  I have tried restoring multiple times with the same tar
> archive with the same results (on different systems).

Did you perform a PITR restore using that tar as a base backup? Do any errors
occur?  The tar file alone will not be an intact backup, as I'm sure you're
aware.

http://www.postgresql.org/docs/9.0/static/continuous-archiving.html#BACKUP-
PITR-RECOVERY

Re: pgsql 9.0.1 table corruption

From

Tom Lane

Date:

15 April 2011, 13:51:07

Dan Biagini <dbiagini@gmail.com> writes:
> I have a 9.0.1 database with two corrupted tables (one table has 20
> rows, the other 140).  The tables *seem* fine for read/select
> operations, but updating certain rows in the table produce error
> messages:

> update media set updated_at = now() at time zone 'UTC';
> ERROR:  could not read block 2 in file "base/16384/16485": read only 0
> of 8192 bytes

The UPDATE scan itself wouldn't read off the end of the table.
I speculate that what is happening here is that you have a unique
index, so it's trying to verify that the uniqueness constraint isn't
violated, which means fetching via the index, and there are corrupt
TID pointers in the index (pointing to blocks that don't exist).

> I ran a "vacuum(FULL, VERBOSE) <table>" command and the corruption (or
> at least the errors on update) has disappeared.  Is it expected that a
> "vacuum(FULL)" command would/could fix table corruption?

In 9.0, vacuum full would rebuild the index(es), so that would fix it
according to this theory.

> I suspect that it may have occurred during a filesystem level backup
> (ie pg_start_backup(), tar -czf..., pg_stop_backup()), as I performed
> a backup and moved the database to a different system.  After
> restoring the files and starting postgres I began getting these
> errors.  I have tried restoring multiple times with the same tar
> archive with the same results (on different systems).

If you simply unpacked the tar archive and started a postmaster on that,
you'd be pretty much guaranteed to get a corrupt database.  The tar
archive is *not* a valid snapshot by itself --- you have to replay
whatever WAL was generated during the archiving sequence in order to get
to a consistent database state.  So the whole thing sounds like a pretty
foreseeable consequence of that.  You've probably got some other issues
in that database :-(.  VACUUM FULL would be enough to rebuild indexes,
but it cannot fix inconsistencies in the heap data.

            regards, tom lane

Re: pgsql 9.0.1 table corruption

From

Benjamin Smith

Date:

15 April 2011, 21:56:22

On Friday, April 15, 2011 09:50:57 AM Tom Lane wrote:

> If you simply unpacked the tar archive and started a postmaster on that,

> you'd be pretty much guaranteed to get a corrupt database. The tar

> archive is not a valid snapshot by itself --- you have to replay

> whatever WAL was generated during the archiving sequence in order to get

> to a consistent database state.

I have, more than once, "moved" a PG instance from one machine to another with the following sequence, without apparent issue. is there anything I'm missing and/or need to be concerned with?

1) service postgresql stop;

2) rsync -vaz /var/lib/pgsql root@newserver:/var/lib/pgsql;

3) ssh root@newserver;

4) (edit postgresql.conf, set IP addresses, memory, etc)

5) service postgresql start;

This is a "done at 10 PM on Friday night" kind of process.

-Ben
--
This message has been scanned for viruses and
dangerous content by MailScanner, and is
believed to be clean.

Re: pgsql 9.0.1 table corruption

From

Tom Lane

Date:

15 April 2011, 22:27:59

Benjamin Smith <lists@benjamindsmith.com> writes:
> On Friday, April 15, 2011 09:50:57 AM Tom Lane wrote:
>> If you simply unpacked the tar archive and started a postmaster on that,
>> you'd be pretty much guaranteed to get a corrupt database.  The tar
>> archive is not a valid snapshot by itself --- you have to replay
>> whatever WAL was generated during the archiving sequence in order to get
>> to a consistent database state.

> I have, more than once, "moved" a PG instance from one machine to another with
> the following sequence, without apparent issue. is there anything I'm missing
> and/or need to be concerned with?

> 1) service postgresql stop;
> 2) rsync -vaz /var/lib/pgsql root@newserver:/var/lib/pgsql;
> 3) ssh root@newserver;
> 4) (edit postgresql.conf, set IP addresses, memory, etc)
> 5) service postgresql start;

That way is fine --- what you can't do is take a tarball while the
server is running, unless you do the full nine yards with WAL replay.

            regards, tom lane