Thread: pg_dump or hardware?

pg_dump or hardware?

From
andrea suisani
Date:
Hi all,

short summary:

platform: i386 SMP (dual PIII)
os: linux 2.6.8.1
vendor: debian (3.1, stable)
pgsql ver: 7.4.7 (deb)
disk: tech. SCSI vendor. IBM model. DDYS-T36950N rev. S96H
controller: adaptec aic-7892a

description:

we're experiencing a weird problem
trying to get a dump of our db for backup purposes,
the executed command is:

/usr/bin/pg_dump -U postgres -h 6pali elenco | /usr/bin/bzip2 > elenco_test.bz2

the output:

pg_dump: ERROR:  could not open relation with OID 201327173
pg_dump: SQL command to dump the contents of table "nominativi" failed: PQendcopy() failed.
pg_dump: Error message from server: ERROR:  could not open relation with OID 201327173
pg_dump: The command was: COPY public.nominativi (nome_cogno, indirizzo, cap, citta, prov,
prefisso, telefono1, telefono2, note, idpersona, estrazione, num_estra, occupato,
cod_prov, cod_com, cod_reg, capoluo, rand) TO stdout;


so it seems that we've got some problems with the "nominativi" table
(a 20 million-row table), in fact the following command also fails:

pg_dump -t nominativi -U postgres -h 6pali elenco | /usr/bin/bzip2 > nominativi.bz2

with the same err msg as before. Before the erros occurs we're are able to
get a partial backup, see:

#> ls -l nominativi.bz2
-rw-r--r--    1 sickpig  users        2.5M apr 19 12:35 nominativi.bz2

#> wc -l nominativi
  145904 nominativi

We're trying to understand whether this is due to data corruption or
hardware failure. We run long self-tests on our SCSI disk through
smartmontools on a regular basis. see attached file for "smartctl -a /dev/sda"
output. All suggestions are welcome.


Regards,


Andrea






smartctl version 5.32 Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Device: IBM      DDYS-T36950N     Version: S96H
Serial number:         5FFL3272
Device type: disk
Transport protocol: Fibre channel (FCP-2)
Local Time is: Wed Apr 19 13:14:01 2006 CEST
Device supports SMART and is Enabled
Temperature Warning Disabled or Not Supported
SMART Health Status: OK

Current Drive Temperature:     41 C
Drive Trip Temperature:        85 C
Manufactured in week 06 of year 2001
Current start stop count:      147 times
Recommended maximum start stop count:  10000 times

Error counter log:
          Errors Corrected    Total      Total   Correction     Gigabytes    Total
              delay:       [rereads/    errors   algorithm      processed    uncorrected
            minor | major  rewrites]  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         5          5       6628.657           0
write:         0        0         0         0          0       4231.306           0

Non-medium error count:        0

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Completed                   - 22800                   - [-   -    -]
# 2  Background long   Completed                   - 22631                   - [-   -    -]
# 3  Background long   Completed                   - 22463                   - [-   -    -]
# 4  Background long   Completed                   - 22294                   - [-   -    -]
# 5  Background long   Completed                   - 22126                   - [-   -    -]
# 6  Background long   Completed                   - 21958                   - [-   -    -]
# 7  Background long   Completed                   - 21789                   - [-   -    -]
# 8  Background long   Completed                   - 21621                   - [-   -    -]
# 9  Background long   Completed                   - 21452                   - [-   -    -]
#10  Background long   Completed                   - 21284                   - [-   -    -]
#11  Background long   Completed                   - 21115                   - [-   -    -]
#12  Background long   Completed                   - 20947                   - [-   -    -]
#13  Background long   Completed                   - 20801                   - [-   -    -]
#14  Background long   Completed                   - 20633                   - [-   -    -]
#15  Background long   Completed                   - 20464                   - [-   -    -]
#16  Background long   Completed                   - 20296                   - [-   -    -]
#17  Background long   Completed                   - 20127                   - [-   -    -]
#18  Background long   Completed                   - 19959                   - [-   -    -]
#19  Background long   Completed                   - 19790                   - [-   -    -]
#20  Background long   Completed                   - 19622                   - [-   -    -]

Long (extended) Self Test duration: 1340 seconds [22.3 minutes]

Re: pg_dump or hardware?

From
Tom Lane
Date:
andrea suisani <andrea.suisani@opinioni.net> writes:
> pg_dump: ERROR:  could not open relation with OID 201327173
> pg_dump: SQL command to dump the contents of table "nominativi" failed: PQendcopy() failed.
> pg_dump: Error message from server: ERROR:  could not open relation with OID 201327173

Hmm ... what do you get from

select oid from pg_class where relname = 'nominativi';
select relname from pg_class where oid = 201327173;

If the first returns 201327173 while the second doesn't return anything,
then I'd wonder about corruption of pg_class's OID index.  REINDEXing it
might help.  I don't remember whether 7.4 makes you use a standalone
backend to reindex system catalogs --- see its REINDEX man page for
details.

            regards, tom lane

Re: pg_dump or hardware?

From
andrea suisani
Date:
Thanks for the quick reply

Tom Lane wrote:
> andrea suisani <andrea.suisani@opinioni.net> writes:

[cut]

> Hmm ... what do you get from
>
> select oid from pg_class where relname = 'nominativi';
   oid
--------
  561644
(1 row)

afaics it seems weird.... does this mean that another postgresql
object screw up? this is the "\d nominativi" output:

          Table "public.nominativi"
    Column   |       Type       | Modifiers
------------+------------------+-----------
  nome_cogno | character(70)    |
  indirizzo  | character(40)    |
  cap        | character(5)     |
  citta      | character(35)    |
  prov       | character(2)     |
  prefisso   | character(4)     |
  telefono1  | character(13)    |
  telefono2  | character(13)    |
  note       | character(100)   |
  idpersona  | integer          |
  estrazione | date             |
  num_estra  | integer          |
  occupato   | boolean          |
  cod_prov   | integer          |
  cod_com    | integer          |
  cod_reg    | integer          |
  capoluo    | integer          |
  rand       | double precision |
Indexes:
     "citta1_idx" btree (prov, citta) WHERE (((prov = 'BL'::bpchar) OR (prov =
'PD'::bpchar) OR (prov = 'RO'::bpchar) OR (prov = 'VE'::bpchar) OR (prov = 'TV'::bpchar)
OR (prov = 'VR'::bpchar)) AND (cod_com IS NULL))
     "cod_com_ndx1" btree (cod_com)
     "codprov_capo_rand1" btree (cod_prov, capoluo, rand)
     "pre_tel_index1" btree (prefisso, telefono1)


and none of nominativi's indexes has an oid like 201327173

select oid
from pg_class
where relname in ('citta1_idx','cod_com_ndx1','codprov_capo_rand1','pre_tel_index1');
    oid
----------
  46788374
  40916657
  40916656
  40916658
(4 rows)


> select relname from pg_class where oid = 201327173;
  relname
---------
(0 rows)

while if I "ask" for oid  561644 this is what I get:

select relname from pg_class where oid = 561644;
   relname
------------
  nominativi
(1 row)


> If the first returns 201327173 while the second doesn't return anything,
> then I'd wonder about corruption of pg_class's OID index.  REINDEXing it
> might help.
> I don't remember whether 7.4 makes you use a standalone
> backend to reindex system catalogs --- see its REINDEX man page for
> details.

from what I can see from REINDEX man page, I can reindex shared system catalogs
in stand-alone mode

[cut]

Regards,


Andrea