Perl script failure => Postgres 7.1.2 database corruption - Mailing list pgsql-bugs

From Frank McKenney
Subject Perl script failure => Postgres 7.1.2 database corruption
Date
Msg-id 200111090409.XAA00267@maynard.mail.mindspring.net
Whole thread Raw
Responses Re: Perl script failure => Postgres 7.1.2 database corruption
List pgsql-bugs
Severity
--------

Major annoyance (having to rebuild a database)...  unless there is a
simple workaround I was too ignorant to apply (;-).

Problem
-------

A few days ago, for no obvious reason, a Perl script used to update
the database portion of a 'web site suddenly stopped working.
Several hours later it became apparent that "idle" Postgres
processes were queuing up on the Postgres server machine.  Still
later it became obvious that at least one table in the database was
damaged to the point that even pg_dump would not dump it properly.

The 'web site uses Perl CGI (Apache mod_perl) and DBD::Pg to access
Postgres database on a separate server.


Symptoms
--------

 - Backed-up hung ("idle") Postgres processes.
 - Undeletable table in database (apparently damaged).
 - pg_dump failure.
 - psql errors accessing table.
 - Perl DBI errors reported in Apache/mod_perl error log.

Background
----------

The Perl script involved here updates the Postgres database from
Win32 .DBF files using DBD:Pg and DBD::Xbase.  The script has run
successfully for several weeks, and in fact ran successfully using
the same set of DBF files last Friday.

In order to minimize effect of the update on users who may be using
the database, the script performs the entire database update as a
single transaction. In Perl::DBI terms, it issues a

   $obj = DBI->connect($db_datasource,
                       $db_username,
                       $db_password,
                      {AutoCommit => 0});

and when the updates have been added without error (which usually
happens) it issues a

   $obj->commit();
   $obj->disconnect();
   $obj = undef;

This time the update died partway through:

    + Marking old records for deletion.
        (At this point the script was in the process of issuing
          DELETE FROM TABLE xxxxx  statements)
    DBD::Pg::db do failed: pqReadData() -- backend closed the
    channel unexpectedly. at ./update-foundation-db.pl line 244.

    Could not purge old records from summary, stopping at
    ./update-foundation-db.pl line 244.

What triggered the first error is still not completely clear, but it
may have been the result of the account exceeding its disk space
quota.  However, even after some 40Mb of files were deleted and the
quota was increased by 50Mb, the script continued to fail at the
same point.

The only mildly bright spot in all this is that none of this
appeared to affect 'web site users.  SQL queries from site users
continued to yield results throughout (well, up to the point where I
deleted the database (;-)).

Other messages reported by the script included:

    DBD::Pg::db do failed: ERROR:  heap_delete: (am)invalid tid at
        ./update-foundation-db.pl line 244.


    DBD::Pg::db do failed: ERROR:  simple_heap_delete: tuple
        concurrently updated at  ./update-foundation-db.pl line 244.


    DBD::Pg::db do failed: ERROR:  Relation 1632370768 does not
        exist at ./update-foundation-db.pl line 244.


    DBD::Pg::db do failed: ERROR:  Relation 419450449 does not exist
        at ./update-foundation-db.pl line 244.


At the same time, "idle" Postgres processes were backing up on the
Postgres server.  These only cleared when the Postgres server
process was restarted.
-------
  10626 ?        S      0:00 postgres: websiteadmin websiteadmin
      206.163.128.1 idle in transaction
  (and 8 more like it at one point)
-------

Postgres access from the 'web site logged errors to the Apache error
log for the site's virtual host:
-----------
    NOTICE:  Message from PostgreSQL backend:
    The Postmaster has informed me that some other backend    died
        abnormally and possibly corrupted shared memory.
    I have rolled back the current transaction and am    going to
        terminate your database system connection and exit.
    Please reconnect to the database system and repeat your query.
    [Tue Nov  6 20:52:21 2001] null: Database handle destroyed
        without explicit disconnect at
        /usr/lib/perl5/site_perl/Apache/DBI.pm line 119.
-----------

A number of the 'null:'  errors were reported.  A similar NOTICE:
preceeded the following errors as well:

-----------
    [Tue Nov  6 20:52:53 2001] [error] FATAL 2:  XLogFlush: request
        is not satisfied at
        /home1/g/websiteadmin/site/modules//StandardUtil.pm line
        48.

    [Tue Nov  6 20:52:54 2001] null: DBD::Pg::db rollback failed:
        rollback failed at /usr/lib/perl5/site_perl/Apache/DBI.pm
        line 153.
-----------


I attempted to dump the database with pg_dump, and failed miserably.
The results convinced me that the database was _seriously_ mangled.
-----------
  pg_dump -c -Fc -h postgres.sitehost.com -u websiteadmin >db.pgd
    Username: websiteadmin
    Password:
    ERROR:  Relation 419450449 does not exist
    PQendcopy: resetting connection
    SQL query to dump the contents of Table 'summary' did not
    execute correctly.  After we read all the table contents from
    the backend, PQendcopy() failed.  Explanation from backend:
    'ERROR: Relation 419450449 does not exist '.
    The query was: 'COPY "summary" TO stdout;'.

  psql -h postgres.kattare.com -U websiteadmin -W
    Password:
    Welcome to psql, the PostgreSQL interactive terminal.
    --snip--
    websiteadmin=> vacuum verbose summary;
    NOTICE:  --Relation summary--
    NOTICE:  Pages 728: Changed 34, reaped 604, Empty 0, New 0; Tup
    1032: Vac 5094, Keep/VTL 0/0, Crash 0, UnUsed 0, MinLen 96,
    MaxLen 2030; Re-using:  Free/Avail.  Space 4913484/4913484;
    EndEmpty/Avail.  Pages 0/604.  CPU 0.03s/0.00u sec.
    NOTICE:  Index summary_pkey: Pages 31; Tuples 1032: Deleted
    5094. CPU 0.00s/0.06u sec.
    NOTICE:  Rel summary: Pages: 728 --> 118; Tuple(s) moved: 1032.
    CPU 0.11s/0.16u sec.
    NOTICE:  Index summary_pkey: Pages 31; Tuples 1032: Deleted
    1032. CPU 0.00s/0.01u sec.
    NOTICE:  --Relation pg_toast_20037--
    NOTICE:  Rel pg_toast_20037: TID 1444/14: OID IS INVALID.
    TUPGONE 1.
    NOTICE:  Rel pg_toast_20037: TID 1444/19: OID IS INVALID.
    TUPGONE 1.
    -- Many, MANY similar lines snipped --
    NOTICE:  Rel pg_toast_20037: TID 1444/2008: OID IS INVALID.
    TUPGONE 1.
    NOTICE:  Rel pg_toast_20037: TID 1444/2016: OID IS INVALID.
    TUPGONE 1.
    pqReadData() -- backend closed the channel unexpectedly.  This
    probably means the backend terminated abnormally before or while
    processing the request.
    The connection to the server was lost.
    Attempting reset:  Failed.  !> -> \q

Workaround
----------

Due to time constraints, and because I had the data available, I
deleted the database with 'dropdb -h', re-created it using
'createdb', and re-populated it using the now-functioning Perl
script.  Another choice might have been to restore the database from
a backup, assuming that the database corruption actually occurred in
conjunction with the initial script failure and not some time
earlier.

Environment
-----------

 DB Server: Postgres 7.1.2, Slackware Linux 8.?
            SELECT version() reports PostgreSQL 7.1.2 on
            i686-pc-linux-gnu, compiled by GCC egcs-2.91.66

 DB Client: Apache 1.3.20, Perl 5.6.1, mod_perl, Slackware 8.1

Questions
---------

 1) Are there circumstances under which a "space exceeded" error on
    a client machine _can_ damage a database on the Postgres server
    machine?

 2) The Postgres error messages all indicate that the transaction
    was going to be "rolled back", but to all indications (including
    the hung processes (;-)) it was not. Is this a bug? Or am I
    misreading the message?

 3) What other things could we have tried to recover this situation?

    Could I have ignored Postgres' claim that it was rolling back
    the transaction, re-connected to the database (with psql, say),
    and issued a ROLLBACK myself?

 4) Where would I find information that would help me accurately
    estimate the client's disk space requirements for SQL database
    processing on a Postgres server?



Frank McKenney


Frank McKenney, McKenney Associates
Richmond, Virginia / (804) 320-4887
E-mail: frank_mckenney@mindspring.com

pgsql-bugs by date:

Previous
From: Victor Marinescu
Date:
Subject: postgresql 7.1 driver
Next
From: Stephan Szabo
Date:
Subject: Re: Bug #512: outer join bug