Perl script failure => Postgres 7.1.2 database corruption - Mailing list pgsql-bugs
From | Frank McKenney |
---|---|
Subject | Perl script failure => Postgres 7.1.2 database corruption |
Date | |
Msg-id | 200111090409.XAA00267@maynard.mail.mindspring.net Whole thread Raw |
Responses |
Re: Perl script failure => Postgres 7.1.2 database corruption
|
List | pgsql-bugs |
Severity -------- Major annoyance (having to rebuild a database)... unless there is a simple workaround I was too ignorant to apply (;-). Problem ------- A few days ago, for no obvious reason, a Perl script used to update the database portion of a 'web site suddenly stopped working. Several hours later it became apparent that "idle" Postgres processes were queuing up on the Postgres server machine. Still later it became obvious that at least one table in the database was damaged to the point that even pg_dump would not dump it properly. The 'web site uses Perl CGI (Apache mod_perl) and DBD::Pg to access Postgres database on a separate server. Symptoms -------- - Backed-up hung ("idle") Postgres processes. - Undeletable table in database (apparently damaged). - pg_dump failure. - psql errors accessing table. - Perl DBI errors reported in Apache/mod_perl error log. Background ---------- The Perl script involved here updates the Postgres database from Win32 .DBF files using DBD:Pg and DBD::Xbase. The script has run successfully for several weeks, and in fact ran successfully using the same set of DBF files last Friday. In order to minimize effect of the update on users who may be using the database, the script performs the entire database update as a single transaction. In Perl::DBI terms, it issues a $obj = DBI->connect($db_datasource, $db_username, $db_password, {AutoCommit => 0}); and when the updates have been added without error (which usually happens) it issues a $obj->commit(); $obj->disconnect(); $obj = undef; This time the update died partway through: + Marking old records for deletion. (At this point the script was in the process of issuing DELETE FROM TABLE xxxxx statements) DBD::Pg::db do failed: pqReadData() -- backend closed the channel unexpectedly. at ./update-foundation-db.pl line 244. Could not purge old records from summary, stopping at ./update-foundation-db.pl line 244. What triggered the first error is still not completely clear, but it may have been the result of the account exceeding its disk space quota. However, even after some 40Mb of files were deleted and the quota was increased by 50Mb, the script continued to fail at the same point. The only mildly bright spot in all this is that none of this appeared to affect 'web site users. SQL queries from site users continued to yield results throughout (well, up to the point where I deleted the database (;-)). Other messages reported by the script included: DBD::Pg::db do failed: ERROR: heap_delete: (am)invalid tid at ./update-foundation-db.pl line 244. DBD::Pg::db do failed: ERROR: simple_heap_delete: tuple concurrently updated at ./update-foundation-db.pl line 244. DBD::Pg::db do failed: ERROR: Relation 1632370768 does not exist at ./update-foundation-db.pl line 244. DBD::Pg::db do failed: ERROR: Relation 419450449 does not exist at ./update-foundation-db.pl line 244. At the same time, "idle" Postgres processes were backing up on the Postgres server. These only cleared when the Postgres server process was restarted. ------- 10626 ? S 0:00 postgres: websiteadmin websiteadmin 206.163.128.1 idle in transaction (and 8 more like it at one point) ------- Postgres access from the 'web site logged errors to the Apache error log for the site's virtual host: ----------- NOTICE: Message from PostgreSQL backend: The Postmaster has informed me that some other backend died abnormally and possibly corrupted shared memory. I have rolled back the current transaction and am going to terminate your database system connection and exit. Please reconnect to the database system and repeat your query. [Tue Nov 6 20:52:21 2001] null: Database handle destroyed without explicit disconnect at /usr/lib/perl5/site_perl/Apache/DBI.pm line 119. ----------- A number of the 'null:' errors were reported. A similar NOTICE: preceeded the following errors as well: ----------- [Tue Nov 6 20:52:53 2001] [error] FATAL 2: XLogFlush: request is not satisfied at /home1/g/websiteadmin/site/modules//StandardUtil.pm line 48. [Tue Nov 6 20:52:54 2001] null: DBD::Pg::db rollback failed: rollback failed at /usr/lib/perl5/site_perl/Apache/DBI.pm line 153. ----------- I attempted to dump the database with pg_dump, and failed miserably. The results convinced me that the database was _seriously_ mangled. ----------- pg_dump -c -Fc -h postgres.sitehost.com -u websiteadmin >db.pgd Username: websiteadmin Password: ERROR: Relation 419450449 does not exist PQendcopy: resetting connection SQL query to dump the contents of Table 'summary' did not execute correctly. After we read all the table contents from the backend, PQendcopy() failed. Explanation from backend: 'ERROR: Relation 419450449 does not exist '. The query was: 'COPY "summary" TO stdout;'. psql -h postgres.kattare.com -U websiteadmin -W Password: Welcome to psql, the PostgreSQL interactive terminal. --snip-- websiteadmin=> vacuum verbose summary; NOTICE: --Relation summary-- NOTICE: Pages 728: Changed 34, reaped 604, Empty 0, New 0; Tup 1032: Vac 5094, Keep/VTL 0/0, Crash 0, UnUsed 0, MinLen 96, MaxLen 2030; Re-using: Free/Avail. Space 4913484/4913484; EndEmpty/Avail. Pages 0/604. CPU 0.03s/0.00u sec. NOTICE: Index summary_pkey: Pages 31; Tuples 1032: Deleted 5094. CPU 0.00s/0.06u sec. NOTICE: Rel summary: Pages: 728 --> 118; Tuple(s) moved: 1032. CPU 0.11s/0.16u sec. NOTICE: Index summary_pkey: Pages 31; Tuples 1032: Deleted 1032. CPU 0.00s/0.01u sec. NOTICE: --Relation pg_toast_20037-- NOTICE: Rel pg_toast_20037: TID 1444/14: OID IS INVALID. TUPGONE 1. NOTICE: Rel pg_toast_20037: TID 1444/19: OID IS INVALID. TUPGONE 1. -- Many, MANY similar lines snipped -- NOTICE: Rel pg_toast_20037: TID 1444/2008: OID IS INVALID. TUPGONE 1. NOTICE: Rel pg_toast_20037: TID 1444/2016: OID IS INVALID. TUPGONE 1. pqReadData() -- backend closed the channel unexpectedly. This probably means the backend terminated abnormally before or while processing the request. The connection to the server was lost. Attempting reset: Failed. !> -> \q Workaround ---------- Due to time constraints, and because I had the data available, I deleted the database with 'dropdb -h', re-created it using 'createdb', and re-populated it using the now-functioning Perl script. Another choice might have been to restore the database from a backup, assuming that the database corruption actually occurred in conjunction with the initial script failure and not some time earlier. Environment ----------- DB Server: Postgres 7.1.2, Slackware Linux 8.? SELECT version() reports PostgreSQL 7.1.2 on i686-pc-linux-gnu, compiled by GCC egcs-2.91.66 DB Client: Apache 1.3.20, Perl 5.6.1, mod_perl, Slackware 8.1 Questions --------- 1) Are there circumstances under which a "space exceeded" error on a client machine _can_ damage a database on the Postgres server machine? 2) The Postgres error messages all indicate that the transaction was going to be "rolled back", but to all indications (including the hung processes (;-)) it was not. Is this a bug? Or am I misreading the message? 3) What other things could we have tried to recover this situation? Could I have ignored Postgres' claim that it was rolling back the transaction, re-connected to the database (with psql, say), and issued a ROLLBACK myself? 4) Where would I find information that would help me accurately estimate the client's disk space requirements for SQL database processing on a Postgres server? Frank McKenney Frank McKenney, McKenney Associates Richmond, Virginia / (804) 320-4887 E-mail: frank_mckenney@mindspring.com
pgsql-bugs by date: