Folks,
I just had a very unpleasant experience for the first time.
I had an overnight series of data transformations running (usually
they run from 12:30am to 1:20am) and the process hung. Badly enough to
require a "fast" system shutdown and a restore of the database from
backup.
Here are the details:
Platform: Hand-built Dual Athlon MP/Molex RAID 5 (UW SCSI) system.
PostgreSQL 7.2.3
SuSE Linux 7.3
Data imports started normally at 12:00am and apparently completed.
Data transformation process (16-35 UPDATEs and INSERTs affecting a
combined 1,300,000 records) started at about 12:30am after the import
ended. The data transformations are a series of functions called by a
Perl script through cron as the root user.
Sometime during the transformation process, a statement hung. The
procedure continued running for at least 2 hours, at which point
another script, set up to detect such problems, ran a "pg_ctl -m fast
stop". Instead of stopping, the PostgreSQL server hung.
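For reference, a watchdog of this kind can be sketched roughly as follows. This is a minimal Python sketch and not the actual script from this incident. The function names and the data directory path are hypothetical. It only shows the general shape: run a job with a deadline, and if the deadline passes, fall back to a fast stop.

```python
import subprocess
import sys


def run_with_deadline(cmd, timeout_s):
    """Run cmd and wait up to timeout_s seconds.

    Returns True if the command exited in time, False if the deadline
    passed (subprocess.run kills the child before raising TimeoutExpired).
    """
    try:
        subprocess.run(cmd, timeout=timeout_s, check=False)
        return True
    except subprocess.TimeoutExpired:
        return False


def fast_stop_command(data_dir):
    """Build the argument list for a fast-mode shutdown.

    data_dir is a hypothetical path to the PostgreSQL data directory.
    """
    return ["pg_ctl", "stop", "-D", data_dir, "-m", "fast"]


# Hypothetical usage (the script name and paths are placeholders):
#   if not run_with_deadline(["perl", "transform.pl"], 2 * 60 * 60):
#       subprocess.run(fast_stop_command("/var/lib/pgsql/data"))
```

Note that killing the client script on timeout does not by itself end the backend that is executing the statement, which is why the fallback goes through pg_ctl.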
When I got to the machine in the morning, there were three frozen
processes: one query, one checkpoint process, and the postmaster.
SIGHUP and SIGTERM were ignored by these; SIGKILL was able to kill
the postmaster process, but the two other processes went to "D" status
and were untouchable.
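On Linux, "D" is the uninterruptible-sleep state: the process is blocked inside the kernel, typically waiting on I/O, and signals (including SIGKILL) are not acted on until that wait ends. A minimal sketch for checking whether a given PID is in that state, assuming a Linux /proc filesystem; the function names are hypothetical:

```python
def parse_proc_state(stat_line):
    """Return the one-letter process state from a /proc/<pid>/stat line.

    The second field (the command name) is wrapped in parentheses and may
    itself contain spaces or ')', so split on the LAST ')' and take the
    first field after it.
    """
    after_comm = stat_line.rsplit(")", 1)[1]
    return after_comm.split()[0]


def is_uninterruptible(pid):
    """True if the process is currently in state 'D' (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        return parse_proc_state(f.read()) == "D"
```

The same state letter is what ps shows in its STAT column, so this only automates what I saw by hand.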
I was forced to fast-shutdown the server. While Postgres did restart OK
after restarting the machine, I did not trust the data integrity, and
restored from backup.
Has anyone else encountered this kind of situation? Is there a way to
prevent it, or a less drastic way to resolve it? What are likely
causes?
-Josh Berkus