Tom Lane <tgl@sss.pgh.pa.us> writes:
> Doug McNaught <doug@wireboard.com> writes:
> > One funny thing is that the nightly VACUUM doesn't always fail--the
> > system will run smoothly for one to three days on average before a
> > crash.
>
> That does seem to contradict the corrupt-data theory. Do you run a
> VACUUM ANALYZE or just a plain VACUUM? If there were a persisting
> corrupted tuple, I'd expect VACUUM ANALYZE to crash always, VACUUM
> never (VACUUM doesn't inquire into the actual contents of tuples).
I'm running VACUUM, then VACUUM ANALYZE (the docs seem to suggest that
you need both). Basically my script is:
$ vacuumdb -a
$ vacuumdb -z -a
The example I sent was a crash during VACUUM.
> > That's a thought, and I will try it. I'm currently (as of yesterday's
> > crash) running with -d 2 and output sent to a logfile. Is this
> > debuglevel high enough to tell me which table contains the bad tuple,
> > if that's indeed the problem?
>
> That would tell you what query is running. It's not enough to tell you
> where VACUUM is unless you do VACUUM VERBOSE.
Which will no doubt generate reams and reams of data...
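Though only the tail of the reams should matter: the last relation VACUUM VERBOSE names before the crash would be the suspect. A rough sketch (assuming the verbose output still uses the "NOTICE:  --Relation foo--" message format; `locate_vacuum_table` is just a name I made up):

```shell
# Pull the last relation that VACUUM VERBOSE reported before dying.
# Assumes each table is announced with a line like:
#   NOTICE:  --Relation foo--
locate_vacuum_table() {
    # "--" ends grep's option parsing so the pattern's leading
    # dashes aren't taken as flags; tail -1 keeps the last hit.
    grep -- '--Relation' "$1" | tail -1
}
```

So the nightly script could redirect the verbose run into a log, and only that one line would need reading after a crash.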
> > If I can't nail it down that way, how hard would it be to write a C
> > program to scan all the tuples in a database looking for bogus size
> > fields?
>
> Fairly hard. I'd suggest instead that you just do
> psql -c "copy FOO to stdout" dbname >/dev/null
> and try that on each table in turn to see if you get any crashes...
OK, I'll keep that in reserve.
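If it does come out of reserve, the per-table COPY test loops easily enough -- a sketch, assuming a catalog query for user tables ("mydb" and `check_tables` are placeholders of mine, not anything from the thread):

```shell
# Run Tom's `copy FOO to stdout` test against every user table;
# a backend crash (or error) during one COPY pinpoints the bad table.
check_tables() {
    db=$1
    # -t: tuples only, -A: unaligned, so we get bare table names.
    tables=$(psql -t -A -c \
        "select relname from pg_class where relkind = 'r' and relname !~ '^pg_'" "$db")
    for t in $tables; do
        psql -c "copy $t to stdout" "$db" > /dev/null ||
            echo "COPY failed on table: $t"
    done
}

# Usage (against the real database):
# check_tables mydb
```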
Another thing that springs to mind--once the crash happens, the
database doesn't respond (or gives fatal errors) to new connections
and to queries on existing connections. Killing the postmaster does
nothing--I have to send SIGTERM to all backends and the postmaster in
order to get it to exit. I don't know if this helps...
-Doug