dvitek@grammatech.com wrote:
> We had a postgres panic a few weeks ago. Here is a relevant fragment of the
> postgres log:
>
> [2014-01-27 04:57:37 EST 3756] WARNING: pgstat wait timeout
> ...
> ...
> ...
> [2014-01-27 04:55:36 EST 5804] ERROR: could not access status of
> transaction 0
> [2014-01-27 04:55:42 EST 5804] DETAIL: Could not fsync file "pg_clog/0000":
> Bad file descriptor.

I noticed that SimpleLruFlush calls SlruInternalWritePage() to write all
pages, and stores the file descriptors in fdata, with the intention of
fsyncing the files later; SlruInternalWritePage() in turn calls
SlruPhysicalWritePage(). If the physical write fails,
SlruInternalWritePage() will dutifully close all the files, *but fdata is
not updated to remove the file descriptors*. This might lead to the
"bad file descriptor" error (but see below). Really, what we should be
reporting is the failure to do the writes, I think. There is something
broken about this system that makes the writes fail (something which can
probably be learnt about in the kernel log, if there is such a thing on
Windows), but this part seems to be our bug.
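
To illustrate the stale-descriptor pattern described above, here is a
minimal standalone sketch in C. It is not the slru.c code: FlushData,
flush_add_file, flush_abort and flush_sync_all are hypothetical names
invented for this example, and it assumes a POSIX system (mkstemp, fsync)
rather than Windows. It only shows that closing the descriptors without
also forgetting them makes the later fsync pass fail with EBADF, while
resetting the count avoids that.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <unistd.h>

#define MAX_FLUSH_FILES 16		/* hypothetical limit, not PostgreSQL's */

/* Toy stand-in for the fdata bookkeeping: fds kept open for a later fsync. */
typedef struct FlushData
{
	int			num_files;
	int			fd[MAX_FLUSH_FILES];
} FlushData;

/* Create an unlinked temp file and remember its fd; returns -1 on failure. */
static int
flush_add_file(FlushData *fdata)
{
	char		path[] = "/tmp/slru_sketch_XXXXXX";
	int			fd;

	assert(fdata->num_files < MAX_FLUSH_FILES);
	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path);				/* the file lives until the fd is closed */
	fdata->fd[fdata->num_files++] = fd;
	return fd;
}

/*
 * Error path after a failed write: close every remembered fd.
 * With forget = 0 the closed fds stay in the array (the pattern described
 * above); with forget = 1 the array is emptied as well.
 */
static void
flush_abort(FlushData *fdata, int forget)
{
	int			i;

	for (i = 0; i < fdata->num_files; i++)
		close(fdata->fd[i]);
	if (forget)
		fdata->num_files = 0;
}

/* Later fsync pass: returns 0 on success, else errno of the first failure. */
static int
flush_sync_all(FlushData *fdata)
{
	int			i;

	for (i = 0; i < fdata->num_files; i++)
		if (fsync(fdata->fd[i]) != 0)
			return errno;
	return 0;
}
```

With a FlushData holding two files, flush_abort(&fdata, 0) followed by
flush_sync_all(&fdata) returns EBADF on a POSIX system; with
flush_abort(&fdata, 1) the sync pass has nothing left to touch and returns
0. Whether resetting the count is the right fix for slru.c itself is a
separate question, and the write failure would still need to be reported.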

The "status of transaction 0" part of the error message should surprise
no one, since InvalidTransactionId is what SimpleLruFlush uses in its
failure report.

This does nothing to explain or help with the PANIC, however, nor does it
explain why things seem to have continued running for two minutes after
the PANIC.

> [2014-01-27 09:21:04 EST 3080] PANIC: could not fsync file "pg_xlog/xlogtemp.3080": Bad file descriptor
> [2014-01-27 09:23:01 EST 5404] LOG: WAL writer process (PID 3080) exited with exit code 3
> [2014-01-27 09:23:07 EST 5404] LOG: terminating any other active server processes

I can see no obvious path in xlog.c in which an fd is clobbered. If there
is an explanation for this failure at the filesystem level, perhaps that
can explain the above problem as well.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services