Re: BUG #9190: Could not fsync file "pg_clog/0000": Bad file descriptor. - Mailing list pgsql-bugs

From	Alvaro Herrera
Subject	Re: BUG #9190: Could not fsync file "pg_clog/0000": Bad file descriptor.
Date	February 12, 2014 17:16:22
Msg-id	20140212141607.GJ6342@eldon.alvh.no-ip.org
In response to	BUG #9190: Could not fsync file "pg_clog/0000": Bad file descriptor. (dvitek@grammatech.com)
List	pgsql-bugs

dvitek@grammatech.com wrote:

> We had a postgres panic a few weeks ago.  Here is a relevant fragment of the
> postgres log:
>
> [2014-01-27 04:57:37 EST 3756] WARNING:  pgstat wait timeout
> ...
> ...
> ...
> [2014-01-27 04:55:36 EST 5804] ERROR:  could not access status of
> transaction 0
> [2014-01-27 04:55:42 EST 5804] DETAIL:  Could not fsync file "pg_clog/0000":
> Bad file descriptor.

I noticed that SimpleLruFlush calls SlruInternalWritePage() to write all
pages, and stores the file descriptors in fdata, with the intention of
fsyncing the files later; SlruInternalWritePage in turn calls
SlruPhysicalWritePage.  If the physical write fails,
SlruInternalWritePage will dutifully close all the files, *but fdata is
not updated to remove the file descriptors*.  This might lead to the
"bad file descriptor" error (but see below).  Really, what we should be
reporting is the failure to do the writes, I think.  There is something
broken about this system that makes the writes fail (something which can
probably be learnt about in the kernel log, if there is such a thing on
Windows), but this part seems to be our bug.
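
To illustrate the failure mode described above, here is a minimal sketch in
C.  It is not the PostgreSQL source: FlushData, close_all_without_reset and
stale_fd_fsync_errno are hypothetical names, the "write failure" is simply
assumed, and the sketch uses POSIX calls (mkstemp, fsync) rather than the
Windows environment from the report.  The reset_after_close flag shows one
possible remedy under those assumptions, not necessarily the fix that
should be applied.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <unistd.h>

#define MAX_FLUSH_FILES 4

/* Hypothetical stand-in for the fdata bookkeeping: descriptors that the
 * flush routine intends to fsync once all pages have been written. */
typedef struct FlushData
{
    int num_files;
    int fd[MAX_FLUSH_FILES];
} FlushData;

/* Models the error path: every recorded descriptor is closed, but the
 * bookkeeping is left untouched, so it still lists the closed fds. */
static void
close_all_without_reset(FlushData *fdata)
{
    for (int i = 0; i < fdata->num_files; i++)
        close(fdata->fd[i]);
    /* Behavior being modeled: num_files is NOT reset to 0 here. */
}

/* Returns the errno reported by fsync on a stale descriptor, 0 if no
 * fsync failed, or -1 if the temporary file could not be created. */
int
stale_fd_fsync_errno(int reset_after_close)
{
    FlushData fdata = {0};
    char path[] = "/tmp/slru_sketch_XXXXXX";
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);               /* file disappears once the fd is closed */

    assert(fdata.num_files < MAX_FLUSH_FILES);
    fdata.fd[fdata.num_files++] = fd;

    /* Assume the physical write failed here, triggering the cleanup. */
    close_all_without_reset(&fdata);
    if (reset_after_close)
        fdata.num_files = 0;    /* hypothetical remedy: forget closed fds */

    /* Later "fsync everything we wrote" pass over the bookkeeping. */
    for (int i = 0; i < fdata.num_files; i++)
    {
        if (fsync(fdata.fd[i]) != 0)
            return errno;
    }
    return 0;
}
```

On a POSIX system, calling stale_fd_fsync_errno(0) should return EBADF,
because fsync is handed a descriptor that has already been closed; with the
flag set to 1 the stale entries are dropped and no fsync is attempted.  In
both cases the original write failure is what ought to be reported.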

The "status of transaction 0" part of the error message should surprise
no one, since InvalidTransactionId is what SimpleLruFlush uses in its
failure report.

This does nothing to explain or help with the PANIC, however; nor why
things seem to have continued running after a PANIC for two minutes.

> [2014-01-27 09:21:04 EST 3080] PANIC:  could not fsync file "pg_xlog/xlogtemp.3080": Bad file descriptor
> [2014-01-27 09:23:01 EST 5404] LOG:  WAL writer process (PID 3080) exited with exit code 3
> [2014-01-27 09:23:07 EST 5404] LOG:  terminating any other active server processes

There is no obvious path in which an fd is clobbered in xlog.c that I
can see.  If there is an explanation for this failure at the filesystem
level, perhaps that can explain the above problem as well.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
