Thread: BUG #9190: Could not fsync file "pg_clog/0000": Bad file descriptor.

BUG #9190: Could not fsync file "pg_clog/0000": Bad file descriptor.

From
dvitek@grammatech.com
Date:
The following bug has been logged on the website:

Bug reference:      9190
Logged by:          David Vitek
Email address:      dvitek@grammatech.com
PostgreSQL version: 9.2.4
Operating system:   Windows 8
Description:

We had a postgres panic a few weeks ago.  Here is a relevant fragment of the
postgres log:

[2014-01-27 04:57:37 EST 3756] WARNING:  pgstat wait timeout
...
...
...
[2014-01-27 04:55:36 EST 5804] ERROR:  could not access status of
transaction 0
[2014-01-27 04:55:42 EST 5804] DETAIL:  Could not fsync file "pg_clog/0000":
Bad file descriptor.
...
repeats tens of times, interspersed with pgstat timetouts.
...
[2014-01-27 05:40:40 EST 5804] ERROR:  could not fsync file
"base/16384/17033": Bad file descriptor
[2014-01-27 05:42:20 EST 5804] ERROR:  could not access status of
transaction 0
[2014-01-27 05:42:20 EST 5804] DETAIL:  Could not fsync file "pg_clog/0000":
Bad file descriptor.
[2014-01-27 05:43:30 EST 5804] ERROR:  could not access status of
transaction 0
[2014-01-27 05:43:30 EST 5804] DETAIL:  Could not fsync file "pg_clog/0000":
Bad file descriptor.
[2014-01-27 05:44:53 EST 5804] ERROR:  could not fsync file
"base/16384/11768": Bad file descriptor
...
both flavors of errors continue repeating for hours.
...
[2014-01-27 09:00:44 EST 7536] WARNING:  pgstat wait timeout
[2014-01-27 09:00:54 EST 4604] WARNING:  pgstat wait timeout
[2014-01-27 09:01:14 EST 7536] WARNING:  pgstat wait timeout
[2014-01-27 09:01:24 EST 5992] WARNING:  pgstat wait timeout
[2014-01-27 09:02:12 EST 3420] LOG:  duration: 16231.000 ms  statement:
COMMIT
[2014-01-27 09:02:42 EST 5804] ERROR:  could not access status of
transaction 0
[2014-01-27 09:02:42 EST 5804] DETAIL:  Could not fsync file "pg_clog/0000":
Bad file descriptor.
[2014-01-27 09:02:44 EST 2324] LOG:  duration: 44583.000 ms  statement:
COMMIT
[2014-01-27 09:03:01 EST 2324] LOG:  duration: 11760.000 ms  statement:
COMMIT
[2014-01-27 09:03:20 EST 2324] LOG:  duration: 12952.000 ms  statement:
COMMIT
[2014-01-27 09:03:44 EST 7536] WARNING:  pgstat wait timeout
[2014-01-27 09:03:54 EST 8188] WARNING:  pgstat wait timeout
[2014-01-27 09:04:14 EST 7536] WARNING:  pgstat wait timeout
[2014-01-27 09:04:24 EST 3988] WARNING:  pgstat wait timeout
[2014-01-27 09:04:27 EST 5804] ERROR:  could not access status of
transaction 0
[2014-01-27 09:04:27 EST 5804] DETAIL:  Could not fsync file "pg_clog/0000":
Bad file descriptor.
[2014-01-27 09:05:38 EST 5804] ERROR:  could not access status of
transaction 0
[2014-01-27 09:05:38 EST 5804] DETAIL:  Could not fsync file "pg_clog/0000":
Bad file descriptor.
[2014-01-27 09:05:40 EST 7536] WARNING:  pgstat wait timeout
[2014-01-27 09:05:50 EST 7536] WARNING:  pgstat wait timeout
[2014-01-27 09:05:51 EST 4280] WARNING:  pgstat wait timeout
[2014-01-27 09:06:01 EST 7536] WARNING:  pgstat wait timeout
[2014-01-27 09:06:41 EST 7536] WARNING:  pgstat wait timeout
[2014-01-27 09:06:51 EST 4976] WARNING:  pgstat wait timeout
[2014-01-27 09:06:51 EST 5804] ERROR:  could not access status of
transaction 0
[2014-01-27 09:06:51 EST 5804] DETAIL:  Could not fsync file "pg_clog/0000":
Bad file descriptor.
[2014-01-27 09:07:11 EST 7536] WARNING:  pgstat wait timeout
[2014-01-27 09:07:21 EST 4368] WARNING:  pgstat wait timeout
[2014-01-27 09:07:41 EST 7536] WARNING:  pgstat wait timeout
[2014-01-27 09:07:51 EST 1416] WARNING:  pgstat wait timeout
[2014-01-27 09:08:11 EST 7536] WARNING:  pgstat wait timeout
[2014-01-27 09:08:11 EST 5804] ERROR:  could not access status of
transaction 0
[2014-01-27 09:08:11 EST 5804] DETAIL:  Could not fsync file "pg_clog/0000":
Bad file descriptor.
[2014-01-27 09:08:21 EST 4580] WARNING:  pgstat wait timeout
[2014-01-27 09:08:41 EST 7536] WARNING:  pgstat wait timeout
[2014-01-27 09:08:51 EST 8124] WARNING:  pgstat wait timeout
[2014-01-27 09:09:11 EST 7536] WARNING:  pgstat wait timeout
[2014-01-27 09:09:25 EST 5804] ERROR:  could not access status of
transaction 0
[2014-01-27 09:09:25 EST 5804] DETAIL:  Could not fsync file "pg_clog/0000":
Bad file descriptor.
[2014-01-27 09:10:41 EST 7536] WARNING:  pgstat wait timeout
[2014-01-27 09:10:41 EST 5804] ERROR:  could not access status of
transaction 0
[2014-01-27 09:10:41 EST 5804] DETAIL:  Could not fsync file "pg_clog/0000":
Bad file descriptor.
[2014-01-27 09:11:41 EST 7536] WARNING:  pgstat wait timeout
[2014-01-27 09:11:51 EST 5140] WARNING:  pgstat wait timeout
[2014-01-27 09:12:11 EST 7536] WARNING:  pgstat wait timeout
[2014-01-27 09:12:21 EST 4608] WARNING:  pgstat wait timeout
[2014-01-27 09:12:41 EST 7536] WARNING:  pgstat wait timeout
[2014-01-27 09:12:51 EST 7080] WARNING:  pgstat wait timeout
[2014-01-27 09:13:11 EST 7536] WARNING:  pgstat wait timeout
[2014-01-27 09:13:21 EST 5836] WARNING:  pgstat wait timeout
[2014-01-27 09:13:41 EST 7536] WARNING:  pgstat wait timeout
[2014-01-27 09:13:51 EST 1788] WARNING:  pgstat wait timeout
[2014-01-27 09:14:11 EST 7536] WARNING:  pgstat wait timeout
[2014-01-27 09:14:21 EST 672] WARNING:  pgstat wait timeout
[2014-01-27 09:14:41 EST 7536] WARNING:  pgstat wait timeout
[2014-01-27 09:14:51 EST 7368] WARNING:  pgstat wait timeout
[2014-01-27 09:15:11 EST 7536] WARNING:  pgstat wait timeout
[2014-01-27 09:15:21 EST 6940] WARNING:  pgstat wait timeout
[2014-01-27 09:16:11 EST 7536] WARNING:  pgstat wait timeout
[2014-01-27 09:16:21 EST 3944] WARNING:  pgstat wait timeout
[2014-01-27 09:16:41 EST 7536] WARNING:  pgstat wait timeout
[2014-01-27 09:16:51 EST 4620] WARNING:  pgstat wait timeout
[2014-01-27 09:17:11 EST 7536] WARNING:  pgstat wait timeout
[2014-01-27 09:17:21 EST 6728] WARNING:  pgstat wait timeout
[2014-01-27 09:17:41 EST 7536] WARNING:  pgstat wait timeout
[2014-01-27 09:17:51 EST 5476] WARNING:  pgstat wait timeout
[2014-01-27 09:18:11 EST 7536] WARNING:  pgstat wait timeout
[2014-01-27 09:18:21 EST 2652] WARNING:  pgstat wait timeout
[2014-01-27 09:18:41 EST 7536] WARNING:  pgstat wait timeout
[2014-01-27 09:18:51 EST 7944] WARNING:  pgstat wait timeout
[2014-01-27 09:19:11 EST 7536] WARNING:  pgstat wait timeout
[2014-01-27 09:21:04 EST 3080] PANIC:  could not fsync file
"pg_xlog/xlogtemp.3080": Bad file descriptor
[2014-01-27 09:23:01 EST 5404] LOG:  WAL writer process (PID 3080) exited
with exit code 3
[2014-01-27 09:23:07 EST 5404] LOG:  terminating any other active server
processes

We ran into this issue one time on one computer.  We are unable to reproduce
the issue.  It has been exposed to the same workload many times before and
since.  The machine was under fairly heavy load when this happened.  The
database load was likely write-heavy.  Many recent commit statements took >
10 seconds to push through, and there's a good chance auto-vacuum was
competing.

It is a 64-bit Windows 8.0 machine with fairly normal consumer grade
hardware.  The disk is a local NTFS disk.  There is no antivirus software on
the system.

Re: BUG #9190: Could not fsync file "pg_clog/0000": Bad file descriptor.

From
Alvaro Herrera
Date:
dvitek@grammatech.com wrote:

> We had a postgres panic a few weeks ago.  Here is a relevant fragment of the
> postgres log:
>
> [2014-01-27 04:57:37 EST 3756] WARNING:  pgstat wait timeout
> ...
> ...
> ...
> [2014-01-27 04:55:36 EST 5804] ERROR:  could not access status of
> transaction 0
> [2014-01-27 04:55:42 EST 5804] DETAIL:  Could not fsync file "pg_clog/0000":
> Bad file descriptor.

I noticed that SimpleLruFlush calls SlruInternalWritePage() to write all
pages, and stores the file descriptors in fdata, with the intention of
fsyncing the files later; SlruInternalWritePage in turn calls
SlruPhysicalWritePage.  If the physical write fails,
SlruInternalWritePage will dutifully close all the files, *but fdata is
not updated to remove the file descriptors*.  This might lead to the
"bad file descriptor" error (but see below).  Really, what we should be
reporting is the failure to do the writes, I think.  There is something
broken about this system that makes the writes fail (something which can
probably be learnt about in the kernel log, if there is such a thing on
Windows), but this part seems our bug.

The "status of transaction 0" part of the error message should surprise
no one, since InvalidTransactionId is what SimpleLruFlush uses in its
failure report.

This does nothing to explain or help with the PANIC, however; nor why
things seem to have continued running after a PANIC for two minutes.

> [2014-01-27 09:21:04 EST 3080] PANIC:  could not fsync file "pg_xlog/xlogtemp.3080": Bad file descriptor
> [2014-01-27 09:23:01 EST 5404] LOG:  WAL writer process (PID 3080) exited with exit code 3
> [2014-01-27 09:23:07 EST 5404] LOG:  terminating any other active server processes

There is no obvious path in which an fd is clobbered in xlog.c that I
can see.  If there is an explanation for this failure at the filesystem
level, perhaps that can explain the above problem as well.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services