From: Kevin Grittner
Subject: Notify system doesn't recover from "No space" error
Msg-id: 4F339D6D02000025000450F5@gw.wicourts.gov
List: pgsql-hackers
On a development system which wasn't being monitored very closely
I think we've uncovered a bug in the listen/notify behavior.  The
system was struggling with lack of disk space for about a week
before this surfaced, so it's probably not the most critical sort
of bug we could have, but it probably merits a fix.
We were getting all sorts of logging about the space problem, and
PostgreSQL soldiered bravely on, doing the best it could under such
adverse conditions.  (That's pretty impressive, really.)  A lot of
messages like these:
[2012-02-05 01:24:36.816 CST] 14950 <cc cc 127.0.0.1(35321)> ERROR: 
could not extend file "base/16401/46507734": wrote only 4096 of 8192
bytes at block 0
[2012-02-05 01:24:36.816 CST] 14950 <cc cc 127.0.0.1(35321)> HINT: 
Check free disk space.

[2012-02-05 01:25:40.643 CST] 5323 LOG:  could not write temporary
statistics file "pg_stat_tmp/pgstat.tmp": No space left on device

[2012-02-05 01:26:38.971 CST] 15065 <cc cc 127.0.0.1(53258)> ERROR: 
could not write to hash-join temporary file: No space left on device

I don't know how much of a clue this is, but less than a minute
before our first message about pg_notify problems, the statistics
messages went from complaining about not being able to *write* the
file to also complaining about not being able to *close* the file --
like this:
[2012-02-05 01:26:41.159 CST] 5323 LOG:  could not close temporary
statistics file "pg_stat_tmp/pgstat.tmp": No space left on device
[2012-02-05 01:26:44.663 CST] 5323 LOG:  could not write temporary
statistics file "pg_stat_tmp/pgstat.tmp": No space left on device
[2012-02-05 01:26:45.705 CST] 16252 WARNING:  pgstat wait timeout
We got three errors in the log with exactly this DETAIL (file,
offset and message):
[2012-02-05 01:27:36.878 CST] 14892 <cc cc 127.0.0.1(35320)> ERROR: 
could not access status of transaction 0
[2012-02-05 01:27:36.878 CST] 14892 <cc cc 127.0.0.1(35320)> DETAIL: 
Could not write to file "pg_notify/03A5" at offset 188416: No space
left on device.
[2012-02-05 01:27:36.878 CST] 14892 <cc cc 127.0.0.1(35320)>
STATEMENT:  COMMIT
[2012-02-05 01:27:36.882 CST] 14892 <cc cc 127.0.0.1(35320)> LOG: 
disconnection: session time: 0:07:00.600 user=cc database=cc
host=127.0.0.1 port=35320
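(For anyone following along who doesn't use the feature: as I
understand it, NOTIFY payloads are appended to the queue files under
pg_notify/ when the notifying transaction commits, which is why these
errors were raised on the COMMIT statement.  Here's a minimal libpq
sketch of the normal flow -- not a reproduction of the problem; the
channel name is made up and connection settings come from the PG*
environment:)

/*
 * Minimal libpq sketch of the feature involved -- not a reproduction
 * of the problem.
 */
#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
    PGconn     *conn = PQconnectdb("");
    PGnotify   *note;

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "%s", PQerrorMessage(conn));
        return 1;
    }

    PQclear(PQexec(conn, "LISTEN interesting_channel"));

    /*
     * The payload reaches the pg_notify queue only when the notifying
     * transaction commits -- hence the COMMIT statement in the errors
     * above.
     */
    PQclear(PQexec(conn, "BEGIN"));
    PQclear(PQexec(conn, "NOTIFY interesting_channel, 'hello'"));
    PQclear(PQexec(conn, "COMMIT"));

    PQconsumeInput(conn);
    while ((note = PQnotifies(conn)) != NULL)
    {
        printf("notification on \"%s\": %s\n", note->relname, note->extra);
        PQfreemem(note);
    }

    PQfinish(conn);
    return 0;
}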
After that the errors changed to a new offset, and the failure was
reported with a message of "Success.":
[2012-02-05 01:30:36.952 CST] 16383 <cc cc 127.0.0.1(38931)> ERROR: 
could not access status of transaction 0
[2012-02-05 01:30:36.952 CST] 16383 <cc cc 127.0.0.1(38931)> DETAIL: 
Could not read from file "pg_notify/03A5" at offset 253952: Success.
[2012-02-05 01:30:36.952 CST] 16383 <cc cc 127.0.0.1(38931)>
STATEMENT:  COMMIT
[2012-02-05 01:30:36.956 CST] 16383 <cc cc 127.0.0.1(38931)> LOG: 
disconnection: session time: 0:01:59.537 user=cc database=cc
host=127.0.0.1 port=38931
There were 4591 messages over the course of five days with exactly
that filename and offset and a message of "Success.", all for
transaction 0.  About midway through that time disk space was freed
and all other symptoms cleared.  It really is pretty impressive that
PostgreSQL survived that and was able to resume normal operations in
all other respects once disk space was freed for it.  Only this one
feature stayed in a bad state.
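For what it's worth, my guess about that "Success." -- and it is only
a guess, I haven't dug into the code -- is that it comes from
reporting errno after a short read: the file apparently ends before
offset 253952 because extension failed while the disk was full, so
the read hits end-of-file early, errno stays 0, and strerror(0) is
"Success".  A standalone sketch of the effect (not PostgreSQL code;
the file name is made up):

/*
 * Standalone sketch: how reporting errno after a short read() can
 * produce a "...: Success." message.
 */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BLCKSZ 8192

int
main(void)
{
    char    buf[BLCKSZ] = {0};
    int     fd;
    ssize_t n;

    /* A "page" file shorter than a full block, as could be left
     * behind when the disk fills up during an extension. */
    fd = open("short_page", O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0 || write(fd, buf, 100) != 100)
        return 1;
    lseek(fd, 0, SEEK_SET);

    errno = 0;
    n = read(fd, buf, BLCKSZ);
    if (n != BLCKSZ)
    {
        /*
         * The read didn't fail, it just hit EOF, so errno is still 0
         * and strerror(errno) says "Success" -- matching the odd
         * DETAIL in the log.
         */
        printf("Could not read from file \"short_page\" at offset 0: %s.\n",
               strerror(errno));
    }
    close(fd);
    return 0;
}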
-Kevin

