Notify system doesn't recover from "No space" error - Mailing list pgsql-hackers
| From | Kevin Grittner |
|---|---|
| Subject | Notify system doesn't recover from "No space" error |
| Date | |
| Msg-id | 4F339D6D02000025000450F5@gw.wicourts.gov |
| Responses | Re: Notify system doesn't recover from "No space" error |
| List | pgsql-hackers |
On a development system which wasn't being monitored very closely I think we've uncovered a bug in the listen/notify behavior. The system was struggling with lack of disk space for about a week before this surfaced, so it's probably not the most critical sort of bug we could have, but it probably merits a fix. We were getting all sorts of logging about the space problem, and PostgreSQL soldiered bravely on, doing the best it could under such adverse conditions. (That's pretty impressive, really.) A lot of messages like these:

[2012-02-05 01:24:36.816 CST] 14950 <cc cc 127.0.0.1(35321)> ERROR: could not extend file "base/16401/46507734": wrote only 4096 of 8192 bytes at block 0
[2012-02-05 01:24:36.816 CST] 14950 <cc cc 127.0.0.1(35321)> HINT: Check free disk space.
[2012-02-05 01:25:40.643 CST] 5323 LOG: could not write temporary statistics file "pg_stat_tmp/pgstat.tmp": No space left on device
[2012-02-05 01:26:38.971 CST] 15065 <cc cc 127.0.0.1(53258)> ERROR: could not write to hash-join temporary file: No space left on device

I don't know how much of a clue this is, but less than a minute before our first message about pg_notify problems, the statistics messages went from complaining about not being able to *write* the file to also complaining about not being able to *close* the file -- like this:

[2012-02-05 01:26:41.159 CST] 5323 LOG: could not close temporary statistics file "pg_stat_tmp/pgstat.tmp": No space left on device
[2012-02-05 01:26:44.663 CST] 5323 LOG: could not write temporary statistics file "pg_stat_tmp/pgstat.tmp": No space left on device
[2012-02-05 01:26:45.705 CST] 16252 WARNING: pgstat wait timeout

We got three errors in the log with exactly this DETAIL (file, offset and message):

[2012-02-05 01:27:36.878 CST] 14892 <cc cc 127.0.0.1(35320)> ERROR: could not access status of transaction 0
[2012-02-05 01:27:36.878 CST] 14892 <cc cc 127.0.0.1(35320)> DETAIL: Could not write to file "pg_notify/03A5" at offset 188416: No space left on device.
[2012-02-05 01:27:36.878 CST] 14892 <cc cc 127.0.0.1(35320)> STATEMENT: COMMIT
[2012-02-05 01:27:36.882 CST] 14892 <cc cc 127.0.0.1(35320)> LOG: disconnection: session time: 0:07:00.600 user=cc database=cc host=127.0.0.1 port=35320

After that the message changed to a new offset and a message of "Success." for the failure:

[2012-02-05 01:30:36.952 CST] 16383 <cc cc 127.0.0.1(38931)> ERROR: could not access status of transaction 0
[2012-02-05 01:30:36.952 CST] 16383 <cc cc 127.0.0.1(38931)> DETAIL: Could not read from file "pg_notify/03A5" at offset 253952: Success.
[2012-02-05 01:30:36.952 CST] 16383 <cc cc 127.0.0.1(38931)> STATEMENT: COMMIT
[2012-02-05 01:30:36.956 CST] 16383 <cc cc 127.0.0.1(38931)> LOG: disconnection: session time: 0:01:59.537 user=cc database=cc host=127.0.0.1 port=38931

There were 4591 messages over the course of five days with exactly that filename and offset and a message of "Success.", all for transaction 0. About midway through that time disk space was freed and all other symptoms cleared.

It really is pretty impressive that PostgreSQL survived that and was able to resume normal operations in all other respects once disk space was freed for it. Only this one feature stayed in a bad state.

-Kevin
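P.S. For anyone trying to reproduce this, a minimal sketch of the listen/notify pattern involved (channel name and payload are invented for illustration): the notification is appended to the pg_notify queue during COMMIT, which is consistent with the "STATEMENT: COMMIT" lines in the errors above.

    -- Session A: register interest in a channel (name is illustrative only)
    LISTEN example_channel;

    -- Session B: send a notification; the entry is written to the
    -- pg_notify queue at COMMIT, which is where the errors above fired
    BEGIN;
    NOTIFY example_channel, 'payload';
    COMMIT;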