Notify system doesn't recover from "No space" error - Mailing list pgsql-hackers
| From | Kevin Grittner |
|---|---|
| Subject | Notify system doesn't recover from "No space" error |
| Date | |
| Msg-id | 4F339D6D02000025000450F5@gw.wicourts.gov |
| Responses | Re: Notify system doesn't recover from "No space" error |
| List | pgsql-hackers |
On a development system which wasn't being monitored very closely I think we've uncovered a bug in the listen/notify behavior. The system was struggling with lack of disk space for about a week before this surfaced, so it's probably not the most critical sort of bug we could have, but it probably merits a fix. We were getting all sorts of logging about the space problem, and PostgreSQL soldiered bravely on, doing the best it could under such adverse conditions. (That's pretty impressive, really.) A lot of messages like these:

[2012-02-05 01:24:36.816 CST] 14950 <cc cc 127.0.0.1(35321)> ERROR: could not extend file "base/16401/46507734": wrote only 4096 of 8192 bytes at block 0
[2012-02-05 01:24:36.816 CST] 14950 <cc cc 127.0.0.1(35321)> HINT: Check free disk space.
[2012-02-05 01:25:40.643 CST] 5323 LOG: could not write temporary statistics file "pg_stat_tmp/pgstat.tmp": No space left on device
[2012-02-05 01:26:38.971 CST] 15065 <cc cc 127.0.0.1(53258)> ERROR: could not write to hash-join temporary file: No space left on device

I don't know how much of a clue this is, but less than a minute before our first message about pg_notify problems, the statistics messages went from complaining about not being able to *write* the file to also complaining about not being able to *close* the file -- like this:

[2012-02-05 01:26:41.159 CST] 5323 LOG: could not close temporary statistics file "pg_stat_tmp/pgstat.tmp": No space left on device
[2012-02-05 01:26:44.663 CST] 5323 LOG: could not write temporary statistics file "pg_stat_tmp/pgstat.tmp": No space left on device
[2012-02-05 01:26:45.705 CST] 16252 WARNING: pgstat wait timeout

We got three errors in the log with exactly this DETAIL (file, offset and message):

[2012-02-05 01:27:36.878 CST] 14892 <cc cc 127.0.0.1(35320)> ERROR: could not access status of transaction 0
[2012-02-05 01:27:36.878 CST] 14892 <cc cc 127.0.0.1(35320)> DETAIL: Could not write to file "pg_notify/03A5" at offset 188416: No space left on device.
[2012-02-05 01:27:36.878 CST] 14892 <cc cc 127.0.0.1(35320)> STATEMENT: COMMIT
[2012-02-05 01:27:36.882 CST] 14892 <cc cc 127.0.0.1(35320)> LOG: disconnection: session time: 0:07:00.600 user=cc database=cc host=127.0.0.1 port=35320

After that the message changed to a new offset and a message of "Success." for the failure:

[2012-02-05 01:30:36.952 CST] 16383 <cc cc 127.0.0.1(38931)> ERROR: could not access status of transaction 0
[2012-02-05 01:30:36.952 CST] 16383 <cc cc 127.0.0.1(38931)> DETAIL: Could not read from file "pg_notify/03A5" at offset 253952: Success.
[2012-02-05 01:30:36.952 CST] 16383 <cc cc 127.0.0.1(38931)> STATEMENT: COMMIT
[2012-02-05 01:30:36.956 CST] 16383 <cc cc 127.0.0.1(38931)> LOG: disconnection: session time: 0:01:59.537 user=cc database=cc host=127.0.0.1 port=38931

There were 4591 messages over the course of five days with exactly that filename and offset and a message of "Success.", all for transaction 0. About midway through that time disk space was freed and all other symptoms cleared.

It really is pretty impressive that PostgreSQL survived that and was able to resume normal operations in all other respects once disk space was freed for it. Only this one feature stayed in a bad state.

-Kevin
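P.S. For anyone trying to reproduce this, a minimal sketch of the listen/notify pattern involved (channel name and payload are invented for illustration): the notification is appended to the pg_notify queue during COMMIT, which is consistent with the "STATEMENT: COMMIT" lines in the errors above.

    -- Session A: register interest in a channel (name is illustrative only)
    LISTEN example_channel;

    -- Session B: send a notification; the entry is written to the
    -- pg_notify queue at COMMIT, which is where the errors above fired
    BEGIN;
    NOTIFY example_channel, 'payload';
    COMMIT;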