Thread: Notify system doesn't recover from "No space" error
On a development system which wasn't being monitored very closely I think we've uncovered a bug in the listen/notify behavior. The system was struggling with lack of disk space for about a week before this surfaced, so it's probably not the most critical sort of bug we could have, but it does seem to merit a fix.

We were getting all sorts of logging about the space problem, and PostgreSQL soldiered bravely on, doing the best it could under such adverse conditions. (That's pretty impressive, really.) A lot of messages like these:

[2012-02-05 01:24:36.816 CST] 14950 <cc cc 127.0.0.1(35321)> ERROR: could not extend file "base/16401/46507734": wrote only 4096 of 8192 bytes at block 0
[2012-02-05 01:24:36.816 CST] 14950 <cc cc 127.0.0.1(35321)> HINT: Check free disk space.
[2012-02-05 01:25:40.643 CST] 5323 LOG: could not write temporary statistics file "pg_stat_tmp/pgstat.tmp": No space left on device
[2012-02-05 01:26:38.971 CST] 15065 <cc cc 127.0.0.1(53258)> ERROR: could not write to hash-join temporary file: No space left on device

I don't know how much of a clue this is, but less than a minute before our first message about pg_notify problems, the statistics messages went from complaining about not being able to *write* the file to also complaining about not being able to *close* the file -- like this:

[2012-02-05 01:26:41.159 CST] 5323 LOG: could not close temporary statistics file "pg_stat_tmp/pgstat.tmp": No space left on device
[2012-02-05 01:26:44.663 CST] 5323 LOG: could not write temporary statistics file "pg_stat_tmp/pgstat.tmp": No space left on device
[2012-02-05 01:26:45.705 CST] 16252 WARNING: pgstat wait timeout

We got three errors in the log with exactly this DETAIL (file, offset and message):

[2012-02-05 01:27:36.878 CST] 14892 <cc cc 127.0.0.1(35320)> ERROR: could not access status of transaction 0
[2012-02-05 01:27:36.878 CST] 14892 <cc cc 127.0.0.1(35320)> DETAIL: Could not write to file "pg_notify/03A5" at offset 188416: No space left on device.
[2012-02-05 01:27:36.878 CST] 14892 <cc cc 127.0.0.1(35320)> STATEMENT: COMMIT
[2012-02-05 01:27:36.882 CST] 14892 <cc cc 127.0.0.1(35320)> LOG: disconnection: session time: 0:07:00.600 user=cc database=cc host=127.0.0.1 port=35320

After that the message changed to a new offset and a message of "Success." for the failure:

[2012-02-05 01:30:36.952 CST] 16383 <cc cc 127.0.0.1(38931)> ERROR: could not access status of transaction 0
[2012-02-05 01:30:36.952 CST] 16383 <cc cc 127.0.0.1(38931)> DETAIL: Could not read from file "pg_notify/03A5" at offset 253952: Success.
[2012-02-05 01:30:36.952 CST] 16383 <cc cc 127.0.0.1(38931)> STATEMENT: COMMIT
[2012-02-05 01:30:36.956 CST] 16383 <cc cc 127.0.0.1(38931)> LOG: disconnection: session time: 0:01:59.537 user=cc database=cc host=127.0.0.1 port=38931

There were 4591 messages over the course of five days with exactly that filename and offset and a message of "Success.", all for transaction 0. About midway through that time disk space was freed and all other symptoms cleared.

It really is pretty impressive that PostgreSQL survived that and was able to resume normal operations in all other respects once disk space was freed for it. Only this one feature stayed in a bad state.

-Kevin
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > We got three errors in the log with exactly this DETAIL (file, > offset and message): > [2012-02-05 01:27:36.878 CST] 14892 <cc cc 127.0.0.1(35320)> ERROR: > could not access status of transaction 0 > [2012-02-05 01:27:36.878 CST] 14892 <cc cc 127.0.0.1(35320)> DETAIL: > Could not write to file "pg_notify/03A5" at offset 188416: No space > left on device. > After that the message changed to a new offset and a message of > "Success." for the failure: > [2012-02-05 01:30:36.952 CST] 16383 <cc cc 127.0.0.1(38931)> ERROR: > could not access status of transaction 0 > [2012-02-05 01:30:36.952 CST] 16383 <cc cc 127.0.0.1(38931)> DETAIL: > Could not read from file "pg_notify/03A5" at offset 253952: > Success. IIRC, the "could not access status of transaction" bits are artifacts stemming from the use of the SLRU infrastructure for pg_notify access. Sometime we ought to think a bit harder about how to specialize SLRU's error messages for the cases where what it's manipulating aren't XIDs. Also, the "Could not read ...: Success" bit presumably ought to be spelled "Could not read ...: unexpected EOF". I thought we'd fixed that most places, but we must've missed a spot in SLRU. However, both of those points are just cosmetic. The substantive issue is that you say it didn't resume normal behavior once space became available again. Can you provide more info about that? In particular, what behavior was being seen by the application? regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
>> [2012-02-05 01:30:36.952 CST] 16383 <cc cc 127.0.0.1(38931)>
>> ERROR: could not access status of transaction 0
>> [2012-02-05 01:30:36.952 CST] 16383 <cc cc 127.0.0.1(38931)>
>> DETAIL: Could not read from file "pg_notify/03A5" at offset
>> 253952: Success.

> The substantive issue is that you say it didn't resume normal
> behavior once space became available again. Can you provide more
> info about that? In particular, what behavior was being seen by
> the application?

The application LISTENs on channel tcn and a trigger function is attached to most permanent tables to NOTIFY for DML on that channel. (See the recently committed triggered_change_notification patch if you want to see the trigger function.)

At this point it might be hard to determine (at least without re-creating the problem) whether it was only transactions which issued a NOTIFY, but the application never really got very far because of COMMIT statements failing with the above message. The report to us was that testers were unable to start the application. I believe that the above error on COMMIT kept the application from getting past initial tests that the connection was good.

-Kevin
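For context, the listening side of such a setup can be as small as the libpq sketch below. Only the channel name tcn comes from the message above; the connection string, polling loop, and output are illustrative assumptions, not the reported application.

#include <stdio.h>
#include <unistd.h>
#include <libpq-fe.h>

int
main(void)
{
    PGconn     *conn = PQconnectdb("dbname=cc");
    PGresult   *res;
    PGnotify   *note;

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }

    /* Subscribe to the channel the triggers NOTIFY on */
    res = PQexec(conn, "LISTEN tcn");
    if (PQresultStatus(res) != PGRES_COMMAND_OK)
        fprintf(stderr, "LISTEN failed: %s", PQerrorMessage(conn));
    PQclear(res);

    for (;;)
    {
        /* Pull pending input from the socket, then drain notifications */
        if (!PQconsumeInput(conn))
        {
            fprintf(stderr, "lost connection: %s", PQerrorMessage(conn));
            break;
        }
        while ((note = PQnotifies(conn)) != NULL)
        {
            printf("NOTIFY on \"%s\" from backend %d: %s\n",
                   note->relname, note->be_pid, note->extra);
            PQfreemem(note);
        }
        sleep(1);               /* a real client would wait on PQsocket(conn) with select() */
    }

    PQfinish(conn);
    return 0;
}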
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > The application LISTENs on channel tcn and a trigger function is > attached to most permanent tables to NOTIFY for DML on that channel. > ... > The report to us was that testers were unable to start the > application. I believe that the above error on COMMIT kept the > application from getting past initial tests that the connection was > good. OK, so it was issuing a LISTEN and the LISTEN was failing at commit? (It's a bit hard to believe that you'd be triggering one of the NOTIFYs during "initial connection tests".) regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
>> The application LISTENs on channel tcn and a trigger function is
>> attached to most permanent tables to NOTIFY for DML on that
>> channel.
>> ...
>> The report to us was that testers were unable to start the
>> application. I believe that the above error on COMMIT kept the
>> application from getting past initial tests that the connection
>> was good.
>
> OK, so it was issuing a LISTEN and the LISTEN was failing at
> commit? (It's a bit hard to believe that you'd be triggering one
> of the NOTIFYs during "initial connection tests".)

In continued digging through logs I found something to indicate the transaction on which COMMIT failed for 2528 of the failures. In all cases the transaction made a change which would have fired a NOTIFY at commit (in a deferred trigger). So that initial assumption about the cause doesn't hold up. I'm not sure at this point why the tester perception was that they couldn't get in.

-Kevin
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote: > In continued digging through logs I found something to indicate > the transaction on which COMMIT failed for 2528 of the failures. > In all cases the transaction made a change which would have fired > a NOTIFY at commit (in a deferred trigger). Arg. Sorry, I got confused there. No deferred trigger involved in the triggered change notification; that is only in the replication code. I haven't seen anything to suggest that the replication trigger is contributing to this, but I guess I couldn't rule it out, either. -Kevin
Christoph Berg <christoph.berg@credativ.de> writes:
> Warming up an old thread here - we ran into the same problem.

Thanks for the example proving it is repeatable. I dug into this a bit and found the problem.

When a page of pg_notify is filled, asyncQueueAddEntries() advances the global QUEUE_HEAD to point to the next page before it calls SimpleLruZeroPage() to make that page exist in slru.c. But SimpleLruZeroPage could fail --- even though it doesn't attempt to write out the new page, it might need to dump some existing dirty page from shared memory before it can allocate a slot to the new page. If it does fail, then QUEUE_HEAD is pointing at a page number that doesn't exist either on disk or in slru.c buffers. Subsequent attempts to execute asyncQueueAddEntries() will fail because SimpleLruReadPage() says "page? what page?". And there is noplace else in async.c that would try to create the missing page. This error will persist until the server is restarted (thus resetting the state of pg_notify), even if the underlying disk-full condition is cleared.

The solution is to make sure we don't advance QUEUE_HEAD into the new page until SimpleLruZeroPage has created it. There are a couple of ways the code could be fixed, but what I chose to do in the attached proposed patch is to make asyncQueueAddEntries work with a local queue_head pointer, which is only copied back to shared memory on successful exit.

Comments, objections?

			regards, tom lane

diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index fcb087ed15dfbb4d567ee0c742e1db8ec1353a2d..6fcd9093268e7bc5aaf6102fa9d6adcdcced1d6d 100644
*** a/src/backend/commands/async.c
--- b/src/backend/commands/async.c
*************** static ListCell *
*** 1285,1290 ****
--- 1285,1291 ----
  asyncQueueAddEntries(ListCell *nextNotify)
  {
  	AsyncQueueEntry qe;
+ 	QueuePosition queue_head;
  	int			pageno;
  	int			offset;
  	int			slotno;
*************** asyncQueueAddEntries(ListCell *nextNotif
*** 1292,1299 ****
  	/* We hold both AsyncQueueLock and AsyncCtlLock during this operation */
  	LWLockAcquire(AsyncCtlLock, LW_EXCLUSIVE);
  
  	/* Fetch the current page */
! 	pageno = QUEUE_POS_PAGE(QUEUE_HEAD);
  	slotno = SimpleLruReadPage(AsyncCtl, pageno, true, InvalidTransactionId);
  	/* Note we mark the page dirty before writing in it */
  	AsyncCtl->shared->page_dirty[slotno] = true;
--- 1293,1313 ----
  	/* We hold both AsyncQueueLock and AsyncCtlLock during this operation */
  	LWLockAcquire(AsyncCtlLock, LW_EXCLUSIVE);
  
+ 	/*
+ 	 * We work with a local copy of QUEUE_HEAD, which we write back to shared
+ 	 * memory upon exiting.  The reason for this is that if we have to advance
+ 	 * to a new page, SimpleLruZeroPage might fail (out of disk space, for
+ 	 * instance), and we must not advance QUEUE_HEAD if it does.  (Otherwise,
+ 	 * subsequent insertions would try to put entries into a page that slru.c
+ 	 * thinks doesn't exist yet.)  So, use a local position variable.  Note
+ 	 * that if we do fail, any already-inserted queue entries are forgotten;
+ 	 * this is okay, since they'd be useless anyway after our transaction
+ 	 * rolls back.
+ 	 */
+ 	queue_head = QUEUE_HEAD;
+ 
  	/* Fetch the current page */
! 	pageno = QUEUE_POS_PAGE(queue_head);
  	slotno = SimpleLruReadPage(AsyncCtl, pageno, true, InvalidTransactionId);
  	/* Note we mark the page dirty before writing in it */
  	AsyncCtl->shared->page_dirty[slotno] = true;
*************** asyncQueueAddEntries(ListCell *nextNotif
*** 1305,1311 ****
  		/* Construct a valid queue entry in local variable qe */
  		asyncQueueNotificationToEntry(n, &qe);
  
! 		offset = QUEUE_POS_OFFSET(QUEUE_HEAD);
  
  		/* Check whether the entry really fits on the current page */
  		if (offset + qe.length <= QUEUE_PAGESIZE)
--- 1319,1325 ----
  		/* Construct a valid queue entry in local variable qe */
  		asyncQueueNotificationToEntry(n, &qe);
  
! 		offset = QUEUE_POS_OFFSET(queue_head);
  
  		/* Check whether the entry really fits on the current page */
  		if (offset + qe.length <= QUEUE_PAGESIZE)
*************** asyncQueueAddEntries(ListCell *nextNotif
*** 1331,1338 ****
  				 &qe,
  				 qe.length);
  
! 		/* Advance QUEUE_HEAD appropriately, and note if page is full */
! 		if (asyncQueueAdvance(&(QUEUE_HEAD), qe.length))
  		{
  			/*
  			 * Page is full, so we're done here, but first fill the next page
--- 1345,1352 ----
  				 &qe,
  				 qe.length);
  
! 		/* Advance queue_head appropriately, and detect if page is full */
! 		if (asyncQueueAdvance(&(queue_head), qe.length))
  		{
  			/*
  			 * Page is full, so we're done here, but first fill the next page
*************** asyncQueueAddEntries(ListCell *nextNotif
*** 1342,1353 ****
  			 * asyncQueueIsFull() ensured that there is room to create this
  			 * page without overrunning the queue.
  			 */
! 			slotno = SimpleLruZeroPage(AsyncCtl, QUEUE_POS_PAGE(QUEUE_HEAD));
  			/* And exit the loop */
  			break;
  		}
  	}
  
  	LWLockRelease(AsyncCtlLock);
  
  	return nextNotify;
--- 1356,1370 ----
  			 * asyncQueueIsFull() ensured that there is room to create this
  			 * page without overrunning the queue.
  			 */
! 			slotno = SimpleLruZeroPage(AsyncCtl, QUEUE_POS_PAGE(queue_head));
  			/* And exit the loop */
  			break;
  		}
  	}
  
+ 	/* Success, so update the global QUEUE_HEAD */
+ 	QUEUE_HEAD = queue_head;
+ 
  	LWLockRelease(AsyncCtlLock);
  
  	return nextNotify;
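The essence of the fix is a common pattern: mutate a private copy of the shared state through every step that can fail, and publish it only once all of them have succeeded. Here is a standalone sketch of that idea with invented names (QueuePos, alloc_page, append_entry); it is not the PostgreSQL code, just the shape of it.

#include <stdbool.h>
#include <stdio.h>

typedef struct
{
    int     page;           /* page number the queue head is on */
    int     offset;         /* byte offset within that page */
} QueuePos;

static QueuePos shared_head;    /* shared queue-head position (lock assumed held) */

/* Stand-in for a page allocator that can fail (e.g. on ENOSPC). */
static bool
alloc_page(int pageno)
{
    /* pretend allocation of page 3 fails, to exercise the error path */
    return pageno != 3;
}

static bool
append_entry(int entry_len, int page_size)
{
    QueuePos    head = shared_head;     /* mutate a local copy only */

    head.offset += entry_len;
    if (head.offset >= page_size)
    {
        /* Create the next page *before* publishing the advanced position. */
        if (!alloc_page(head.page + 1))
            return false;               /* shared_head is untouched on failure */
        head.page++;
        head.offset = 0;
    }

    shared_head = head;                 /* publish only after every step succeeded */
    return true;
}

int
main(void)
{
    for (int i = 0; i < 40; i++)
        if (!append_entry(100, 1000))
            printf("append failed; head still at page %d offset %d\n",
                   shared_head.page, shared_head.offset);
    return 0;
}

On failure the shared position still points at a page that really exists, so later attempts are not wedged the way the unpatched async.c was.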
[Resending as the original post didn't get through to the list]

Warming up an old thread here - we ran into the same problem. Database is 9.1.4/x86_64 from Debian/testing. The client application is bucardo hammering the database with NOTIFYs (including some master-master replication conflicts, which might add to the parallel NOTIFY load).

The problem is reproducible with the attached instructions (several ENOSPC cycles might be required). When the filesystem is filled using dd, the bucardo and psql processes will die with this error (the server runs with a German locale; translated, the error is ENOSPC):

ERROR: 53100: could not access status of transaction 0
DETAIL: Could not write to file "pg_notify/0000" at offset 180224: No space left on device.
LOCATION: SlruReportIOError, slru.c:861

The line number might be different, sometimes it's ENOENT, sometimes even "Success". Even after disk space is available again, subsequent "NOTIFY foobar" calls will die, without any other clients connected:

ERROR: XX000: could not access status of transaction 0
DETAIL: Could not read from file "pg_notify/0000" at offset 245760: Success.
LOCATION: SlruReportIOError, slru.c:854

Here's a backtrace, caught at slru.c:430:

430             SlruReportIOError(ctl, pageno, xid);
(gdb) bt
#0  SimpleLruReadPage (ctl=ctl@entry=0xb192a0, pageno=30, write_ok=write_ok@entry=1 '\001', xid=xid@entry=0)
    at /home/martin/debian/psql/9.1/build-area/postgresql-9.1-9.1.4/build/../src/backend/access/transam/slru.c:430
#1  0x0000000000520d2f in asyncQueueAddEntries (nextNotify=nextNotify@entry=0x29b60c8)
    at /home/martin/debian/psql/9.1/build-area/postgresql-9.1-9.1.4/build/../src/backend/commands/async.c:1318
#2  0x000000000052187f in PreCommit_Notify ()
    at /home/martin/debian/psql/9.1/build-area/postgresql-9.1-9.1.4/build/../src/backend/commands/async.c:869
#3  0x00000000004973d3 in CommitTransaction ()
    at /home/martin/debian/psql/9.1/build-area/postgresql-9.1-9.1.4/build/../src/backend/access/transam/xact.c:1827
#4  0x0000000000497a8d in CommitTransactionCommand ()
    at /home/martin/debian/psql/9.1/build-area/postgresql-9.1-9.1.4/build/../src/backend/access/transam/xact.c:2562
#5  0x0000000000649497 in finish_xact_command ()
    at /home/martin/debian/psql/9.1/build-area/postgresql-9.1-9.1.4/build/../src/backend/tcop/postgres.c:2452
#6  finish_xact_command ()
    at /home/martin/debian/psql/9.1/build-area/postgresql-9.1-9.1.4/build/../src/backend/tcop/postgres.c:2441
#7  0x000000000064c875 in exec_simple_query (query_string=0x2a99d70 "notify foobar;")
    at /home/martin/debian/psql/9.1/build-area/postgresql-9.1-9.1.4/build/../src/backend/tcop/postgres.c:1037
#8  PostgresMain (argc=<optimized out>, argv=argv@entry=0x29b1df8, username=<optimized out>)
    at /home/martin/debian/psql/9.1/build-area/postgresql-9.1-9.1.4/build/../src/backend/tcop/postgres.c:3968
#9  0x000000000060e731 in BackendRun (port=0x2a14800)
    at /home/martin/debian/psql/9.1/build-area/postgresql-9.1-9.1.4/build/../src/backend/postmaster/postmaster.c:3611
#10 BackendStartup (port=0x2a14800)
    at /home/martin/debian/psql/9.1/build-area/postgresql-9.1-9.1.4/build/../src/backend/postmaster/postmaster.c:3296
#11 ServerLoop ()
    at /home/martin/debian/psql/9.1/build-area/postgresql-9.1-9.1.4/build/../src/backend/postmaster/postmaster.c:1460
#12 0x000000000060f451 in PostmasterMain (argc=argc@entry=5, argv=argv@entry=0x29b1170)
    at /home/martin/debian/psql/9.1/build-area/postgresql-9.1-9.1.4/build/../src/backend/postmaster/postmaster.c:1121
#13 0x0000000000464bc9 in main (argc=5, argv=0x29b1170)
    at /home/martin/debian/psql/9.1/build-area/postgresql-9.1-9.1.4/build/../src/backend/main/main.c:199

Restarting the cluster seems to fix the condition in some cases, but I've seen the error persist over restarts, or reappear after some time even without disk full. (That's also what the customer on the live system is seeing.)

Christoph

--
cb@df7cb.de | http://www.df7cb.de/
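To make it concrete how this surfaces to a client: the notification is only pushed into pg_notify at commit time (PreCommit_Notify in the backtrace above), so the statement that fails is the COMMIT, not the NOTIFY itself. Below is a minimal libpq sketch of that, with an assumed connection string and simplified error handling.

#include <stdio.h>
#include <libpq-fe.h>

static int
run(PGconn *conn, const char *sql)
{
    PGresult   *res = PQexec(conn, sql);
    int         ok = (PQresultStatus(res) == PGRES_COMMAND_OK);

    if (!ok)
        fprintf(stderr, "\"%s\" failed: %s", sql, PQresultErrorMessage(res));
    PQclear(res);
    return ok;
}

int
main(void)
{
    PGconn *conn = PQconnectdb("dbname=cc");

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }

    run(conn, "BEGIN");
    run(conn, "NOTIFY foobar");     /* the NOTIFY statement itself returns OK */
    /* With the pg_notify page missing, this is the statement that errors out */
    if (!run(conn, "COMMIT"))
        fprintf(stderr, "notification was not delivered\n");

    PQfinish(conn);
    return 0;
}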
Re: Tom Lane 2012-06-29 <8308.1340933751@sss.pgh.pa.us>
> in async.c that would try to create the missing page. This error will
> persist until the server is restarted (thus resetting the state of
> pg_notify), even if the underlying disk-full condition is cleared.

Thanks for digging into this and providing a patch. I've applied it to 9.1.4-2 here, and now the server recovers more gracefully from the error (the log is from a German locale; translated, the error is ENOSPC):

2012-07-02 17:45:59 CEST ERROR: could not access status of transaction 0
2012-07-02 17:45:59 CEST DETAIL: Could not write to file "pg_notify/0000" at offset 32768: No space left on device.
2012-07-02 17:45:59 CEST STATEMENT: commit

A subsequent "notify foobar" command did succeed after clearing the disk space.

What is still puzzling me is that the customer is repeatedly reporting these issues, even after rebooting the system. I am not 100% sure at which point the disk-full situation was resolved, but at least it sounds like the problem would reappear after restarts. I will need to get more detailed reports on the order of events there to make sure.

Christoph

--
cb@df7cb.de | http://www.df7cb.de/
Christoph Berg <christoph.berg@credativ.de> writes:
> What is still puzzling me is that the customer is repeatedly reporting
> these issues, even after rebooting the system.

Hm. A server restart really ought to clear any problem in this area, since pg_notify gets reset to starting conditions. Please check into that more closely. (It might be worth looking at $PGDATA/pg_notify/ after a restart just to verify there are no files there except 0000.)

			regards, tom lane
Re: Tom Lane 2012-07-02 <18088.1341245035@sss.pgh.pa.us>
> Christoph Berg <christoph.berg@credativ.de> writes:
> > What is still puzzling me is that the customer is repeatedly reporting
> > these issues, even after rebooting the system.
>
> Hm. A server restart really ought to clear any problem in this area,
> since pg_notify gets reset to starting conditions. Please check into
> that more closely. (It might be worth looking at $PGDATA/pg_notify/
> after a restart just to verify there are no files there except 0000.)

Hi,

we deployed an updated 9.1.4 package with your patch there. After restarting the server, pg_notify was empty except for 0000. fsck.ext3 reported no errors.

Now the error has happened again: 58P01, could not read status of transaction 0, could not open pg_notify/0016, ENOENT, location SlruReportIOError, slru.c:840. (Roughly translated from the German message.)

We'll try to dig into the code here, maybe we can spot something.

Kind regards,
Christoph Berg

--
Tel.: +49 (0)21 61 / 46 43-187
credativ GmbH, HRB Mönchengladbach 12080
Hohenzollernstr. 133, 41061 Mönchengladbach
Managing Directors: Dr. Michael Meskes, Jörg Folz