Re: Notify system doesn't recover from "No space" error - Mailing list pgsql-hackers

From Christoph Berg
Subject Re: Notify system doesn't recover from "No space" error
Msg-id 20120629082430.GA905@msgid.df7cb.de
In response to Notify system doesn't recover from "No space" error  ("Kevin Grittner" <Kevin.Grittner@wicourts.gov>)
List pgsql-hackers
[Resending as the original post didn't get through to the list]

Warming up an old thread here - we ran into the same problem.

The database is PostgreSQL 9.1.4/x86_64 from Debian/testing. The client
application is bucardo, which hammers the database with NOTIFYs
(including some master-master replication conflicts, which might add to
the parallel NOTIFY load).

The problem is reproducible with the attached instructions (several
ENOSPC cycles might be required). When the filesystem is filled using
dd, the bucardo and psql processes will die with this error:

ERROR:  53100: could not access status of transaction 0
DETAIL:  Could not write to file "pg_notify/0000" at offset 180224: No space left on device.
LOCATION:  SlruReportIOError, slru.c:861

The line number might be different, and the reported error varies:
sometimes it's ENOENT, sometimes even "Success".

Even after disk space is available again, subsequent "NOTIFY foobar"
calls will die, without any other clients connected:

ERROR:  XX000: could not access status of transaction 0
DETAIL:  Could not read from file "pg_notify/0000" at offset 245760: Success.
LOCATION:  SlruReportIOError, slru.c:854
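
One possible reading of the odd "Success" text, offered as an assumption and not as a confirmed diagnosis: if the segment file is shorter than the page being requested, the read comes back short without any OS-level error, so there is no real errno to report. The Python sketch below only illustrates that behaviour on a scratch file. The names `read_page` and `demo`, the 22-page file length, and the 8192-byte block size are assumptions chosen for illustration; none of this is PostgreSQL code.

```python
import os
import tempfile

BLCKSZ = 8192  # assumed: PostgreSQL's default block size (a compile-time option)


def read_page(path, pageno):
    """Read one BLCKSZ page; return (data, complete).

    complete is False when fewer than BLCKSZ bytes came back. A short
    read raises no exception: the OS reports no error at all.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        data = os.pread(fd, BLCKSZ, pageno * BLCKSZ)
    finally:
        os.close(fd)
    return data, len(data) == BLCKSZ


def demo(pageno):
    # Hypothetical stand-in for a segment file that stops after 22 pages
    # (offset 180224), e.g. because an earlier write hit ENOSPC.
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "0000")
        with open(path, "wb") as f:
            f.write(b"\0" * (22 * BLCKSZ))
        return read_page(path, pageno)


if __name__ == "__main__":
    for pageno in (21, 30):
        data, complete = demo(pageno)
        print(pageno, len(data), complete)
```

Under these assumptions, reading page 21 returns a full page, while reading page 30 (offset 245760) returns no data and no error, which is the kind of situation in which an error report has nothing better to print than the message for errno 0.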

Here's a backtrace, caught at slru.c:430:

430                SlruReportIOError(ctl, pageno, xid);
(gdb) bt
#0  SimpleLruReadPage (ctl=ctl@entry=0xb192a0, pageno=30, write_ok=write_ok@entry=1 '\001', xid=xid@entry=0)
    at /home/martin/debian/psql/9.1/build-area/postgresql-9.1-9.1.4/build/../src/backend/access/transam/slru.c:430
#1  0x0000000000520d2f in asyncQueueAddEntries (nextNotify=nextNotify@entry=0x29b60c8)
    at /home/martin/debian/psql/9.1/build-area/postgresql-9.1-9.1.4/build/../src/backend/commands/async.c:1318
#2  0x000000000052187f in PreCommit_Notify ()
    at /home/martin/debian/psql/9.1/build-area/postgresql-9.1-9.1.4/build/../src/backend/commands/async.c:869
#3  0x00000000004973d3 in CommitTransaction ()
    at /home/martin/debian/psql/9.1/build-area/postgresql-9.1-9.1.4/build/../src/backend/access/transam/xact.c:1827
#4  0x0000000000497a8d in CommitTransactionCommand ()
    at /home/martin/debian/psql/9.1/build-area/postgresql-9.1-9.1.4/build/../src/backend/access/transam/xact.c:2562
#5  0x0000000000649497 in finish_xact_command ()
    at /home/martin/debian/psql/9.1/build-area/postgresql-9.1-9.1.4/build/../src/backend/tcop/postgres.c:2452
#6  finish_xact_command ()
    at /home/martin/debian/psql/9.1/build-area/postgresql-9.1-9.1.4/build/../src/backend/tcop/postgres.c:2441
#7  0x000000000064c875 in exec_simple_query (query_string=0x2a99d70 "notify foobar;")
    at /home/martin/debian/psql/9.1/build-area/postgresql-9.1-9.1.4/build/../src/backend/tcop/postgres.c:1037
#8  PostgresMain (argc=<optimized out>, argv=argv@entry=0x29b1df8, username=<optimized out>)
    at /home/martin/debian/psql/9.1/build-area/postgresql-9.1-9.1.4/build/../src/backend/tcop/postgres.c:3968
#9  0x000000000060e731 in BackendRun (port=0x2a14800)
    at /home/martin/debian/psql/9.1/build-area/postgresql-9.1-9.1.4/build/../src/backend/postmaster/postmaster.c:3611
#10 BackendStartup (port=0x2a14800)
    at /home/martin/debian/psql/9.1/build-area/postgresql-9.1-9.1.4/build/../src/backend/postmaster/postmaster.c:3296
#11 ServerLoop ()
    at /home/martin/debian/psql/9.1/build-area/postgresql-9.1-9.1.4/build/../src/backend/postmaster/postmaster.c:1460
#12 0x000000000060f451 in PostmasterMain (argc=argc@entry=5, argv=argv@entry=0x29b1170)
    at /home/martin/debian/psql/9.1/build-area/postgresql-9.1-9.1.4/build/../src/backend/postmaster/postmaster.c:1121
#13 0x0000000000464bc9 in main (argc=5, argv=0x29b1170)
    at /home/martin/debian/psql/9.1/build-area/postgresql-9.1-9.1.4/build/../src/backend/main/main.c:199
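
The pageno=30 in frame #0 lines up with the offset in the read error above, assuming the default 8192-byte block size and 32 pages per SLRU segment file (both are assumptions about this build, not stated in the report). A small Python sketch of the arithmetic, with hypothetical helper names:

```python
BLCKSZ = 8192                 # assumed default block size
SLRU_PAGES_PER_SEGMENT = 32   # assumed pages per segment file


def slru_location(pageno):
    """Map a page number to (segment file name, byte offset in that file)."""
    segno = pageno // SLRU_PAGES_PER_SEGMENT
    offset = (pageno % SLRU_PAGES_PER_SEGMENT) * BLCKSZ
    return "%04X" % segno, offset


def page_from_offset(segname, offset):
    """Inverse mapping: recover the page number from file name and offset."""
    return int(segname, 16) * SLRU_PAGES_PER_SEGMENT + offset // BLCKSZ


if __name__ == "__main__":
    print(slru_location(30))                 # page from the backtrace
    print(page_from_offset("0000", 245760))  # offset from the read error
    print(page_from_offset("0000", 180224))  # offset from the write error
```

With these constants, page 30 maps to offset 245760 in pg_notify/0000, matching the failing read, and the failed write at offset 180224 corresponds to page 22.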


Restarting the cluster seems to fix the condition in some cases, but
I've seen the error persist across restarts, or reappear after some time
even when the disk is not full. (That's also what the customer on the
live system is seeing.)

Christoph
--
cb@df7cb.de | http://www.df7cb.de/

Attachment
