Hard limit on WAL space used (because PANIC sucks) - Mailing list pgsql-hackers

In the "Redesigning checkpoint_segments" thread, many people opined that 
there should be a hard limit on the amount of disk space used for WAL: 
http://www.postgresql.org/message-id/CA+TgmoaOkgZb5YsmQeMg8ZVqWMtR=6S4-PPd+6jiy4OQ78ihUA@mail.gmail.com. 
I'm starting a new thread on that, because that's mostly orthogonal to 
redesigning checkpoint_segments.

The current situation is that if you run out of disk space while writing 
WAL, you get a PANIC, and the server shuts down. That's awful. We can 
try to avoid that by checkpointing early enough, so that we can remove 
old WAL segments to make room for new ones before you run out, but 
unless we somehow throttle or stop new WAL insertions, it's always going 
to be possible to use up all disk space. A typical scenario where that 
happens is when archive_command fails for some reason; even a checkpoint 
can't remove old, unarchived segments in that case. But it can happen 
even without WAL archiving.

I've seen a case where it was even worse than a PANIC and shutdown. 
pg_xlog was on a separate partition that had nothing else on it. The 
partition filled up, and the system shut down with a PANIC. Because 
there was no space left, it could not even write the checkpoint after 
recovery, and thus refused to start up again. There was nothing else on 
the partition that you could delete to make space. The only recourse 
would've been to add more disk space to the partition (impossible), or 
manually delete an old WAL file that was not needed to recover from the 
latest checkpoint (scary). Fortunately this was a test system, so we 
just deleted everything.

So we need to somehow stop new WAL insertions from happening, before 
it's too late.

Peter Geoghegan suggested one method here: 
http://www.postgresql.org/message-id/flat/CAM3SWZQcyNxvPaskr-pxm8DeqH7_qevW7uqbhPCsg1FpSxKpoQ@mail.gmail.com. 
I don't think that exact proposal is going to work very well; throttling 
WAL flushing by holding WALWriteLock in WAL writer can have knock-on 
effects on the whole system, as Robert Haas mentioned. Also, it'd still 
be possible to run out of space, just more difficult.

To make sure there is enough room for the checkpoint to finish, other 
WAL insertions have to stop some time before you completely run out of 
disk space. The question is how to do that.

A naive idea is to check if there's enough preallocated WAL space, just 
before inserting the WAL record. However, it's too late to check that in 
XLogInsert; once you get there, you're already holding exclusive locks 
on data pages, and you are in a critical section so you can't back out. 
At that point, you have to write the WAL record quickly, or the whole 
system will suffer. So we need to act earlier.

A more workable idea is to sprinkle checks in higher-level code, before 
you hold any critical locks, to check that there is enough preallocated 
WAL. Like, at the beginning of heap_insert, heap_update, etc., and all 
similar indexam entry points. I propose that we maintain a WAL 
reservation system in shared memory. First of all, keep track of how 
much preallocated WAL there is left (and try to create more if needed). 
Also keep track of a different number: the amount of WAL pre-reserved 
for future insertions. Before entering the critical section, increase 
the reserved number by a conservative estimate (i.e. high enough) of 
how much WAL space you need, and check that there is still enough 
preallocated WAL to satisfy all the reservations. If not, throw an error 
or sleep until there is. After you're done with the insertion, release 
the reservation by decreasing the number again.
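
To make the bookkeeping concrete, here is a minimal sketch of the two counters using C11 atomics. It is not a patch: the struct and the functions wal_reserve() and wal_release() are hypothetical names that do not exist in PostgreSQL, and deducting the bytes actually written from the preallocated total at release time is an assumption of this sketch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical shared-memory state; these names are illustrative only. */
typedef struct WalReservationState
{
    _Atomic uint64_t prealloc_bytes;  /* preallocated WAL space still unused */
    _Atomic uint64_t reserved_bytes;  /* sum of all outstanding reservations */
} WalReservationState;

/*
 * Reserve 'amount' bytes before entering a critical section.
 * Returns false if the preallocated space cannot cover all reservations;
 * the caller would then throw an error, or sleep and retry.
 */
static bool
wal_reserve(WalReservationState *st, uint64_t amount)
{
    /* Add our estimate first, then check the new total. */
    uint64_t total = atomic_fetch_add(&st->reserved_bytes, amount) + amount;

    if (total > atomic_load(&st->prealloc_bytes))
    {
        /* Not enough room: back out our reservation. */
        atomic_fetch_sub(&st->reserved_bytes, amount);
        return false;
    }
    return true;
}

/*
 * Release a reservation after the insertion is done. 'used' is the WAL
 * actually written; it must not exceed the conservative estimate.
 * Assumption of this sketch: the used bytes come out of prealloc_bytes here.
 */
static void
wal_release(WalReservationState *st, uint64_t reserved, uint64_t used)
{
    assert(used <= reserved);
    /* Shrink the free space before dropping the reservation, so that a
     * concurrent wal_reserve() only ever sees a pessimistic picture. */
    atomic_fetch_sub(&st->prealloc_bytes, used);
    atomic_fetch_sub(&st->reserved_bytes, reserved);
}
```

A caller such as heap_insert would call wal_reserve() before taking any buffer locks, and wal_release() after leaving the critical section.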

A shared reservation counter like that could become a point of 
contention. One optimization is to keep a constant reservation of, say, 
32 KB for each backend. That's enough for most operations. Change the logic 
so that you check if you've exceeded the reserved amount of space 
*after* writing the WAL record, while you're holding WALInsertLock 
anyway. If you do go over the limit, set a flag in backend-private 
memory indicating that the *next* time you're about to enter a critical 
section where you will write a WAL record, you check again if more space 
has been made available.
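
The backend-private part of that optimization could look roughly like the sketch below. All names are hypothetical. The stub_space_available variable merely stands in for the real check against shared memory, and treating the 32 KB as a per-operation budget that is reset when the next operation begins is an assumption, since the description above leaves that detail open.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BACKEND_WAL_RESERVATION (32 * 1024)  /* the 32 KB standing reservation */

/* Backend-private state; hypothetical names, not PostgreSQL code. */
typedef struct BackendWalBudget
{
    uint64_t used_bytes;    /* WAL written by the current operation */
    bool     must_recheck;  /* set when an operation overran the reservation */
} BackendWalBudget;

/* Stand-in for the shared-memory check "has more space been made available?" */
static bool stub_space_available = true;

static bool
shared_space_available(void)
{
    return stub_space_available;
}

/*
 * Call after writing a WAL record, while still holding the insert lock.
 * This only touches backend-private memory, so it adds no contention.
 */
static void
budget_after_insert(BackendWalBudget *b, uint64_t record_len)
{
    b->used_bytes += record_len;
    if (b->used_bytes > BACKEND_WAL_RESERVATION)
        b->must_recheck = true;
}

/*
 * Call before entering the next critical section that will write WAL.
 * Returns false if the backend overran its reservation earlier and space
 * is still short; the caller would then throw an error or sleep.
 */
static bool
budget_begin_operation(BackendWalBudget *b)
{
    if (b->must_recheck)
    {
        /* Slow path: only now do we consult shared state. */
        if (!shared_space_available())
            return false;
        b->must_recheck = false;
    }
    b->used_bytes = 0;  /* assumption: the budget applies per operation */
    return true;
}
```

In the common case the flag is clear and budget_begin_operation() never looks at shared memory, which is the point of the optimization.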

- Heikki
