Re: Hard limit on WAL space used (because PANIC sucks) - Mailing list pgsql-hackers

From: Simon Riggs
Subject: Re: Hard limit on WAL space used (because PANIC sucks)
Msg-id: CA+U5nM+ipFyK_cNkP=NdF72mTpMzyv=hsFaN-Si1Wx=85PmP+A@mail.gmail.com
In response to: Re: Hard limit on WAL space used (because PANIC sucks) (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: Hard limit on WAL space used (because PANIC sucks) (Greg Stark <stark@mit.edu>)
List: pgsql-hackers

On 21 January 2014 18:35, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
>> On 6 June 2013 16:00, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
>>> The current situation is that if you run out of disk space while writing
>>> WAL, you get a PANIC, and the server shuts down. That's awful.
>
>> I don't see we need to prevent WAL insertions when the disk fills. We
>> still have the whole of wal_buffers to use up. When that is full, we
>> will prevent further WAL insertions because we will be holding the
>> WALwritelock to clear more space. So the rest of the system will lock
>> up nicely, like we want, apart from read-only transactions.
>
> I'm not sure that "all writing transactions lock up hard" is really so
> much better than the current behavior.

Lock up momentarily, until the situation clears. But my proposal would
allow the situation to fully clear, i.e. all WAL files could be
deleted as soon as replication/archiving has caught up. The current
behaviour doesn't automatically correct itself as this proposal would.
My proposal is also fully safe with synchronous replication, and it
adds zero performance overhead to mainline processing.

> My preference would be that we simply start failing writes with ERRORs
> rather than PANICs.

Yes, that is what I am proposing, amongst other points.

> I'm not real sure ATM why this has to be a PANIC
> condition.  Probably the cause is that it's being done inside a critical
> section, but could we move that?

Yes, I think so.
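
For anyone following along: the reason a failure here surfaces as a
PANIC is that PostgreSQL promotes any ERROR raised inside a critical
section to PANIC. Below is a minimal, self-contained sketch of that
rule and of what "moving it" would mean. It is a toy model, not
PostgreSQL source: the level constants, the counter, and both
write_wal_* functions are hypothetical stand-ins made up for
illustration.

```c
#include <assert.h>

/* Illustrative stand-ins for error levels; not PostgreSQL's real values. */
enum { LVL_OK = 0, LVL_ERROR = 1, LVL_PANIC = 2 };

/* Toy counterpart of a critical-section nesting counter. */
int crit_section_count = 0;

/* Promotion rule: ERROR (or worse) inside a critical section becomes PANIC. */
int effective_level(int elevel)
{
    if (elevel >= LVL_ERROR && crit_section_count > 0)
        return LVL_PANIC;
    return elevel;
}

/* Hypothetical: the out-of-space condition is detected inside the section. */
int write_wal_inside_crit(int disk_full)
{
    int level = LVL_OK;

    crit_section_count++;               /* enter critical section */
    if (disk_full)
        level = effective_level(LVL_ERROR);
    assert(crit_section_count > 0);
    crit_section_count--;               /* leave critical section */
    return level;
}

/* Hypothetical: the space check is moved ahead of the critical section. */
int write_wal_check_first(int disk_full)
{
    if (disk_full)
        return effective_level(LVL_ERROR);  /* still a plain ERROR here */

    crit_section_count++;               /* enter critical section */
    /* ... the actual WAL write would happen here ... */
    assert(crit_section_count > 0);
    crit_section_count--;               /* leave critical section */
    return LVL_OK;
}
```

With the disk full, write_wal_inside_crit(1) yields LVL_PANIC while
write_wal_check_first(1) yields LVL_ERROR, which is the difference
being discussed. The sketch says nothing about how hard it is to
hoist the real check out of the real critical section.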

>> Instead of PANICing, we should simply signal the checkpointer to
>> perform a shutdown checkpoint.
>
> And if that fails for lack of disk space?

I proposed a way to ensure it wouldn't fail for that, at least on pg_xlog space.

> In any case, what you're
> proposing sounds like a lot of new complication in a code path that
> is necessarily never going to be terribly well tested.

It's the smallest amount of change proposed so far... I agree on the
danger of untested code.

-- 
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


