Re: WAL Rate Limiting - Mailing list pgsql-hackers

From: Robert Haas
Subject: Re: WAL Rate Limiting
Date:
Msg-id: CA+Tgmob6Ha+sBHaXEbR8+Y67i=aTyG1FaV-WEtJ7Pv+td8DtnA@mail.gmail.com
In response to: Re: WAL Rate Limiting (Greg Stark <stark@mit.edu>)
Responses: Re: WAL Rate Limiting (Simon Riggs <simon@2ndQuadrant.com>)
List: pgsql-hackers
On Wed, Feb 19, 2014 at 8:28 AM, Greg Stark <stark@mit.edu> wrote:
> On Mon, Jan 20, 2014 at 5:37 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>
>> Agreed; that was the original plan, but implementation delays
>> prevented the whole vision/discussion/implementation. Requirements
>> from various areas include WAL rate limiting for replication, I/O rate
>> limiting, hard CPU and I/O limits for security and mixed workload
>> coexistence.
>>
>> I'd still like to get something on this in 9.4 that alleviates the
>> replication issues, leaving wider changes for later releases.
>
> My first reaction was that we should just have generic I/O resource
> throttling. I was only convinced this was a reasonable idea by the
> replication use case. It would help me to understand the specific
> situations where replication breaks down due to WAL bandwidth
> starvation. Heroku has had some problems with slaves falling behind,
> though the immediate problem that causes is the slave filling up its
> disk, which we could solve more directly by switching to archive mode
> rather than slowing down the master.
>
> But I would suggest you focus on a specific use case that's
> problematic so we can judge better if the implementation is really
> fixing it.
>
>> The vacuum_* parameters don't allow any control over WAL production,
>> which is often the limiting factor. I could, for example, introduce a
>> new parameter for vacuum_cost_delay that provides a weighting for each
>> new BLCKSZ chunk of WAL, then rename all parameters to a more general
>> form. Or I could forget that and just press ahead with the patch as
>> is, providing a cleaner interface in the next release.
>>
>>> It's also interesting to wonder about the relationship to
>>> CHECK_FOR_INTERRUPTS --- although I think that currently, we assume
>>> that that's *cheap* (1 test and branch) as long as nothing is pending.
>>> I don't want to see a bunch of arithmetic added to it.
>>
>> Good point.
>
> I think it should be possible to actually merge it into
> CHECK_FOR_INTERRUPTS. Have a single global flag
> io_done_since_check_for_interrupts which is set to 0 after each
> CHECK_FOR_INTERRUPTS and set to 1 whenever any WAL is written. Then
> CHECK_FOR_INTERRUPTS turns into two tests and branches instead of one
> in the normal case.
>
> In fact you could do all the arithmetic when you do the WAL write.
> Only set the flag if the bandwidth consumed is above the budget. Then
> the flag should only ever be set when you're about to sleep.
>
> I would dearly love to see generic I/O bandwidth limits, so it would
> be nice to see a nicely general pattern here that could be extended
> even if we only target WAL this release.
>
> I'm going to read the existing patch now; do you think it's ready to
> go, or did you want to do more work based on the feedback?

Well, *I* don't think this is ready to go.  A WAL rate limit that only
limits WAL sometimes still doesn't impress me.
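
For concreteness, the vacuum_cost_delay weighting Simon mentions above
could charge WAL production into a cost balance roughly like the toy
model below. Every name and number in it is made up for illustration;
it is not the existing vacuum_cost_* machinery, just the shape of the
accounting.

#include <stdio.h>

#define BLCKSZ 8192

/* knobs, analogous in spirit to the vacuum_cost_* settings */
static int cost_limit = 200;        /* budget before sleeping */
static int cost_delay_ms = 20;      /* sleep length once over budget */
static int cost_wal_chunk = 10;     /* hypothetical weight per BLCKSZ of WAL */

static int cost_balance = 0;        /* cost accumulated since the last sleep */

/* Charge newly generated WAL; called after each record is inserted. */
static void
charge_wal(size_t wal_bytes)
{
    size_t chunks = (wal_bytes + BLCKSZ - 1) / BLCKSZ;  /* round up */
    cost_balance += (int) (chunks * cost_wal_chunk);
}

/* Delay point: sleep only once the budget is exhausted. */
static void
maybe_sleep(void)
{
    if (cost_balance >= cost_limit)
    {
        printf("balance %d >= limit %d: sleep %d ms\n",
               cost_balance, cost_limit, cost_delay_ms);
        cost_balance = 0;
    }
}

int
main(void)
{
    for (int page = 0; page < 10; page++)
    {
        charge_wal(3 * BLCKSZ);     /* pretend each page emits ~24kB of WAL */
        maybe_sleep();
    }
    return 0;
}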
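
And a rough sketch of the flag-based merge into CHECK_FOR_INTERRUPTS
that Greg describes: do the rate arithmetic where the WAL is written,
and raise the flag only when the budget is blown, so the common path in
CHECK_FOR_INTERRUPTS stays one extra test and branch. Apart from the
flag name Greg suggests, everything here (the budget variables,
throttle_wal, the stand-in macro) is invented for illustration.

#include <stdbool.h>
#include <stdio.h>

static volatile bool InterruptPending = false;
static volatile bool io_done_since_check_for_interrupts = false;

/* hypothetical per-interval budget, refilled elsewhere on a timer */
static long wal_budget_bytes = 64 * 1024;
static long wal_written_bytes = 0;

/* Called wherever WAL is written: charge the bytes, and raise the
 * flag only when this interval's budget is exhausted. */
static void
account_wal_write(long nbytes)
{
    wal_written_bytes += nbytes;
    if (wal_written_bytes > wal_budget_bytes)
        io_done_since_check_for_interrupts = true;
}

static void
handle_interrupt(void)
{
    InterruptPending = false;       /* placeholder for the usual handling */
}

static void
throttle_wal(void)
{
    printf("over WAL budget (%ld of %ld bytes): would sleep here\n",
           wal_written_bytes, wal_budget_bytes);
    wal_written_bytes = 0;          /* start a new interval */
    io_done_since_check_for_interrupts = false;
}

/* Stand-in for CHECK_FOR_INTERRUPTS: cheap when neither flag is set. */
#define CHECK_FOR_INTERRUPTS() \
    do { \
        if (InterruptPending) \
            handle_interrupt(); \
        if (io_done_since_check_for_interrupts) \
            throttle_wal(); \
    } while (0)

int
main(void)
{
    for (int i = 0; i < 20; i++)
    {
        account_wal_write(8192);    /* pretend we inserted a WAL record */
        CHECK_FOR_INTERRUPTS();
    }
    return 0;
}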

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


