Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance - Mailing list pgsql-hackers

From	Jim Nasby
Subject	Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance
Date	January 15, 2014 06:54:32
Msg-id	52D6066C.9020100@nasby.net
In response to	Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance (Dave Chinner <david@fromorbit.com>)
Responses	Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance (Dave Chinner <david@fromorbit.com>)
List	pgsql-hackers

On 1/14/14, 3:41 PM, Dave Chinner wrote:
> On Tue, Jan 14, 2014 at 09:40:48AM -0500, Robert Haas wrote:
>> On Mon, Jan 13, 2014 at 5:26 PM, Mel Gorman <mgorman@suse.de> wrote:
> IOWs, using sync_file_range() does not avoid the need to fsync() a
> file for data integrity purposes...

I believe the PG community understands that, but thanks for the heads-up.

>> Whether the problem is with the system
>> call or the programmer is harder to determine.  I think the problem is
>> in part that it's not exactly clear when we should call it.  So
>> suppose we want to do a checkpoint.  What we used to do a long time
>> ago is write everything, and then fsync it all, and then call it good.
>>   But that produced horrible I/O storms.  So what we do now is do the
>> writes over a period of time, with sleeps in between, and then fsync
>> it all at the end, hoping that the kernel will write some of it before
>> the fsyncs arrive so that we don't get a huge I/O spike.
>> And that sorta works, and it's definitely better than doing it all at
>> full speed, but it's pretty imprecise.  If the kernel doesn't write
>> enough of the data out in advance, then there's still a huge I/O storm
>> when we do the fsyncs and everything grinds to a halt.  If it writes
>> out more data than needed in advance, it increases the total number of
>> physical writes because we get less write-combining, and that hurts
>> performance, too.

I think there's a pretty important bit that Robert didn't mention: we have a specific *time* target for when we want
all the fsyncs to complete. People who have problems here tend to tune checkpoints to complete every 5-15 minutes, and
they want the write traffic for the checkpoint spread out over 90% of that time interval. To put it another way, fsyncs
should be done when 90% of the time to the next checkpoint hits, but preferably not a lot before then.
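
A minimal sketch of that timing arithmetic, for illustration only. It assumes writes are paced evenly up to the
deadline; the function names and the even-pacing rule are hypothetical and are not PostgreSQL's actual checkpointer
logic. In PostgreSQL terms the two inputs roughly correspond to the checkpoint_timeout and
checkpoint_completion_target settings.

```python
def fsync_deadline(interval_s: float, target: float = 0.9) -> float:
    """Seconds after checkpoint start by which all fsyncs should be done.

    interval_s: time between checkpoints (e.g. 300 for 5 minutes).
    target:     fraction of the interval over which to spread the work (assumed 0.9).
    """
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in (0, 1]")
    return interval_s * target


def pages_due(total_pages: int, elapsed_s: float,
              interval_s: float, target: float = 0.9) -> int:
    """Pages that should already be written at elapsed_s under even pacing.

    Hypothetical pacing rule: progress is linear in time until the deadline,
    then capped at 100%.
    """
    deadline = fsync_deadline(interval_s, target)
    fraction = min(max(elapsed_s, 0.0) / deadline, 1.0)
    return int(total_pages * fraction)
```

With a 5-minute checkpoint interval the deadline lands at 270 seconds, so halfway to the deadline (135 seconds) about
half of the dirty pages should have been written.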

> Yup, the kernel defaults to maximising bulk write throughput, which
> means it waits to the last possible moment to issue write IO. And
> that's exactly to maximise write combining, optimise delayed
> allocation, etc. There are many good reasons for doing this, and for
> the majority of workloads it is the right behaviour to have.
>
> It sounds to me like you want the kernel to start background
> writeback earlier so that it doesn't build up as much dirty data
> before you require a flush. There are several ways to do this by
> tweaking writeback knobs. The simplest is probably just to set
> /proc/sys/vm/dirty_background_bytes to an appropriate threshold (say
> 50MB) and dirty_expire_centiseconds to a few seconds so that
> background writeback starts and walks all dirty inodes almost
> immediately. This will keep a steady stream of low level background
> IO going, and fsync should then not take very long.

Except that still won't throttle writes, right? That's the big issue here: our users often can't tolerate big spikes in
IO latency. They want user requests to always happen within a specific amount of time.

So while delaying writes potentially reduces the total amount of data you're writing, users that run into problems here
ultimately care more about ensuring that their foreground IO completes in a timely fashion.
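
For reference, the writeback settings Dave suggests above could be written out as sysctl entries roughly as follows.
This is a sketch under assumptions: the helper name is hypothetical, "a few seconds" is taken to mean 3 seconds, and
the expiry knob is exposed by the kernel as vm.dirty_expire_centisecs (units of 1/100 second). It only formats the
lines; it does not apply them, and as noted above these settings start writeback earlier but do not throttle writes.

```python
from typing import List


def writeback_sysctl_lines(background_mb: int = 50,
                           expire_seconds: float = 3.0) -> List[str]:
    """Format sysctl.conf-style lines for earlier background writeback.

    background_mb:  dirty-data threshold at which background writeback starts
                    (50MB is the example figure from the quoted text).
    expire_seconds: age after which dirty data becomes eligible for writeback
                    (3 seconds is an assumed reading of "a few seconds").
    """
    background_bytes = background_mb * 1024 * 1024
    expire_centisecs = int(round(expire_seconds * 100))
    return [
        f"vm.dirty_background_bytes = {background_bytes}",
        f"vm.dirty_expire_centisecs = {expire_centisecs}",
    ]


if __name__ == "__main__":
    for line in writeback_sysctl_lines():
        print(line)
```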

> Fundamentally, though, we need bug reports from people seeing these
> problems when they see them so we can diagnose them on their
> systems. Trying to discuss/diagnose these problems without knowing
> anything about the storage, the kernel version, writeback
> thresholds, etc really doesn't work because we can't easily
> determine a root cause.

So is lsf-pc@linux-foundation.org the best way to accomplish that?

Also, along the lines of collaboration, it would be awesome to see kernel hackers at PGCon (http://pgcon.org) for
further discussion of this stuff. That is the conference that has more Postgres internal developers than any other.
There's a variety of different ways collaboration could happen there, so it's probably best to start a separate
discussion with those from the Linux community who'd be interested in attending. PGCon also directly follows BSDCan
(http://bsdcan.org) at the same venue... so we could potentially kill two OS birds with one stone, so to speak... :) If
there's enough interest we could potentially do a "mini Postgres/OS conference" in between BSDCan and the formal PGCon.
There's also potential for the Postgres community to sponsor attendance for kernel hackers if money is a factor.

Like I said... best to start a separate thread if there's significant interest in meeting at PGCon. :)
-- 
Jim C. Nasby, Data Architect                       jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net
