Re: checkpoint writeback via sync_file_range - Mailing list pgsql-hackers

From Greg Smith
Subject Re: checkpoint writeback via sync_file_range
Msg-id 4F0FA454.3030604@2ndQuadrant.com
In response to Re: checkpoint writeback via sync_file_range  (Andres Freund <andres@anarazel.de>)
Responses Re: checkpoint writeback via sync_file_range  (Jeff Janes <jeff.janes@gmail.com>)
List pgsql-hackers
On 1/11/12 9:25 AM, Andres Freund wrote:
> The heavy pressure putting it directly in the writeback queue
> leads to less efficient io because quite often it won't reorder sensibly with
> other io anymore and the like. At least that was my experience using it
> in another application.

Sure, this is one of the things I was cautioning about in the Double 
Writes thread, with VACUUM being the worst such case I've measured.

The thing to realize here is that the data we're talking about must be 
flushed to disk in the near future.  And Linux will happily cache 
gigabytes of it.  Right now, the database asks for that to be forced to 
disk via fsync on each relation segment file, which means in chunks 
that can be as large as a gigabyte.
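
You can watch that buildup directly on any Linux system; the Dirty and 
Writeback lines in /proc/meminfo report (in kB) how much data the 
kernel is currently holding back and how much it's actively flushing:

    $ grep -E '^(Dirty|Writeback):' /proc/meminfo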

Let's say we have a traditional storage array and there's competing 
activity.  10MB/s would be a good random I/O write rate in that 
situation.  A single fsync that forces 1GB out at that rate will take 
*100 seconds*.  And I've seen results close to exactly that--about 80 
seconds is my current worst checkpoint stall so far.
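
To make that failure mode concrete, here's a standalone sketch--not 
database code, and the file name is just a placeholder--that dirties 
about 1GB of page cache and then times the single fsync() call that has 
to push it all out:

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>

    int main(void)
    {
        char            buf[8192];
        struct timespec t0, t1;
        int             fd;
        long            i;

        memset(buf, 'x', sizeof(buf));
        fd = open("some_segment_file", O_WRONLY | O_CREAT, 0600);
        if (fd < 0) { perror("open"); return 1; }

        /* Write ~1GB; most of it sits as dirty page cache for now. */
        for (i = 0; i < 131072; i++)
            if (write(fd, buf, sizeof(buf)) != sizeof(buf))
                { perror("write"); return 1; }

        /* This is the checkpoint-style stall: one blocking call that
         * has to force everything dirty in this file out to disk. */
        clock_gettime(CLOCK_MONOTONIC, &t0);
        if (fsync(fd) != 0) { perror("fsync"); return 1; }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        printf("fsync took %.1f seconds\n",
               (t1.tv_sec - t0.tv_sec) +
               (t1.tv_nsec - t0.tv_nsec) / 1e9);
        close(fd);
        return 0;
    }

On a busy array doing mostly random writes, the number that prints can 
easily land in the tens of seconds.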

And we don't have a latency vs. throughput knob any finer than that.  If 
one is added, and you turn it too far toward latency, throughput is 
going to tank for the reasons you've also seen:  less reordering, 
elevator sorting, and write combining.  If the database isn't going to 
micro-manage the writes, it needs to give the OS room to do that work 
for it.

The most popular OS-level approach to adjusting for this trade-off seems 
to be "limit the cache size" (the usual Linux knobs for that are 
sketched below).  That hasn't worked out very well when I've tried it, 
again because it leaves too little working room to reorganize queued 
writes usefully.  One theory I've considered is that we might improve 
the VACUUM side of that using the same auto-tuning approach that's been 
applied to two other areas now:  scale the maximum size of the ring 
buffers based on shared_buffers.  I'm not really confident in that idea 
though, because ultimately it won't change the rate at which dirty 
buffers from VACUUM are evicted--and that's the source of the 
bottleneck in that area.
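
For anyone who wants to experiment with the cache-limiting approach, the 
usual Linux knobs are the vm.dirty_* sysctls; the values here are only 
placeholders, not a recommendation:

    # /etc/sysctl.conf -- cap how much dirty data the kernel buffers
    # Start background write-back once 256MB of dirty data accumulates
    vm.dirty_background_bytes = 268435456
    # Throttle writers once 1GB of dirty data accumulates
    vm.dirty_bytes = 1073741824

Setting the _bytes forms overrides the corresponding _ratio settings, 
which matters on servers with lots of RAM, where even a small ratio 
translates into gigabytes of allowed dirty data.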

There is one piece of information the database knows, but it isn't 
communicating well to the OS yet.  It could do a better job of advising 
the OS how to prioritize the writes that must happen soon--but not 
necessarily right now.  Yes, forcing them into write-back will be 
counterproductive from a throughput perspective.  The longer they sit 
at the "Dirty" cache level above that, the better the odds they'll be 
written efficiently.  But this is the checkpoint process we're talking 
about here.  It's going to force that data to disk soon regardless.  An 
intermediate step pushing to write-back should give the OS a bit more 
room to move around than fsync does, making the potential for a latency 
gain here seem quite real.  We'll see how the benchmarking goes.
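
For reference, the call pattern I have in mind looks roughly like the 
sketch below.  This is just an illustration of the Linux 
sync_file_range() interface, with hypothetical helper names, not the 
actual patch:  the checkpoint writer would hint ranges into write-back 
as it goes, and the final fsync() still provides the durability 
guarantee.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    /* Ask the kernel to start writing back a dirty range of the file.
     * This initiates write-out but doesn't wait for completion and
     * doesn't flush drive caches, so it is a scheduling hint rather
     * than a durability guarantee.  Passing nbytes = 0 means "through
     * the end of the file". */
    static int
    hint_writeback(int fd, off_t offset, off_t nbytes)
    {
        return sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
    }

    /* At checkpoint completion, durability still comes from fsync. */
    static int
    checkpoint_sync(int fd)
    {
        return fsync(fd);
    }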

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com

