Re: checkpoint writeback via sync_file_range - Mailing list pgsql-hackers

From Greg Smith
Subject Re: checkpoint writeback via sync_file_range
Date
Msg-id 4F0D1234.1020300@2ndQuadrant.com
In response to checkpoint writeback via sync_file_range  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: checkpoint writeback via sync_file_range  (Simon Riggs <simon@2ndQuadrant.com>)
Re: checkpoint writeback via sync_file_range  (Florian Weimer <fweimer@bfk.de>)
Re: checkpoint writeback via sync_file_range  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On 1/10/12 9:14 PM, Robert Haas wrote:
> Based on that, I whipped up the attached patch, which,
> if sync_file_range is available, simply iterates through everything
> that will eventually be fsync'd before beginning the write phase and
> tells the Linux kernel to put them all under write-out.

I hadn't really thought of using it that way.  The kernel expects that 
when this is called the normal way, you're going to track exactly which 
segments you want it to sync.  And that data isn't really passed through 
the fsync absorption code yet; the list of things to fsync has already 
lost that level of detail.

What you're doing here doesn't care though, and I hadn't considered that 
SYNC_FILE_RANGE_WRITE could be used that way on my last pass through its 
docs.  Used this way, it's basically fsync without the wait or 
guarantee; it just tries to push what's already dirty further ahead in 
the write queue than those writes would otherwise be.
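
A minimal sketch of that usage, for illustration only.  This is not the 
code from the attached patch; the helper name start_writeback_hint and 
the path-based interface are assumptions.  It passes offset 0 and nbytes 
0, which sync_file_range treats as "through end of file", together with 
SYNC_FILE_RANGE_WRITE alone, so the call starts write-out and returns 
without waiting:

```c
#define _GNU_SOURCE             /* exposes sync_file_range() in <fcntl.h> */
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper, not the function from the patch: ask the kernel to
 * start write-out of the dirty pages of each listed file, without waiting.
 * Returns how many files accepted the hint.
 */
int
start_writeback_hint(const char *const *paths, int npaths)
{
    int hinted = 0;

    assert(paths != NULL || npaths == 0);
    for (int i = 0; i < npaths; i++)
    {
        int fd = open(paths[i], O_RDWR);

        if (fd < 0)
            continue;           /* only a hint, so skip files we cannot open */
#ifdef SYNC_FILE_RANGE_WRITE
        /* offset 0 with nbytes 0 covers the file from start to end */
        if (sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE) == 0)
            hinted++;
#endif
        close(fd);
    }
    return hinted;
}
```

Because nothing here waits or guarantees durability, each file would 
still need its regular fsync afterwards.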

One idea I was thinking about here was building a little hash table 
inside of the fsync absorb code, tracking how many absorb operations 
have happened for whatever the most popular relation files are.  The 
idea is that we might say "use sync_file_range every time <N> calls for 
a relation have come in", just to keep from ever accumulating too many 
writes to any one file before trying to nudge some of it out of there. 
The bat that keeps hitting me in the head here is that right now, a 
single fsync might have a full 1GB of writes to flush out, perhaps 
because it extended a table and then wrote more than that to it.  And in 
everything but an SSD or giant SAN cache situation, 1GB of I/O is just 
too much to fsync at a time without the OS choking a little on it.
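
The counting idea could be sketched roughly as follows.  Everything in 
it is hypothetical: AbsorbTable, absorb_note_request, the slot count, 
and the threshold NUDGE_EVERY_N (standing in for <N>) are illustrative 
names and values, not part of the existing fsync absorb code.  For 
simplicity it tracks the first files seen until the table fills, rather 
than the most popular ones, and only reports when a nudge is due; the 
caller would then issue sync_file_range on that file.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ABSORB_SLOTS   64       /* hypothetical table size */
#define NUDGE_EVERY_N  4        /* hypothetical value for <N> */

typedef struct
{
    uint32_t file_id;           /* stand-in for a relation file identifier */
    uint32_t count;             /* absorbed requests since the last nudge */
    int      used;
} AbsorbSlot;

typedef struct
{
    AbsorbSlot slots[ABSORB_SLOTS];
} AbsorbTable;

void
absorb_table_init(AbsorbTable *t)
{
    assert(t != NULL);
    memset(t, 0, sizeof(*t));
}

/*
 * Record one absorbed fsync request for file_id.  Returns 1 when the file
 * has reached NUDGE_EVERY_N requests since its last nudge (and resets the
 * counter), 0 otherwise.
 */
int
absorb_note_request(AbsorbTable *t, uint32_t file_id)
{
    uint32_t start = (file_id * 2654435761u) % ABSORB_SLOTS;

    for (uint32_t i = 0; i < ABSORB_SLOTS; i++)
    {
        /* open addressing with linear probing */
        AbsorbSlot *s = &t->slots[(start + i) % ABSORB_SLOTS];

        if (!s->used)
        {
            s->used = 1;
            s->file_id = file_id;
            s->count = 0;
        }
        if (s->file_id == file_id)
        {
            if (++s->count >= NUDGE_EVERY_N)
            {
                s->count = 0;
                return 1;
            }
            return 0;
        }
    }
    return 0;                   /* table full: this file is not tracked */
}
```

A fixed-size table keeps the bookkeeping cheap; a file that does not fit 
simply never gets an early nudge and is left to the normal fsync.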

> I don't know that I have a suitable place to test this, and I'm not
> quite sure what a good test setup would look like either, so while
> I've tested that this appears to issue the right kernel calls, I am
> not sure whether it actually fixes the problem case.

I'll put this into my testing queue after the upcoming CF starts.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com

