Thread: sync_file_range()

sync_file_range()

From
Christopher Kings-Lynne
Date:
http://lwn.net/Articles/178199/

Check out the article on sync_file_range():

----
long sync_file_range(int fd, loff_t offset, loff_t nbytes, int flags);

This call will synchronize a file's data to disk, starting at the given 
offset and proceeding for nbytes bytes (or to the end of the file if 
nbytes is zero). How the synchronization is done is controlled by flags:
    * SYNC_FILE_RANGE_WAIT_BEFORE blocks the calling process until any 
already in-progress writeout of pages (in the given range) completes.
    * SYNC_FILE_RANGE_WRITE starts writeout of any dirty pages in the 
given range which are not already under I/O.
    * SYNC_FILE_RANGE_WAIT_AFTER blocks the calling process until the 
newly-initiated writes complete.

An application which wants to initiate writeback of all dirty pages 
should provide the first two flags. Providing all three flags guarantees 
that those pages are actually on disk when the call returns.
----
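For illustration, here is my own sketch (not from the article, and assuming
a libc that exposes the call) of requesting a fully synchronous flush of the
first megabyte of a file:

#define _GNU_SOURCE
#include <fcntl.h>

/* Start writeout of the first 1MB of fd and wait until it is on disk. */
int
flush_first_mb(int fd)
{
    return sync_file_range(fd, 0, 1 << 20,
                           SYNC_FILE_RANGE_WAIT_BEFORE |
                           SYNC_FILE_RANGE_WRITE |
                           SYNC_FILE_RANGE_WAIT_AFTER);
}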

Is that at all useful for PostgreSQL's purposes?

Chris




Re: sync_file_range()

From
ITAGAKI Takahiro
Date:
Christopher Kings-Lynne <chris.kings-lynne@calorieking.com> wrote:

> http://lwn.net/Articles/178199/
> Check out the article on sync_file_range():

> Is that at all useful for PostgreSQL's purposes?

I'm interested in it; we could use it to improve responsiveness during
checkpoints. Though it is a Linux-specific system call, we could use the
combination of mmap() and msync() instead; I mean, we would use mmap()
only to flush dirty pages, not to read or write them.

---
ITAGAKI Takahiro
NTT Open Source Software Center




Re: sync_file_range()

From
"Qingqing Zhou"
Date:
"ITAGAKI Takahiro" <itagaki.takahiro@oss.ntt.co.jp> wrote
>
>
> I'm interested in it; we could use it to improve responsiveness during
> checkpoints. Though it is a Linux-specific system call, we could use the
> combination of mmap() and msync() instead; I mean, we would use mmap()
> only to flush dirty pages, not to read or write them.
>

Can you give more details? As the TODO item indicates, if we mmap a data
file, a serious problem is that we don't know when the data pages hit the
disk -- so we may violate the WAL rule.

Regards,
Qingqing




Re: sync_file_range()

From
ITAGAKI Takahiro
Date:
"Qingqing Zhou" <zhouqq@cs.toronto.edu> wrote:

> > I'm interested in it; we could use it to improve responsiveness during
> > checkpoints. Though it is a Linux-specific system call, we could use the
> > combination of mmap() and msync() instead; I mean, we would use mmap()
> > only to flush dirty pages, not to read or write them.
> 
> Can you give more details? As the TODO item indicates, if we mmap a data
> file, a serious problem is that we don't know when the data pages hit the
> disk -- so we may violate the WAL rule.

I'm thinking about fuzzy checkpoints, where we write and flush buffers
only as much as we need to. sync_file_range() would help us control the
flushing of buffers at a finer granularity. We could stretch out a
checkpoint to avoid overloading the storage in a burst, using
sync_file_range() and a cost-based delay, as vacuum does.
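For example (just an illustration; the chunk size and the delay are
made-up numbers), a checkpoint could trickle a file out piecewise:

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Illustrative sketch: flush a file in 1MB chunks, sleeping between
 * chunks so the checkpoint does not overload the storage at a burst. */
void
trickle_flush(int fd, off_t filesize)
{
    off_t off;
    off_t chunk = 1 << 20;      /* made-up chunk size */

    for (off = 0; off < filesize; off += chunk)
    {
        sync_file_range(fd, off, chunk,
                        SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER);
        usleep(10000);          /* stand-in for a cost-based delay */
    }
}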

I did not mean that we would modify buffers through mmap; I only meant
something like the following pseudo-code. (I don't know whether it
actually works...)

#include <sys/mman.h>

int
my_sync_file_range(int fd, off_t offset, size_t nbytes)
{
    /* Map the range shared and read-only, just so msync() can see it. */
    void *p = mmap(NULL, nbytes, PROT_READ, MAP_SHARED, fd, offset);

    if (p == MAP_FAILED)
        return -1;
    msync(p, nbytes, MS_ASYNC);    /* start writeback without waiting */
    return munmap(p, nbytes);
}


Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center




Re: sync_file_range()

From
Simon Riggs
Date:
On Mon, 2006-06-19 at 15:32 +0800, Qingqing Zhou wrote:
> "ITAGAKI Takahiro" <itagaki.takahiro@oss.ntt.co.jp> wrote
> >
> >
> > I'm interested in it; we could use it to improve responsiveness during
> > checkpoints. Though it is a Linux-specific system call, we could use the
> > combination of mmap() and msync() instead; I mean, we would use mmap()
> > only to flush dirty pages, not to read or write them.
> >
> 
> Can you give more details? As the TODO item indicates, if we mmap a data
> file, a serious problem is that we don't know when the data pages hit the
> disk -- so we may violate the WAL rule.

Can't see where we'd use it.

We fsync the xlog at transaction commit, so only the leading edge needs
to be synced - would the call help there? Presumably the OS can already
locate all blocks associated with a particular file fairly quickly
without doing a full cache scan.
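If it would, I imagine the usage would be something like this (hypothetical
offsets, just to make the question concrete):

#define _GNU_SOURCE
#include <fcntl.h>

/* Hypothetical: sync only the newly written tail of the xlog segment
 * instead of fsyncing the whole file. */
int
sync_xlog_tail(int fd, off_t flushed_upto, off_t write_upto)
{
    return sync_file_range(fd, flushed_upto, write_upto - flushed_upto,
                           SYNC_FILE_RANGE_WAIT_BEFORE |
                           SYNC_FILE_RANGE_WRITE |
                           SYNC_FILE_RANGE_WAIT_AFTER);
}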

Other files are fsynced at checkpoint - always all dirty blocks in the
whole file.

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com



Re: sync_file_range()

From
Florian Weimer
Date:
* Simon Riggs:

> Other files are fsynced at checkpoint - always all dirty blocks in the
> whole file.

Depending on the flags, sync_file_range does not block the calling
process, so it's very easy to start flushing all files at once, which
could in theory reduce seeking overhead.


Re: sync_file_range()

From
Greg Stark
Date:
Simon Riggs <simon@2ndquadrant.com> writes:

> On Mon, 2006-06-19 at 15:32 +0800, Qingqing Zhou wrote:
> > "ITAGAKI Takahiro" <itagaki.takahiro@oss.ntt.co.jp> wrote
> > >
> > >
> > > I'm interested in it; we could use it to improve responsiveness during
> > > checkpoints. Though it is a Linux-specific system call, we could use the
> > > combination of mmap() and msync() instead; I mean, we would use mmap()
> > > only to flush dirty pages, not to read or write them.
> > >
> > 
> > Can you give more details? As the TODO item indicates, if we mmap a data
> > file, a serious problem is that we don't know when the data pages hit the
> > disk -- so we may violate the WAL rule.
> 
> Can't see where we'd use it.
> 
> We fsync the xlog at transaction commit, so only the leading edge needs
> to be synced - would the call help there? Presumably the OS can already
> locate all blocks associated with a particular file fairly quickly
> without doing a full cache scan.

Well, in theory the transaction being committed isn't necessarily the
"leading edge"; there could be more work from other transactions since the
last work this transaction actually did. However, I can't see that actually
helping performance much, if at all. There can't be much extra data, and
when writing it, the amount doesn't really matter -- what really matters is
rotational and seek latency anyway.


> Other files are fsynced at checkpoint - always all dirty blocks in the
> whole file.

Well, couldn't it be useful for checkpoints if there were some way to know
which buffers had been touched since the last checkpoint? There could be a
lot of buffers dirtied since the checkpoint began, and those don't really
need to be synced, do they?

Or it could be used to control the rate at which the files are checkpointed.

Come to think of it I wonder whether there's anything to be gained by using
smaller files for tables. Instead of 1G files maybe 256M files or something
like that to reduce the hit of fsyncing a file.

-- 
greg



Re: sync_file_range()

From
Simon Riggs
Date:
On Mon, 2006-06-19 at 15:04 -0400, Greg Stark wrote:

> > We fsync the xlog at transaction commit, so only the leading edge needs
> > to be synced - would the call help there? Presumably the OS can already
> > locate all blocks associated with a particular file fairly quickly
> > without doing a full cache scan.
> 
> Well, in theory the transaction being committed isn't necessarily the
> "leading edge"; there could be more work from other transactions since the
> last work this transaction actually did.

Near enough.

> > Other files are fsynced at checkpoint - always all dirty blocks in the
> > whole file.
> 
> Well, couldn't it be useful for checkpoints if there were some way to know
> which buffers had been touched since the last checkpoint? There could be a
> lot of buffers dirtied since the checkpoint began, and those don't really
> need to be synced, do they?

Qingqing had a proposal for something like that, but it seemed not worth
it after analysis.

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com



Re: sync_file_range()

From
Tom Lane
Date:
Greg Stark <gsstark@mit.edu> writes:
> Come to think of it I wonder whether there's anything to be gained by using
> smaller files for tables. Instead of 1G files maybe 256M files or something
> like that to reduce the hit of fsyncing a file.

Actually probably not.  The weak part of our current approach is that we
tell the kernel "sync this file", then "sync that file", etc, in a more
or less random order.  This leads to a probably non-optimal sequence of
disk accesses to complete a checkpoint.  What we would really like is a
way to tell the kernel "sync all these files, and let me know when
you're done" --- then the kernel and hardware have some shot at
scheduling all the writes in an intelligent fashion.

sync_file_range() is not that exactly, but since it lets you request
syncing and then go back and wait for the syncs later, we could get the
desired effect with two passes over the file list.  (If the file list
is longer than our allowed number of open files, though, the extra
opens/closes could hurt.)
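In rough C, the two-pass idea would look like this (a sketch only; real
code would have to work within the open-file limit mentioned above):

#define _GNU_SOURCE
#include <fcntl.h>

/* Sketch of a two-pass checkpoint sync: pass 1 starts writeout on all
 * files at once, giving the kernel and hardware a shot at scheduling
 * the writes intelligently; pass 2 waits for the writeout to finish.
 * nbytes = 0 means "to the end of the file". */
void
checkpoint_sync_files(int *fds, int nfiles)
{
    int i;

    for (i = 0; i < nfiles; i++)
        sync_file_range(fds[i], 0, 0, SYNC_FILE_RANGE_WRITE);

    for (i = 0; i < nfiles; i++)
        sync_file_range(fds[i], 0, 0,
                        SYNC_FILE_RANGE_WAIT_BEFORE |
                        SYNC_FILE_RANGE_WAIT_AFTER);
}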

Smaller files would make the I/O scheduling problem worse not better.
Indeed, I've been wondering lately if we shouldn't resurrect
LET_OS_MANAGE_FILESIZE and make that the default on systems with
largefile support.  If nothing else it would cut down on open/close
overhead on very large relations.
        regards, tom lane


Re: sync_file_range()

From
Simon Riggs
Date:
On Mon, 2006-06-19 at 21:35 -0400, Tom Lane wrote:
> Greg Stark <gsstark@mit.edu> writes:
> > Come to think of it I wonder whether there's anything to be gained by using
> > smaller files for tables. Instead of 1G files maybe 256M files or something
> > like that to reduce the hit of fsyncing a file.

> sync_file_range() is not that exactly, but since it lets you request
> syncing and then go back and wait for the syncs later, we could get the
> desired effect with two passes over the file list.  (If the file list
> is longer than our allowed number of open files, though, the extra
> opens/closes could hurt.)

So we would use the async properties of sync, but not the file range
support? Sounds like it could help with multiple filesystems.

> Indeed, I've been wondering lately if we shouldn't resurrect
> LET_OS_MANAGE_FILESIZE and make that the default on systems with
> largefile support.  If nothing else it would cut down on open/close
> overhead on very large relations.

Agreed.

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com



Re: sync_file_range()

From
"Zeugswetter Andreas DCP SD"
Date:
> > Indeed, I've been wondering lately if we shouldn't resurrect
> > LET_OS_MANAGE_FILESIZE and make that the default on systems with
> > largefile support.  If nothing else it would cut down on open/close
> > overhead on very large relations.

I'd still put some limit on the file size, else you cannot manually
distribute a table across spindles anymore. Also, some backup solutions
are not too happy with very large files either (they have trouble
with staging the backup). I would suggest something like 32GB.

Andreas


Re: sync_file_range()

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> So we would use the async properties of sync, but not the file range
> support?

That's the part of it that looked potentially useful to me, anyway.
I don't see any value for us in syncing just part of a file, because
we don't have enough disk layout knowledge to make intelligent choices
of what to sync.  I think the OP had some idea of having the bgwriter
write and then force-sync individual pages, but what good is that?
Once we've done the write() the page is exposed to the kernel's write
scheduler and should be written at an intelligent time.  Trying to
force sync in advance of our own real need for it to be synced (ie
the next checkpoint) doesn't seem to me to offer any benefit.
        regards, tom lane


Re: sync_file_range()

From
Tom Lane
Date:
"Zeugswetter Andreas DCP SD" <ZeugswetterA@spardat.at> writes:
> Indeed, I've been wondering lately if we shouldn't resurrect 
> LET_OS_MANAGE_FILESIZE and make that the default on systems with 
> largefile support.  If nothing else it would cut down on open/close 
> overhead on very large relations.

> I'd still put some limit on the file size, else you cannot manually
> distribute a table across spindles anymore. Also, some backup solutions
> are not too happy with very large files either (they have trouble
> with staging the backup). I would suggest something like 32GB.

Well, some people would find those arguments compelling and some
wouldn't.  We already have a manually configurable RELSEG_SIZE,
so people who want a 32GB or whatever segment size can have it.
But if you're dealing with terabyte-sized tables that's still a lot
of segments.

What I'd be inclined to do is allow people to set RELSEG_SIZE = 0
in pg_config_manual.h to select the unsegmented option.  That way
we already have the infrastructure in pg_control etc to ensure that
the database layout matches the backend.
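That is, something like this in pg_config_manual.h (hypothetical; today
RELSEG_SIZE must be positive):

/* Maximum number of blocks in a relation segment file;
 * hypothetically, 0 would select the unsegmented option. */
#define RELSEG_SIZE 0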
        regards, tom lane


Re: sync_file_range()

From
"Zeugswetter Andreas DCP SD"
Date:
"Tom Lane" <tgl@sss.pgh.pa.us> writes:
> > Indeed, I've been wondering lately if we shouldn't resurrect
> > LET_OS_MANAGE_FILESIZE and make that the default on systems with
> > largefile support.  If nothing else it would cut down on open/close
> > overhead on very large relations.
>
> > I'd still put some limit on the filesize, else you cannot manually
> > distribute a table across spindles anymore. Also some
> backup solutions
> > are not too happy with too large files eighter (they have
> trouble with
> > staging the backup). I would suggest something like 32 Gb.
>
> Well, some people would find those arguments compelling and
> some wouldn't.  We already have a manually configurable
> RELSEG_SIZE, so people who want a 32Gb or whatever segment
> size can have it.
> But if you're dealing with terabyte-sized tables that's still
> a lot of segments.
>
> What I'd be inclined to do is allow people to set RELSEG_SIZE
> = 0 in pg_config_manual.h to select the unsegmented option.
> That way we already have the infrastructure in pg_control etc
> to ensure that the database layout matches the backend.

That sounds perfect. Still leaves the question of what to default to?

Another issue is that we would probably need to detect large-file support
in the underlying filesystem, else we might fail at runtime :-(

Andreas