Re: sync_file_range() - Mailing list pgsql-hackers

From: Tom Lane
Subject: Re: sync_file_range()
Msg-id: 23179.1150767330@sss.pgh.pa.us
In response to: Re: sync_file_range() (Greg Stark <gsstark@mit.edu>)
Responses: Re: sync_file_range() (Simon Riggs <simon@2ndquadrant.com>)
List: pgsql-hackers
Greg Stark <gsstark@mit.edu> writes:
> Come to think of it I wonder whether there's anything to be gained by using
> smaller files for tables. Instead of 1G files maybe 256M files or something
> like that to reduce the hit of fsyncing a file.

Actually, probably not.  The weak part of our current approach is that we
tell the kernel "sync this file", then "sync that file", etc., in a more
or less random order.  This leads to a probably non-optimal sequence of
disk accesses to complete a checkpoint.  What we would really like is a
way to tell the kernel "sync all these files, and let me know when
you're done" --- then the kernel and hardware have some shot at
scheduling all the writes in an intelligent fashion.
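
For illustration, a minimal sketch (not actual PostgreSQL source; the
function and its arguments are invented for the example) of the
one-file-at-a-time pattern just described, where each fsync() must
finish before the next file's writeback even starts:

    #include <fcntl.h>
    #include <unistd.h>

    /* Sync each segment file in turn.  The kernel never sees more than
     * one file's worth of dirty pages at a time, so it has no chance
     * to reorder or merge I/O across files.  Error handling elided. */
    static void
    checkpoint_sync_naive(const char **paths, int npaths)
    {
        for (int i = 0; i < npaths; i++)
        {
            int fd = open(paths[i], O_RDONLY);

            if (fd < 0)
                continue;
            fsync(fd);          /* blocks until this one file is on disk */
            close(fd);
        }
    }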

sync_file_range() is not that exactly, but since it lets you request
syncing and then go back and wait for the syncs later, we could get the
desired effect with two passes over the file list.  (If the file list
is longer than our allowed number of open files, though, the extra
opens/closes could hurt.)
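
As a rough sketch of that two-pass idea, assuming Linux's
sync_file_range() as documented in its man page (offset = 0, nbytes = 0
means "the whole file"; the surrounding function is hypothetical):

    #define _GNU_SOURCE
    #include <fcntl.h>

    static void
    checkpoint_sync_two_pass(int *fds, int nfds)
    {
        /* Pass 1: start asynchronous writeback on every file without
         * waiting, so the kernel ends up holding the whole dirty
         * working set and can schedule the writes intelligently. */
        for (int i = 0; i < nfds; i++)
            sync_file_range(fds[i], 0, 0, SYNC_FILE_RANGE_WRITE);

        /* Pass 2: go back and wait.  WAIT_BEFORE | WRITE | WAIT_AFTER
         * also picks up pages dirtied since pass 1.  Unlike fsync(),
         * this flushes neither file metadata nor the drive's cache. */
        for (int i = 0; i < nfds; i++)
            sync_file_range(fds[i], 0, 0,
                            SYNC_FILE_RANGE_WAIT_BEFORE |
                            SYNC_FILE_RANGE_WRITE |
                            SYNC_FILE_RANGE_WAIT_AFTER);
    }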

Smaller files would make the I/O scheduling problem worse, not better.
Indeed, I've been wondering lately if we shouldn't resurrect
LET_OS_MANAGE_FILESIZE and make that the default on systems with
largefile support.  If nothing else it would cut down on open/close
overhead on very large relations.
        regards, tom lane

