On Fri, 2012-07-13 at 15:21 -0400, Tom Lane wrote:
> No, that's incorrect: the fadvise is there, inside pg_flush_data() which
> is done during the writing phase.

Yeah, Andres pointed that out as well.

> It seems to me that if we think
> sync_file_range is a win, we ought to be using it in pg_flush_data just
> as much as in initdb.

It was pretty clearly a win for initdb, but I wasn't convinced it was a
good idea for other use cases.

The mechanism is outlined in the email you linked below. To paraphrase,
fadvise tries to put the data in the I/O scheduler queue (still in the
kernel, before going to the device), and gives up if there is no room;
sync_file_range waits for room if none is available.
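
To make the comparison concrete, here is a rough C sketch of the two
hints side by side. The helper names (hint_with_fadvise,
hint_with_sync_file_range) are made up for illustration and are not
PostgreSQL functions; the blocking behavior noted in the comments is the
paraphrase above, not something the code itself demonstrates.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* exposes sync_file_range() on glibc */
#endif
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: hint that the file's dirty pages should be written.
 * Per the discussion above, this is the variant that gives up if the
 * queue has no room.  posix_fadvise() returns an error number directly
 * (0 on success) instead of setting errno.
 */
static int
hint_with_fadvise(int fd)
{
    /* offset 0 with length 0 means "through the end of the file" */
    return posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
}

/*
 * Hypothetical helper: ask the kernel to start writeback of dirty pages.
 * Per the discussion above, this is the variant that waits for room.
 * Linux-only; returns 0 on success or an errno value on failure.
 */
static int
hint_with_sync_file_range(int fd)
{
#ifdef SYNC_FILE_RANGE_WRITE
    /* offset 0 with nbytes 0 means "through the end of the file" */
    if (sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE) != 0)
        return errno;
    return 0;
#else
    (void) fd;
    return ENOSYS;              /* not available on this platform */
#endif
}
```

Neither call is a durability guarantee on its own; both are only hints
issued ahead of the eventual fsync().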

For the case of initdb, the amount of data needing to be fsync'd is
effectively constant, and it's spread over a lot of small files. If the
requests don't make it to the I/O scheduler queue before fsync, the
kernel doesn't have an opportunity to schedule them properly.
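
A minimal sketch of the "hint everything first, fsync afterwards"
pattern this implies for many small files. All function names here
(start_writeback, fsync_path, sync_files_two_pass) are hypothetical, and
this is an illustration under those assumptions rather than the actual
initdb code.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* exposes sync_file_range() on glibc */
#endif
#include <fcntl.h>
#include <unistd.h>

/* Pass 1 (hypothetical helper): ask the kernel to start writing one file. */
static int
start_writeback(const char *path)
{
    int fd = open(path, O_RDONLY);
    int rc;

    if (fd < 0)
        return -1;
#ifdef SYNC_FILE_RANGE_WRITE
    rc = sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
#else
    /* assumed fallback where sync_file_range is unavailable */
    rc = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
#endif
    close(fd);
    return rc;
}

/* Pass 2 (hypothetical helper): make one file durable. */
static int
fsync_path(const char *path)
{
    int fd = open(path, O_RDWR);
    int rc;

    if (fd < 0)
        return -1;
    rc = fsync(fd);
    close(fd);
    return rc;
}

/*
 * Issue writeback hints for every file, then fsync every file.
 * Returns the number of files that could not be fsync'd.
 */
static int
sync_files_two_pass(const char *const *paths, int npaths)
{
    int failures = 0;
    int i;

    for (i = 0; i < npaths; i++)
        (void) start_writeback(paths[i]);   /* only a hint; errors ignored */

    for (i = 0; i < npaths; i++)
    {
        if (fsync_path(paths[i]) != 0)
            failures++;
    }
    return failures;
}
```

The point of the two loops is that by the time the first fsync() runs,
the kernel has already been told about all the files, so it can order
the writes rather than handling one small file at a time.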

But for larger amounts of data copying (like ALTER DATABASE SET
TABLESPACE), it seemed like there was more risk that sync_file_range
would starve out other processes by continuously filling up the I/O
scheduler queue (I'm not sure whether there are protections against
that). Also, if the files are larger, a single fsync represents more
data, and the kernel may be able to schedule it well enough anyway.

I'm not an authority in this area though, so if you are comfortable
extending sync_file_range to copydir() as well, that's fine with me.

> However, I'm a bit confused because in
> http://archives.postgresql.org/pgsql-hackers/2012-03/msg01098.php
> you said
>
> > So, it looks like fadvise is the "right" thing to do, but I expect we'll
>
> Was that backwards? If not, why are we bothering with taking any
> portability risks by adding use of sync_file_range?

That part of the email was less than conclusive, and I can't remember
exactly what point I was trying to make. sync_file_range is a big
practical win for the reasons I mentioned above (and it also avoids some
unpleasant noises coming from the disk drive).

Regards,
	Jeff Davis