On 2015-08-11 17:15:22 +0200, Fabien COELHO wrote:
> +void
> +PerformFileFlush(FileFlushContext * context)
> +{
> +    if (context->ncalls != 0)
> +    {
> +        int rc;
> +
> +#if defined(HAVE_SYNC_FILE_RANGE)
> +
> +        /* Linux: tell the memory manager to move these blocks to io so
> +         * that they are considered for being actually written to disk.
> +         */
> +        rc = sync_file_range(context->fd, context->offset, context->nbytes,
> +                             SYNC_FILE_RANGE_WRITE);
> +
> +#elif defined(HAVE_POSIX_FADVISE)
> +
> +        /* Others: say that data should not be kept in memory...
> +         * This is not exactly what we want to say, because we want to write
> +         * the data for durability but we may need it later nevertheless.
> +         * It seems that Linux would free the memory *if* the data has
> +         * already been written to disk, else the "dontneed" call is ignored.
> +         * For FreeBSD this may have the desired effect of moving the
> +         * data to the io layer, although the system does not seem to
> +         * take into account the provided offset & size, so it is rather
> +         * rough...
> +         */
> +        rc = posix_fadvise(context->fd, context->offset, context->nbytes,
> +                           POSIX_FADV_DONTNEED);
> +
> +#endif
> +
> +        if (rc < 0)
> +            ereport(ERROR,
> +                    (errcode_for_file_access(),
> +                     errmsg("could not flush block " INT64_FORMAT
> +                            " on " INT64_FORMAT " blocks in file \"%s\": %m",
> +                            context->offset / BLCKSZ,
> +                            context->nbytes / BLCKSZ,
> +                            context->filename)));
> +    }

I'm a bit wary that this might cause significant regressions on
platforms that don't support sync_file_range() but do support
posix_fadvise(), for workloads that are bigger than shared_buffers.
Consider what happens if the workload does *not* fit into shared_buffers
but *does* fit into the OS's buffer cache: POSIX_FADV_DONTNEED may evict
the just-written pages from the OS cache, so suddenly reads will go to
disk again, no?
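
To make that concern checkable, here is a rough sketch of how one could
look at page cache residency before and after the DONTNEED call. It is
not part of the patch: fadvise_demo(), resident_pages() and DEMO_BLCKSZ
are names made up for illustration, and it relies on mincore(), so it
assumes Linux. What it reports will depend on the OS and the filesystem.

```c
#define _DEFAULT_SOURCE         /* exposes mincore() on glibc */
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define DEMO_BLCKSZ 8192        /* hypothetical; mirrors the default BLCKSZ */

/*
 * Count the pages of the first len bytes of fd that are currently in the
 * OS page cache.  Returns -1 on error.  Linux-specific (mincore()).
 */
long
resident_pages(int fd, size_t len)
{
    long        pagesz = sysconf(_SC_PAGESIZE);
    size_t      npages;
    size_t      i;
    unsigned char *vec;
    void       *map;
    long        count = 0;

    if (pagesz <= 0 || len == 0)
        return -1;
    npages = (len + (size_t) pagesz - 1) / (size_t) pagesz;

    /* Mapping the file does not fault pages in, so it does not skew counts. */
    map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED)
        return -1;

    vec = malloc(npages);
    if (vec == NULL || mincore(map, len, vec) != 0)
        count = -1;
    else
    {
        for (i = 0; i < npages; i++)
            if (vec[i] & 1)     /* low bit set: page is resident */
                count++;
    }

    free(vec);
    munmap(map, len);
    return count;
}

/*
 * Write nblocks blocks, fsync them so they are clean, then issue
 * POSIX_FADV_DONTNEED and record residency before and after.
 * Returns 0 on success, -1 on any failure.  The file is removed afterwards.
 */
int
fadvise_demo(const char *path, size_t nblocks, long *before, long *after)
{
    char        block[DEMO_BLCKSZ];
    size_t      nbytes = nblocks * DEMO_BLCKSZ;
    size_t      i;
    int         fd;
    int         rc = -1;

    assert(before != NULL && after != NULL);

    fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    memset(block, 'x', sizeof(block));
    for (i = 0; i < nblocks; i++)
        if (write(fd, block, sizeof(block)) != (ssize_t) sizeof(block))
            goto done;

    if (fsync(fd) != 0)
        goto done;
    *before = resident_pages(fd, nbytes);

    /* posix_fadvise() returns an error number directly, not -1 plus errno. */
    if (posix_fadvise(fd, 0, (off_t) nbytes, POSIX_FADV_DONTNEED) != 0)
        goto done;
    *after = resident_pages(fd, nbytes);

    rc = (*before >= 0 && *after >= 0) ? 0 : -1;

done:
    close(fd);
    unlink(path);
    return rc;
}
```

Calling fadvise_demo() on a scratch file and comparing *before with
*after shows whether the just-written blocks were dropped from the OS
cache. If *after comes back much smaller, a later read of those blocks
has to hit the disk, which is the regression I'm worried about.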

Greetings,

Andres Freund