Re: checkpointer continuous flushing - Mailing list pgsql-hackers

From Andres Freund
Subject Re: checkpointer continuous flushing
Date
Msg-id 20150817114138.GG3522@awork2.anarazel.de
In response to Re: checkpointer continuous flushing  (Fabien COELHO <coelho@cri.ensmp.fr>)
List pgsql-hackers
On 2015-08-11 17:15:22 +0200, Fabien COELHO wrote:
> +void
> +PerformFileFlush(FileFlushContext * context)
> +{
> +    if (context->ncalls != 0)
> +    {
> +        int rc;
> +
> +#if defined(HAVE_SYNC_FILE_RANGE)
> +
> +        /* Linux: tell the memory manager to move these blocks to io so
> +         * that they are considered for being actually written to disk.
> +         */
> +        rc = sync_file_range(context->fd, context->offset, context->nbytes,
> +                             SYNC_FILE_RANGE_WRITE);
> +
> +#elif defined(HAVE_POSIX_FADVISE)
> +
> +        /* Others: say that data should not be kept in memory...
> +         * This is not exactly what we want to say, because we want to write
> +         * the data for durability but we may need it later nevertheless.
> +         * It seems that Linux would free the memory *if* the data has
> +         * already been written to disk, else the "dontneed" call is ignored.
> +         * For FreeBSD this may have the desired effect of moving the
> +         * data to the io layer, although the system does not seem to
> +         * take into account the provided offset & size, so it is rather
> +         * rough...
> +         */
> +        rc = posix_fadvise(context->fd, context->offset, context->nbytes,
> +                           POSIX_FADV_DONTNEED);
> +
> +#endif
> +
> +        if (rc < 0)
> +            ereport(ERROR,
> +                    (errcode_for_file_access(),
> +                     errmsg("could not flush block " INT64_FORMAT
> +                            " on " INT64_FORMAT " blocks in file \"%s\": %m",
> +                            context->offset / BLCKSZ,
> +                            context->nbytes / BLCKSZ,
> +                            context->filename)));
> +    }

I'm a bit wary that this might cause significant regressions on
platforms that don't support sync_file_range() but do support
posix_fadvise(), for workloads that are bigger than shared_buffers.
Consider what happens if the workload does *not* fit into shared_buffers
but *does* fit into the OS's buffer cache: POSIX_FADV_DONTNEED asks the
kernel to drop the flushed pages from its cache, so pages that used to
be served from the OS cache suddenly have to be read from disk again, no?
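
One way to check this concern on a given platform is sketched below. It is
only a rough illustration, not part of the patch: it assumes Linux with
mincore() available, and the names DEMO_BLCKSZ, resident_pages() and
fadvise_dontneed_demo() are hypothetical. It writes some blocks, fsyncs
them, and counts how many pages of the file are resident in the OS cache
before and after POSIX_FADV_DONTNEED. The advice is only a hint, so how
many pages actually get dropped depends on the kernel and filesystem.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* exposes mincore() and posix_fadvise() on glibc */
#endif
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define DEMO_BLCKSZ 8192        /* hypothetical stand-in for BLCKSZ */

/* Count pages of the first nbytes of fd that are in the OS cache; -1 on error. */
static long
resident_pages(int fd, size_t nbytes)
{
    long        pagesz = sysconf(_SC_PAGESIZE);
    size_t      npages;
    size_t      i;
    long        count = 0;
    unsigned char *vec;
    void       *map;

    assert(nbytes > 0 && pagesz > 0);
    npages = (nbytes + (size_t) pagesz - 1) / (size_t) pagesz;

    /* Mapping the file does not fault pages in; mincore() only inspects them. */
    map = mmap(NULL, nbytes, PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED)
        return -1;
    vec = malloc(npages);
    if (vec == NULL || mincore(map, nbytes, vec) != 0)
        count = -1;
    else
        for (i = 0; i < npages; i++)
            count += vec[i] & 1;    /* low bit set means page is resident */
    free(vec);
    munmap(map, nbytes);
    return count;
}

/*
 * Write nblocks blocks to path, fsync, then issue POSIX_FADV_DONTNEED.
 * Stores the resident page counts before and after the advice.
 * Returns 0 on success, -1 on setup failure, or the error number
 * returned by posix_fadvise().
 */
static int
fadvise_dontneed_demo(const char *path, size_t nblocks, long *before, long *after)
{
    char        block[DEMO_BLCKSZ];
    size_t      nbytes = nblocks * DEMO_BLCKSZ;
    size_t      i;
    int         rc = -1;
    int         fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);

    if (fd < 0)
        return -1;
    memset(block, 'x', sizeof(block));
    for (i = 0; i < nblocks; i++)
        if (write(fd, block, sizeof(block)) != (ssize_t) sizeof(block))
            goto done;
    if (fsync(fd) != 0)         /* make the pages clean so they can be dropped */
        goto done;
    *before = resident_pages(fd, nbytes);
    /* posix_fadvise() returns an error number directly; it does not set errno */
    rc = posix_fadvise(fd, 0, (off_t) nbytes, POSIX_FADV_DONTNEED);
    *after = resident_pages(fd, nbytes);
done:
    close(fd);
    unlink(path);
    return rc;
}
```

Calling fadvise_dontneed_demo() with a scratch path and, say, 128 blocks,
then comparing the two counts, shows whether the kernel evicted the pages.
If the second count is much lower, later reads of those blocks would have
to go to disk.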

Greetings,

Andres Freund


