Re: Linux kernel impact on PostgreSQL performance - Mailing list pgsql-hackers

From	Heikki Linnakangas
Subject	Re: Linux kernel impact on PostgreSQL performance
Date	January 14, 2014 09:11:42
Msg-id	52D4FF40.5040404@vmware.com
In response to	Re: Linux kernel impact on PostgreSQL performance (Mel Gorman <mgorman@suse.de>)
Responses	Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance
List	pgsql-hackers

On 01/14/2014 12:26 AM, Mel Gorman wrote:
> On Mon, Jan 13, 2014 at 03:15:16PM -0500, Robert Haas wrote:
>> The other thing that comes to mind is the kernel's caching behavior.
>> We've talked a lot over the years about the difficulties of getting
>> the kernel to write data out when we want it to and to not write data
>> out when we don't want it to.
>
> Is sync_file_range() broke?
>
>> When it writes data back to disk too
>> aggressively, we get lousy throughput because the same page can get
>> written more than once when caching it for longer would have allowed
>> write-combining.
>
> Do you think that is related to dirty_ratio or dirty_writeback_centisecs?
> If it's dirty_writeback_centisecs then that would be particularly tricky
> because poor interactions there would come down to luck basically.
>
>> When it doesn't write data to disk aggressively
>> enough, we get huge latency spikes at checkpoint time when we call
>> fsync() and the kernel says "uh, what? you wanted that data *on the
>> disk*? sorry boss!" and then proceeds to destroy the world by starving
>> the rest of the system for I/O for many seconds or minutes at a time.
>
> Ok, parts of that are somewhat expected. It *may* depend on the
> underlying filesystem. Some of them handle fsync better than others. If
> you are syncing the whole file though when you call fsync then you are
> potentially burned by having to writeback dirty_ratio amounts of memory
> which could take a substantial amount of time.
>
>> We've made some desultory attempts to use sync_file_range() to improve
>> things here, but I'm not sure that's really the right tool, and if it
>> is we don't know how to use it well enough to obtain consistent
>> positive results.
>
> That implies that either sync_file_range() is broken in some fashion we
> (or at least I) are not aware of and that needs kicking.

Let me try to explain the problem: Checkpoints can cause an I/O spike, 
which slows down other processes.

When it's time to perform a checkpoint, PostgreSQL will write() all 
dirty buffers from the PostgreSQL buffer cache, and finally perform an 
fsync() to flush the writes to disk. After that, we know the data is 
safely on disk.
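
As a rough illustration of that write()-then-fsync() sequence, here is a
minimal Python sketch. It is not PostgreSQL's actual code: the
checkpoint() function, the dirty_pages dict standing in for the buffer
cache, and the single-file layout are all assumptions made for the
example.

```python
import os

PAGE_SIZE = 8192  # PostgreSQL's default block size


def checkpoint(path, dirty_pages):
    """Write every dirty page to its slot in the file, then fsync once.

    dirty_pages is a hypothetical stand-in for the buffer cache: it maps
    a page number to the PAGE_SIZE bytes that belong at that position.
    Returns the number of pages written.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        for page_no, data in sorted(dirty_pages.items()):
            # The write only copies the page into the OS cache; the
            # kernel decides when it actually reaches the disk.
            os.pwrite(fd, data, page_no * PAGE_SIZE)
        # fsync() does not return until the kernel reports the file's
        # dirty pages as flushed. This is the call that can trigger the
        # I/O spike.
        os.fsync(fd)
    finally:
        os.close(fd)
    return len(dirty_pages)
```

Only after the fsync() returns can the checkpoint be considered complete.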

In older PostgreSQL versions, the write() calls would cause an I/O storm 
as the OS cache quickly fills up with dirty pages, up to dirty_ratio, 
and after that all subsequent write()s block. That's OK as far as the 
checkpoint is concerned, but it significantly slows down queries running 
at the same time. Even a read-only query often needs to write(), to 
evict a dirty page from the buffer cache to make room for a different 
page. We made that less painful by adding sleeps between the write() 
calls, so that they are trickled over a long period of time and 
hopefully stay below dirty_ratio at all times. However, we still have to 
perform the fsync()s after the write()s, and sometimes that still causes 
a similar I/O storm.
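
The trickling idea can be sketched roughly as follows, in Python. This
is an illustration only, not how PostgreSQL schedules its writes: the
paced_checkpoint() function, its parameters, and the fixed per-page
delay are assumptions made for the example.

```python
import os
import time


def paced_checkpoint(fd, pages, page_size, duration, sleep=time.sleep):
    """Spread the page writes over roughly `duration` seconds, then fsync.

    pages is a list of (page_no, data) tuples (a hypothetical stand-in
    for the dirty buffers). Returns the delay used after each write.
    """
    if not pages:
        return 0.0
    # Sleep the same amount after every write so that the writes are
    # spread evenly across the time budget.
    delay = duration / len(pages)
    for page_no, data in pages:
        os.pwrite(fd, data, page_no * page_size)
        sleep(delay)
    # The pacing above only throttles the write() calls. The final
    # fsync() still forces out whatever the kernel has not yet flushed.
    os.fsync(fd)
    return delay
```

The sleeps keep the rate of new dirty pages low, but nothing in this
loop controls how much dirty data is still waiting in the OS cache when
fsync() is finally called.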

The checkpointer is not in a hurry. A checkpoint typically has 10-30 
minutes to finish, before it's time to start the next checkpoint, and 
even if it misses that deadline, that's not too serious either. But the 
OS doesn't know that, and we have no way of telling it.

As a quick fix, some sort of a lazy fsync() call would be nice. It would 
behave just like fsync() but it would not change the I/O scheduling at 
all. Instead, it would sleep until all the pages have been flushed to 
disk, at the speed they would have been flushed without the fsync() call.

Another approach would be to give the I/O that the checkpointer process 
initiates a lower priority. This would be slightly preferable, because 
PostgreSQL could then issue the write()s as fast as it can, and have the 
checkpoint finish earlier when there's not much other load. Last I 
looked into this (which was a long time ago), there was no suitable 
priority system for writes, only reads.

- Heikki
