Thread: Are random writes optimized sequentially by Linux kernel?

Are random writes optimized sequentially by Linux kernel?

From
"Dmitry Koterov"
Date:
Hello.

Suppose I perform 1000 RANDOM writes into a file. These writes are saved into the Linux writeback buffer and flushed to the disc asynchronously; that's OK.

The question is: will the physical writes later be performed in order of physical SECTOR position on the disc (minimizing head seeking)? Or does the Linux background writer know nothing about the physical on-disc placement and flush the data in the order it was saved in RAM?

E.g., if I write in the application:

a) block 835
b) block 136
c) block 956
d) block 549
e) block 942

does the Linux background writer flush them, e.g., in the physical order "136 - 549 - 835 - 942 - 956", or not?

Re: Are random writes optimized sequentially by Linux kernel?

From
david@lang.hm
Date:
On Wed, 7 Jan 2009, Dmitry Koterov wrote:

> Hello.
>
> Suppose I perform 1000 RANDOM writes into a file. These writes are saved
> into the Linux writeback buffer and flushed to the disc asynchronously;
> that's OK.
>
> The question is: will the physical writes later be performed in order of
> physical SECTOR position on the disc (minimizing head seeking)? Or does
> the Linux background writer know nothing about the physical on-disc
> placement and flush the data in the order it was saved in RAM?
>
> E.g., if I write in the application:
>
> a) block 835
> b) block 136
> c) block 956
> d) block 549
> e) block 942
>
> does the Linux background writer flush them, e.g., in the physical order
> "136 - 549 - 835 - 942 - 956", or not?

yes, the linux IO scheduler will combine and re-order write requests.

they may end up being done 835-942-956-549-136 if the system thinks the
head happens to be past 549 and moving up when the requests hit the IO
system.
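
To make that concrete, here is a minimal sketch of a single elevator
(SCAN) pass, an illustration rather than the kernel's actual code. The
head position (block 600, moving up) is an assumption chosen to
reproduce the ordering above:

/* elevator.c: a toy one-pass SCAN over the five queued blocks above.
 * The head position (600, moving up) is assumed for illustration. */
#include <stdio.h>
#include <stdlib.h>

static int asc(const void *a, const void *b)  { return *(const int *)a - *(const int *)b; }
static int desc(const void *a, const void *b) { return *(const int *)b - *(const int *)a; }

int main(void)
{
    int queued[] = {835, 136, 956, 549, 942};   /* arrival order */
    int n = sizeof queued / sizeof queued[0];
    int head = 600;                             /* assumed head position, moving up */
    int up[5], down[5];
    int nu = 0, nd = 0, i;

    for (i = 0; i < n; i++)                     /* split around the head */
        if (queued[i] >= head) up[nu++] = queued[i];
        else                   down[nd++] = queued[i];

    qsort(up,   nu, sizeof(int), asc);          /* service the up-sweep first */
    qsort(down, nd, sizeof(int), desc);         /* then sweep back down */

    for (i = 0; i < nu; i++) printf("%d ", up[i]);
    for (i = 0; i < nd; i++) printf("%d ", down[i]);
    printf("\n");                               /* prints: 835 942 956 549 136 */
    return 0;
}

Compiled and run, this prints "835 942 956 549 136": the up-sweep first,
then the stragglers on the way back down.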

David Lang

Re: Are random writes optimized sequentially by Linux kernel?

From
"Dmitry Koterov"
Date:
OK, thank you.

Now a PostgreSQL-related question. If the system reorders writes to minimize seeking, I suppose that in a heavily write-loaded PostgreSQL installation the realtime write statistics from dstat (or iostat) should be close to the maximum possible value reported by bonnie++ (or a simple dd).

So, if, for example, I have in a heavily loaded PostgreSQL installation:
- a 50MB/s write speed limit reported by bonnie++ or dd (on an otherwise idle system),
- a write throughput of only 10MB/s under heavy PostgreSQL load (synchronous_commit is off, a checkpoint is executed every 10 minutes or even more rarely),
- a writeback buffer (according to /proc/meminfo) that is not completely full,
- INSERTs or UPDATEs that sometimes slow down by 10 seconds or more with no clear correlation to checkpoints,

then something is going wrong?

What I am trying to understand is why the system hits a write bottleneck (e.g. 10MB/s) well before it reaches the maximum disk throughput (e.g. 50MB/s). How can that happen if the Linux IO scheduler reorders write operations so that seek time is minimized?

Or, to put the question a better way: in which cases can PostgreSQL stall on an INSERT/UPDATE operation if synchronous_commit is off and there is no locking between transactions? In which cases do these operations lose their (in theory) deterministic timing and slow down by a factor of 100-1000?

Re: Are random writes optimized sequentially by Linux kernel?

From
david@lang.hm
Date:
On Thu, 8 Jan 2009, Dmitry Koterov wrote:

> OK, thank you.
>
> Now a PostgreSQL-related question. If the system reorders writes to
> minimize seeking, I suppose that in a heavily write-loaded PostgreSQL
> installation the realtime write statistics from dstat (or iostat) should
> be close to the maximum possible value reported by bonnie++ (or a simple dd).

this is not the case for a couple of reasons

1. bonnie++ and dd tend to write in one area, so seeks are not as big a
factor as writing across multiple areas

2. postgres doesn't do the simple writes like you described earlier

it does something like

write 123-124-fsync-586-354-257-fsync-123-124-125-fsync

(writes to the WAL journal, syncs it to make sure it's safe, then writes
to the destinations, then syncs, then updates the WAL to record that it's
written....)

the fsync basically tells the system "don't write anything more until these
are done", and it interrupts the nice write pattern.
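
as a minimal sketch of that pattern, reduced to bare syscalls (the file
names and block numbers are made up, and error handling is omitted):

/* wal_pattern.c: the write/fsync ordering described above. */
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define BLKSZ 8192

static void write_block(int fd, off_t blockno)
{
    char buf[BLKSZ];
    memset(buf, 0, sizeof buf);
    pwrite(fd, buf, sizeof buf, blockno * BLKSZ);
}

int main(void)
{
    int wal  = open("wal.log", O_WRONLY | O_CREAT, 0644);
    int data = open("data.db", O_WRONLY | O_CREAT, 0644);

    write_block(wal, 123);          /* WAL records for the transaction */
    write_block(wal, 124);
    fsync(wal);                     /* barrier: WAL must hit disk first */

    write_block(data, 586);         /* now the actual data pages, */
    write_block(data, 354);         /* in whatever order          */
    write_block(data, 257);
    fsync(data);                    /* barrier: data pages flushed */

    write_block(wal, 125);          /* record in the WAL that it's done */
    fsync(wal);

    close(wal);
    close(data);
    return 0;
}

the elevator only ever gets to sort the handful of blocks between two
fsyncs, never the whole stream.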

you can address this by having large battery-backed caches that you write
to and that batch things out to disk more efficiently.

or you can put your WAL on a separate drive so that the syncs on it
don't affect the data drives (but you will still have syncs on the data
disks, just not as many of them)

David Lang

Re: Are random writes optimized sequentially by Linux kernel?

From
Greg Smith
Date:
On Wed, 7 Jan 2009, Dmitry Koterov wrote:

> The question is: will the physical writes later be performed in order of
> physical SECTOR position on the disc (minimizing head seeking)? Or does
> the Linux background writer know nothing about the physical on-disc
> placement and flush the data in the order it was saved in RAM?

The part of Linux that does this is called the elevator algorithm, and
even the simplest I/O scheduler (the no-op one) does a merge+sort to
schedule physical writes.  The classic intro paper on this subject is
http://www.linuxinsight.com/files/ols2004/pratt-reprint.pdf

> What I am trying to understand - why does the system fall to a writing
> bottleneck (e.g. 10MB/s) much before it achieves the maximum disk
> throughput (e.g. 50MB/s). How could it happen if the Linux IO scheduler
> reorders write operations, so time for seeking is minimal?

I think you're underestimating how much impact even a minimal amount of
seeking has.  If the disk head has to move at all beyond a single-track
seek, you won't get anywhere close to the rated sequential speed of the
drive even if elevator sorting is helping out.  And the minute a
checkpoint is involved, with its requisite fsync at the end, all the
blocks related to it are going to be forced out of the write cache
without any chance for merge+sort to lower the average disk I/O
cost--unless you spread that checkpoint write over a long period so
pdflush can trickle the blocks out to disk.
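
A back-of-envelope calculation makes the gap concrete; the figures here
(8 KB pages, ~10 ms per random access, 50 MB/s sequential) are assumed
round numbers, not measurements:

/* seek_math.c: effective throughput of purely random 8K writes on a
 * disk that needs ~10 ms per seek + rotation. Assumed numbers. */
#include <stdio.h>

int main(void)
{
    double page_mb  = 8.0 / 1024.0;  /* one 8 KB page, in MB    */
    double access_s = 0.010;         /* assumed seek + rotation */
    double seq_mbs  = 50.0;          /* rated sequential speed  */

    double random_mbs = page_mb / access_s;
    printf("random 8K writes: ~%.1f MB/s, vs %.0f MB/s sequential (%.0fx)\n",
           random_mbs, seq_mbs, seq_mbs / random_mbs);
    return 0;
}

That works out to well under 1 MB/s for purely random 8K writes, so a
loaded system sustaining 10MB/s of scattered writes means the elevator
is already merging quite aggressively.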

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Are random writes optimized sequentially by Linux kernel?

From
"M. Edward (Ed) Borasky"
Date:
david@lang.hm wrote:
> On Thu, 8 Jan 2009, Dmitry Koterov wrote:
>
>> OK, thank you.
>>
>> Now a PostgreSQL-related question. If the system reorders writes to
>> minimize seeking, I suppose that in a heavily write-loaded PostgreSQL
>> installation the realtime write statistics from dstat (or iostat) should
>> be close to the maximum possible value reported by bonnie++ (or a simple
>> dd).
>
> this is not the case for a couple of reasons
>
> 1. bonnie++ and dd tend to write in one area, so seeks are not as big a
> factor as writing across multiple areas
>
> 2. postgres doesn't do the simple writes like you described earlier
>
> it does something like
>
> write 123-124-fsync-586-354-257-fsync-123-124-125-fsync
>
> (writes to the WAL journal, syncs it to make sure it's safe, then writes
> to the destinations, then syncs, then updates the WAL to record that
> it's written....)
>
> the fsync basically tells the system "don't write anything more until
> these are done", and it interrupts the nice write pattern.
>
> you can address this by having large battery-backed caches that you
> write to and that batch things out to disk more efficiently.
>
> or you can put your WAL on a separate drive so that the syncs on it
> don't affect the data drives (but you will still have syncs on the data
> disks, just not as many of them)
>
> David Lang
>

1. There are four Linux I/O schedulers to choose from in the 2.6 kernel.
If you *aren't* on the 2.6 kernel, give me a shout when you are. :)

2. You can choose the scheduler in use "on the fly". This means you can
set up a benchmark of your *real-world* application and run it four
times, once with each scheduler, *without* having to reboot or any of
that nonsense. That said, you will probably want to introduce some kind
of "page cache poisoning" technique between these runs to force your
benchmark to read every block of data at least once off the hard drive
(see the sketch at the end of this message).

3. As I learned a few weeks ago, even simple 160 GB single SATA drives
now have some kind of scheduling algorithm built in, so your tests may
not show significant differences between the four schedulers. This is
even more the case for high-end SANs. You simply must test with your
real workload, rather than using bonnie++, iozone, or fio, to make an
intelligent scheduler choice.

4. For those who absolutely need fine-grained optimization, there is an
open-source tool called "blktrace" that is essentially a "sniffer for
I/O". It is maintained by Jens Axboe of Oracle, who also maintains the
Linux block I/O layer! There is a "driver" called "seekwatcher", also
open source and maintained by Chris Mason of Oracle, that will give you
visualizations of the "blktrace" results. In any event, if you need to
know, you can find out exactly what the scheduler is doing block by
block with "blktrace".

You can track all of this magic down via Google. If there's enough
interest and I have some free cycles, I'll post an extended "howto" on
doing this. But it only took me a week or so to figure it out from
scratch, and the documentation on "seekwatcher" and "blktrace" is
excellent.
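
As promised under point 2, here is a minimal sketch of checking the
active scheduler and poisoning the page cache between runs. "sda" is an
assumed device name; both steps need root, and /proc/sys/vm/drop_caches
requires kernel 2.6.16 or later:

/* sched_poison.c: show the active I/O scheduler for sda, then drop the
 * page cache so the next benchmark run has to hit the disk again. */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char line[256];
    FILE *f = fopen("/sys/block/sda/queue/scheduler", "r");

    if (f) {
        if (fgets(line, sizeof line, f))
            printf("schedulers (active in brackets): %s", line);
        fclose(f);
    }

    sync();                         /* flush dirty pages first */
    f = fopen("/proc/sys/vm/drop_caches", "w");
    if (f) {
        fputs("1\n", f);            /* 1 = drop page cache only */
        fclose(f);
    }
    return 0;
}

Writing a scheduler name (e.g. "deadline") back into that same /sys file
is how you switch schedulers on the fly.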