Re: Maximum transaction rate - Mailing list pgsql-general

From Marco Colombo
Subject Re: Maximum transaction rate
Date
Msg-id 49C3E010.2000401@esiway.net
In response to Re: Maximum transaction rate  (Ron Mayer <rm_pg@cheapcomplexdevices.com>)
List pgsql-general
Ron Mayer wrote:
> Marco Colombo wrote:
>> Yes, but we knew it already, didn't we? It's always been like
>> that, with IDE disks and write-back cache enabled, fsync just
>> waits for the disk reporting completion and disks lie about
>
> I've looked hard, and I have yet to see a disk that lies.

No, "lie" in the sense they report completion before the data
hit the platters. Of course, that's the expected behaviour with
write-back caches.

> ext3, OTOH seems to lie.

ext3 simply doesn't know: it interfaces with a block device,
which does the caching (at the OS level) and the reordering
(e.g. the elevator algorithm). ext3 doesn't send commands to the
disk directly, nor does it manage the OS cache.

When software raid and device mapper come into play, you have
"virtual" block devices built on top of other block devices.

My home desktop has ext3 on top of a dm device (/dev/mapper/something,
an LV set up by LVM in this case), on top of a raid1 device (/dev/mdX),
on top of /dev/sdaX and /dev/sdbX, which in a way are themselves
block devices built on others, /dev/sda and /dev/sdb (you don't
actually send commands to partitions, do you? although the mapping
"sector offset relative to partition -> real sector on disk" is
trivial).

Each of these layers potentially caches writes and reorders them;
that's the job of a block device, although it makes sense mostly for
the last one, the one that controls the disk. Anyway, there isn't
much ext3 can do other than post write-barrier and flush requests
to the block device at the top of the "stack".

> IDE drives happily report whether they support write barriers
> or not, which you can see with the command:
> %hdparm -I /dev/hdf | grep FLUSH_CACHE_EXT

Of course a write barrier is not a cache flush. A flush is
synchronous, a write barrier asynchronous. The disk supports
flushing, not write barriers. Well, technically, if you can
control the ordering of the requests, that's barriers proper.
With SCSI you can, IIRC. But a cache flush is, well, a flush.

> Linux kernels since 2005 or so check for this feature.  It'll
> happily tell you which of your devices don't support it.
>   %dmesg | grep 'disabling barriers'
>   JBD: barrier-based sync failed on md1 - disabling barriers
> And for devices that do, it will happily send IDE FLUSH CACHE
> commands to IDE drives that support the feature.   At the same
> time Linux kernels started sending the very similar
> SCSI SYNCHRONIZE CACHE commands.

>> Anyway, it's the block device job to control disk caches. A
>> filesystem is just a client to the block device, it posts a
>> flush request, what happens depends on the block device code.
>> The FS doesn't talk to disks directly. And a write barrier is
>> not a flush request, is a "please do not reorder" request.
>> On fsync(), ext3 issues a flush request to the block device,
>> that's all it's expected to do.
>
> But AFAICT ext3 fsync() only tell the block device to
> flush disk caches if the inode was changed.

No, ext3 posts a write barrier request when the inode changes, as
part of committing the journal, and that is not a flush. [*]

> Or, at least empirically if I modify a file and do
> fsync(fd); on ext3 it does not wait until the disk
> spun to where it's supposed to spin.   But if I put
> a couple fchmod()'s right before the fsync() it does.

If you were right and ext3 didn't wait, it would make no difference
on fsync whether the disk cache was enabled or not. My test shows a
50x speedup when turning the disk cache on, so ext3 is certainly
waiting for the block device to report completion. It's the block
device that, on flush, doesn't issue a FLUSH command to the disk.

.TM.

[*] A barrier ends up in a FLUSH for the disk, but it doesn't
mean it's synchronous, like a real flush. Even journal updates done
with barriers don't mean "hit the disk now", they just mean "keep
order" when writing. If you turn off automatic page cache flushing
and if you have zero memory pressure, a write request with a
barrier may stay forever in the OS cache, at least in theory.

Imagine you don't have bdflush and nothing reclaims resources: days
of activity may stay in RAM, as far as write barriers are concerned.
Now someone types 'sync' as root. The block device starts flushing
dirty pages, reordering writes, but honoring barriers, that is,
it reorders anything up to the first barrier, posts write requests
to the disk, issues a FLUSH command then waits until the flush
is completed. Then "consumes" the barrier, and starts processing
writes, reordering them up to the next barrier, and so on.
So yes, a barrier turns into a FLUSH command for the disk. But in
this scenario, days have passed since the original write/barrier request
from the filesystem.

Compare with a fsync(). Even in the above scenario, a fsync() should
end up in a FLUSH command to the disk, and wait for the request to
complete, before awakening the process that issued it. So the filesystem
has to request a flush operation to the block device, not a barrier.
And so it does.

If it turns out that the block device just issues writes but no FLUSH
command to the disk, that's not the FS's fault. And issuing barrier
requests won't change anything.

All this in theory. In practice there may be implementation details
that make things different. I've read that in the Linux kernel, at some
point (maybe even now), only one outstanding write barrier was possible
in the stack of block devices. So I guess that a second write barrier
request triggers a real disk flush. That's why, when you use fchmod()
repeatedly, you see all those flushes. But technically that's a side
effect, and I think on closer analysis you may notice it's always
lagging one request behind, which you won't see just by looking at
numbers or listening to disk noise.

So, multiple journal commits may really help in getting the disk cache
flushed as a side effect, but I think the bug is elsewhere. The day
Linux supports multiple outstanding wb requests, that stops working.

It's the block device that should be fixed, so that it performs
a cache FLUSH when the filesystem asks for a flush. Why it doesn't do
that today is a mystery to me, but I think there must be something
I'm missing.

Anyway, the point here is that using LVM is no less safe than using
an IDE block device directly. There may be filesystems that on fsync
issue not only a flush request but also a journal commit, with
attached barrier requests, thus getting the Right Thing done by
double side effect. And yes, ext3 is NOT among them, unless you
trigger those commits with the fchmod() dance.
