Re: Maximum transaction rate - Mailing list pgsql-general
From: Marco Colombo
Subject: Re: Maximum transaction rate
Msg-id: 49C3E010.2000401@esiway.net
In response to: Re: Maximum transaction rate (Ron Mayer <rm_pg@cheapcomplexdevices.com>)
List: pgsql-general
Ron Mayer wrote:
> Marco Colombo wrote:
>> Yes, but we knew it already, didn't we? It's always been like
>> that, with IDE disks and write-back cache enabled, fsync just
>> waits for the disk reporting completion and disks lie about
>
> I've looked hard, and I have yet to see a disk that lies.

No, "lie" in the sense that they report completion before the data hit the platters. Of course, that's the expected behaviour with write-back caches.

> ext3, OTOH seems to lie.

ext3 simply doesn't know: it interfaces with a block device, which does the caching (at the OS level) and the reordering (e.g. the elevator algorithm). ext3 doesn't send commands to the disk directly, nor does it manage the OS cache.

When software RAID and the device mapper come into play, you have "virtual" block devices built on top of other block devices. My home desktop has ext3 on top of a dm device (/dev/mapper/something, an LV set up by LVM in this case), on top of a raid1 device (/dev/mdX), on top of /dev/sdaX and /dev/sdbX, which, in a way, are themselves block devices built on others, /dev/sda and /dev/sdb (you don't actually send commands to partitions, do you? although the mapping "sector offset relative to partition -> real sector on disk" is trivial). Each of these layers potentially caches writes and reorders them; that's the job of a block device, although it really makes sense only for the last one, the one that controls the disk. Anyway, there isn't much ext3 can do but post write-back and flush requests to the block device at the top of the "stack".

> IDE drives happily report whether they support write barriers
> or not, which you can see with the command:
> %hdparm -I /dev/hdf | grep FLUSH_CACHE_EXT

Of course, a write barrier is not a cache flush. A flush is synchronous, a write barrier asynchronous. The disk supports flushing, not write barriers. Well, technically, if you can control the ordering of the requests, those are barriers proper. With SCSI you can, IIRC. But a cache flush is, well, a flush.
> Linux kernels since 2005 or so check for this feature. It'll
> happily tell you which of your devices don't support it.
> %dmesg | grep 'disabling barriers'
> JBD: barrier-based sync failed on md1 - disabling barriers
> And for devices that do, it will happily send IDE FLUSH CACHE
> commands to IDE drives that support the feature. At the same
> time Linux kernels started sending the very similar SCSI
> SYNCHRONIZE CACHE commands.

>> Anyway, it's the block device job to control disk caches. A
>> filesystem is just a client to the block device, it posts a
>> flush request, what happens depends on the block device code.
>> The FS doesn't talk to disks directly. And a write barrier is
>> not a flush request, is a "please do not reorder" request.
>> On fsync(), ext3 issues a flush request to the block device,
>> that's all it's expected to do.
>
> But AFAICT ext3 fsync() only tell the block device to
> flush disk caches if the inode was changed.

No, ext3 posts a write barrier request when the inode changes and it commits the journal, which is not a flush. [*]

> Or, at least empirically if I modify a file and do
> fsync(fd); on ext3 it does not wait until the disk
> spun to where it's supposed to spin. But if I put
> a couple fchmod()'s right before the fsync() it does.

If you were right, and ext3 didn't wait, it would make no difference on fsync whether the disk cache is enabled or not. My test shows a 50x speedup when turning the disk cache on, so ext3 is certainly waiting for the block device to report completion. It's the block device that, on flush, doesn't issue a FLUSH command to the disk.

.TM.

[*] A barrier ends up in a FLUSH for the disk, but that doesn't mean it's synchronous, like a real flush. Even journal updates done with barriers don't mean "hit the disk now"; they just mean "keep the order" when writing.
If you turn off automatic page cache flushing and you have zero memory pressure, a write request with a barrier may stay in the OS cache forever, at least in theory. Imagine you have no bdflush and nothing reclaims resources: days of activity may stay in RAM, as far as write barriers are concerned.

Now someone types 'sync' as root. The block device starts flushing dirty pages, reordering writes but honoring barriers: it reorders anything up to the first barrier, posts those write requests to the disk, issues a FLUSH command, then waits until the flush is complete. Then it "consumes" the barrier and starts processing writes again, reordering them up to the next barrier, and so on. So yes, a barrier turns into a FLUSH command for the disk. But in this scenario, days have passed since the original write/barrier request from the filesystem.

Compare with an fsync(). Even in the above scenario, an fsync() should end up in a FLUSH command to the disk, and wait for the request to complete, before the process that issued it is awakened. So the filesystem has to request a flush operation from the block device, not a barrier. And so it does. If it turns out that the block device just issues writes but no FLUSH command to the disk, that's not the FS's fault. And issuing barrier requests won't change anything.

All this in theory. In practice there may be implementation details that make things different. I've read that in the Linux kernel, at some point (maybe even now), only one outstanding write barrier was possible in the stack of block devices. So I guess that a second write barrier request triggers a real disk flush. That's why, when you use fchmod() repeatedly, you see all those flushes. But technically it's a side effect, and I think on closer analysis you may notice it's always lagging one request behind, which you don't see just by looking at numbers or listening to disk noise.
So, multiple journal commits may really help in getting the disk cache flushed as a side effect, but I think the bug is elsewhere. The day Linux supports multiple outstanding write-barrier requests, that stops working. It's the block device that should be fixed, so that it performs a cache FLUSH when the filesystem asks for a flush. Why they don't do that today is a mystery to me, but I think there must be something I'm missing.

Anyway, the point here is that using LVM is no less safe than using an IDE block device directly. There may be filesystems that on fsync issue not only a flush request but also a journal commit, with attached barrier requests, thus getting the Right Thing done by a double side effect. And yes, ext3 is NOT among them, unless you trigger those commits with the fchmod() dance.