Thread: ext4 finally doing the right thing
A few months ago the worst of the bugs in the ext4 fsync code started clearing up, with http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=5f3481e9a80c240f169b36ea886e2325b9aeb745 as a particularly painful one. That made it into the 2.6.32 kernel released last month.

Some interesting benchmark news today suggests a version of ext4 that might actually work for databases is showing up in early packaged distributions: http://www.phoronix.com/scan.php?page=article&item=ubuntu_lucid_alpha2&num=3 -- along with the massive performance drop that comes from a working fsync.

See http://www.phoronix.com/scan.php?page=article&item=linux_perf_regressions&num=2 for background about this topic from when the issue was discovered:

"[This change] is required for safe behavior with volatile write caches on drives. You could mount with -o nobarrier and [the performance drop] would go away, but a sequence like write->fsync->lose power->reboot may well find your file without the data that you synced, if the drive had write caches enabled. If you know you have no write cache, or that it is safely battery backed, then you can mount with -o nobarrier, and not incur this penalty."

The pgbench TPS figure Phoronix has been reporting has always been a fictitious one resulting from unsafe write caching. With 2.6.32 released and ext4 defaulting to proper behavior on fsync, that's going to make for a very interesting change. On one side, we might finally be able to use regular drives with their caches turned on safely, taking advantage of the cache for other writes while doing the right thing with the database writes. On the other, anyone who believed the fictitious numbers before is in for a rude surprise and will think there's a massive regression here.

There's some potential for this to show PostgreSQL in a bad light, when people discover they really can only get ~100 commits/second out of cheap hard drives and assume the database is to blame. Interesting times.

-- Greg Smith 2ndQuadrant Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.com
On Fri, 2010-01-15 at 22:05 -0500, Greg Smith wrote:
> A few months ago the worst of the bugs in the ext4 fsync code started
> clearing up, with
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=5f3481e9a80c240f169b36ea886e2325b9aeb745
> as a particularly painful one.

Wow, thanks for the heads-up!

> On one side, we might finally be able to use regular drives with their
> caches turned on safely, taking advantage of the cache for other writes
> while doing the right thing with the database writes.

That could be good news. What's your opinion on the practical performance impact? If it doesn't need to be fsync'd, the kernel probably shouldn't have written it to the disk yet anyway, right (I'm assuming here that the OS buffer cache is much larger than the disk write cache)?

Regards, Jeff Davis
Jeff Davis wrote:
>> On one side, we might finally be able to use regular drives with their
>> caches turned on safely, taking advantage of the cache for other writes
>> while doing the right thing with the database writes.
>
> That could be good news. What's your opinion on the practical performance
> impact? If it doesn't need to be fsync'd, the kernel probably shouldn't
> have written it to the disk yet anyway, right (I'm assuming here that the
> OS buffer cache is much larger than the disk write cache)?
I know they just tweaked this area recently so this may be a bit out of date, but kernels starting with 2.6.22 allow you to get up to 10% of memory dirty before getting really aggressive about writing things out, with writes starting to go heavily at 5%. So even with a 1GB server, you could easily find 100MB of data sitting in the kernel buffer cache ahead of a database write that needs to hit disk. Once you start considering the case with modern hardware, where even my desktop has 8GB of RAM and most serious servers I see have 32GB, you can easily have gigabytes of such data queued in front of the write that now needs to hit the platter.
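[To put rough numbers on those thresholds, here is a small C sketch, not from the thread, that reads vm.dirty_background_ratio and vm.dirty_ratio from /proc and scales them against MemTotal. It is only an approximation: the kernel actually applies the percentages to dirtyable rather than total memory, and newer kernels also have dirty_bytes/dirty_background_bytes knobs that this ignores.]

    /* Rough estimate of the kernel's dirty-memory thresholds, read from
     * /proc/sys/vm/dirty_background_ratio, /proc/sys/vm/dirty_ratio and
     * the MemTotal line of /proc/meminfo. */
    #include <stdio.h>

    static long read_long(const char *path)
    {
        FILE *f = fopen(path, "r");
        long v = -1;

        if (f)
        {
            if (fscanf(f, "%ld", &v) != 1)
                v = -1;
            fclose(f);
        }
        return v;
    }

    int main(void)
    {
        long background = read_long("/proc/sys/vm/dirty_background_ratio");
        long ratio = read_long("/proc/sys/vm/dirty_ratio");
        long mem_kb = -1;
        char label[64];
        FILE *f = fopen("/proc/meminfo", "r");

        if (f)
        {
            /* First line of /proc/meminfo is "MemTotal:  <n> kB" */
            if (fscanf(f, "%63s %ld", label, &mem_kb) != 2)
                mem_kb = -1;
            fclose(f);
        }
        if (background < 0 || ratio < 0 || mem_kb < 0)
        {
            fprintf(stderr, "could not read /proc values\n");
            return 1;
        }
        printf("background writeback starts near %ld MB of dirty data\n",
               mem_kb * background / 100 / 1024);
        printf("writers get throttled near %ld MB of dirty data\n",
               mem_kb * ratio / 100 / 1024);
        return 0;
    }

On a 32GB box with the defaults mentioned above, that works out to background writeback kicking in around 1.6GB of dirty data and hard throttling only around 3.2GB.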
The dream is that a proper barrier implementation will then shuffle your important write to the front of that queue, without waiting for everything else to clear first. The exact performance impact depends on how many non-database writes happen. But even on a dedicated database disk, it should still help, because there are plenty of non-sync'd writes coming out of the background writer via its routine work, as well as the checkpoint writes. And the ability to fully utilize the write cache on the individual drives, on commodity hardware, without risking database corruption would make life a lot easier.
-- Greg Smith 2ndQuadrant Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.com
That doesn't sound right. The kernel having 10% of memory dirty doesn't mean there's a queue you have to jump at all. You don't get into any queue until the kernel initiates write-out, which will be based on the usage counters -- basically an LRU. fsync and cousins like sync_file_range and posix_fadvise(DONT_NEED) initiate write-out right away.
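[For reference, a minimal C sketch of the three calls mentioned here; the file name "testfile" and the 8 kB write are just placeholders, and sync_file_range() is Linux-specific, so _GNU_SOURCE is required.]

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[8192];
        int fd = open("testfile", O_WRONLY | O_CREAT | O_TRUNC, 0600);

        if (fd < 0)
        {
            perror("open");
            return 1;
        }
        memset(buf, 'x', sizeof buf);
        if (write(fd, buf, sizeof buf) != (ssize_t) sizeof buf)
        {
            perror("write");
            return 1;
        }

        /* fsync: supposed to return only once the data is on stable
         * storage, which is exactly what the barrier discussion is about. */
        if (fsync(fd) != 0)
            perror("fsync");

        /* sync_file_range: start write-out of this range now, without
         * waiting for it and without any durability guarantee. */
        if (sync_file_range(fd, 0, sizeof buf, SYNC_FILE_RANGE_WRITE) != 0)
            perror("sync_file_range");

        /* posix_fadvise(DONTNEED): we won't reread these pages, which also
         * tends to push any still-dirty ones toward the disk. */
        if (posix_fadvise(fd, 0, sizeof buf, POSIX_FADV_DONTNEED) != 0)
            fprintf(stderr, "posix_fadvise failed\n");

        close(fd);
        return 0;
    }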
How many pending write-out requests for how much data the kernel should keep active is another question but I imagine it has more to do with storage hardware than how much memory your system has. And for most hardware it's probably on the order of megabytes or less.
greg
Greg Stark wrote:
> That doesn't sound right. The kernel having 10% of memory dirty doesn't
> mean there's a queue you have to jump at all. You don't get into any
> queue until the kernel initiates write-out, which will be based on the
> usage counters -- basically an LRU. fsync and cousins like
> sync_file_range and posix_fadvise(DONT_NEED) initiate write-out right
> away.

Most safe ways ext3 knows how to initiate a write-out on something that must go (because it's gotten an fsync on data there) require flushing every outstanding write to that filesystem along with it. So as soon as a single WAL write shows up, bam! The whole cache is emptied (or at least everything associated with that filesystem), and the caller who asked for that little write is stuck waiting for everything to clear before their fsync returns success.

This particular issue absolutely killed Firefox when they switched to using SQLite not too long ago; high-level discussion at http://shaver.off.net/diary/2008/05/25/fsyncers-and-curveballs/ and confirmation/discussion of the issue on lkml at https://kerneltrap.org/mailarchive/linux-fsdevel/2008/5/26/1941354 . Note the comment from the first article saying "those delays can be 30 seconds or more".

On multiple occasions, I've measured systems with dozens of disks in a high-performance RAID1+0 with a battery-backed controller that could grind to a halt for 10, 20, or more seconds in this situation when running pgbench on a big database. As was the case on the latest one I saw, if you've got 32GB of RAM and have let 3.2GB of random I/O from background writer/checkpoint writes back up because Linux has been lazy about getting to them, that takes a while to clear no matter how good the underlying hardware is.

Write barriers were supposed to improve all this when added to ext3, but they just never seemed to work right for many people. After reading that lkml thread, among others, I was left not trusting anything beyond the simplest path through this area of the filesystem. Slow is better than corrupted.

So the good news I was relaying is that it looks like this finally works on ext4, giving it the behavior you described and expected, but that hasn't actually been there until now. I was hoping someone with more free time than me might be interested in investigating further if I pointed the advance out. I'm stuck with too many production systems to play with new kernels at the moment, but am quite curious.

-- Greg Smith 2ndQuadrant Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.com
Both of those refer to the *drive* cache.
greg
* Greg Smith <greg@2ndquadrant.com> [100121 00:58]:
> Greg Stark wrote:
>> That doesn't sound right. The kernel having 10% of memory dirty doesn't
>> mean there's a queue you have to jump at all. You don't get into any
>> queue until the kernel initiates write-out, which will be based on the
>> usage counters -- basically an LRU. fsync and cousins like
>> sync_file_range and posix_fadvise(DONT_NEED) initiate write-out right
>> away.
>
> Most safe ways ext3 knows how to initiate a write-out on something that
> must go (because it's gotten an fsync on data there) require flushing
> every outstanding write to that filesystem along with it. So as soon as
> a single WAL write shows up, bam! The whole cache is emptied (or at
> least everything associated with that filesystem), and the caller who
> asked for that little write is stuck waiting for everything to clear
> before their fsync returns success.

Sure, if your WAL is on the same FS as your data, you're going to get hit, and *especially* on ext3...

But, I think that's one of the reasons people usually recommend putting WAL separate. Even if it's just another partition on the same (set of) disk(s), you get the benefit of not having to wait for all the dirty ext3 pages from your whole database FS to be flushed before the WAL write can complete on its own FS.

a. -- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
* Greg Smith:
> Note the comment from the first article saying "those delays can be 30
> seconds or more". On multiple occasions, I've measured systems with
> dozens of disks in a high-performance RAID1+0 with a battery-backed
> controller that could grind to a halt for 10, 20, or more seconds in
> this situation when running pgbench on a big database.

We see that quite a bit, too (we're still on ext3, mostly 2.6.26ish kernels). It seems that the most egregious issues (which even trigger the two-minute kernel hangcheck timer) are related to CFQ. We don't see it on systems we have switched to the deadline I/O scheduler. But data on this is a bit sketchy.

-- Florian Weimer <fweimer@bfk.de> BFK edv-consulting GmbH http://www.bfk.de/ Kriegsstraße 100 tel: +49-721-96201-1 D-76133 Karlsruhe fax: +49-721-96201-99
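[For anyone wanting to try the same switch, a rough C sketch of flipping one device to the deadline scheduler through sysfs follows. "sda" is a placeholder, root is required, and the same thing is usually done by echoing into that file or by booting with elevator=deadline.]

    #include <stdio.h>

    int main(void)
    {
        const char *path = "/sys/block/sda/queue/scheduler";
        char line[256];
        FILE *f = fopen(path, "r");

        if (f == NULL)
        {
            perror("read scheduler");
            return 1;
        }
        if (fgets(line, sizeof line, f))
            printf("current: %s", line);   /* active scheduler is in brackets */
        fclose(f);

        /* Writing a scheduler name to this file selects it for the device. */
        f = fopen(path, "w");
        if (f == NULL || fputs("deadline\n", f) == EOF)
        {
            perror("set scheduler");
            if (f)
                fclose(f);
            return 1;
        }
        fclose(f);
        printf("deadline scheduler selected for sda\n");
        return 0;
    }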
Aidan Van Dyk wrote:
> Sure, if your WAL is on the same FS as your data, you're going to get
> hit, and *especially* on ext3...
>
> But, I think that's one of the reasons people usually recommend putting
> WAL separate.

Separate disks can actually concentrate the problem. The writes to the data disk by checkpoints will also have fsync behind them eventually, so splitting out the WAL means you just push the big write backlog to a later point. So the performance dives happen less frequently, but are sometimes bigger. All of the systems I was mentioning seeing >10 second pauses on had a RAID-1 pair of WAL disks split from the main array.

-- Greg Smith 2ndQuadrant Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.com
* Greg Smith <greg@2ndquadrant.com> [100121 09:49]:
> Aidan Van Dyk wrote:
>> Sure, if your WAL is on the same FS as your data, you're going to get
>> hit, and *especially* on ext3...
>>
>> But, I think that's one of the reasons people usually recommend putting
>> WAL separate.
>
> Separate disks can actually concentrate the problem. The writes to the
> data disk by checkpoints will also have fsync behind them eventually, so
> splitting out the WAL means you just push the big write backlog to a
> later point. So the performance dives happen less frequently, but are
> sometimes bigger. All of the systems I was mentioning seeing >10 second
> pauses on had a RAID-1 pair of WAL disks split from the main array.

That's right, so with the WAL split off on its own disk, you don't wait on "WAL" for your checkpoint/data syncs, but you can build up a huge wait in the queue for main data (which can even block reads). Having WAL on the main disk means that (for most ext3), you sometimes have WAL writes taking longer, but the WAL fsyncs are keeping the backlog "down" in the main data area too.

Now, with ext4 moving to full barrier/fsync support, we could get to the point where WAL in the main data FS can mimic the state where WAL is separate, namely that WAL writes can "jump the queue" and be written without waiting for the data pages to be flushed down to disk, but also that you'll get the big backlog of data pages to flush when the first fsyncs on big data files start coming from checkpoints...

a. -- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
Aidan Van Dyk <aidan@highrise.ca> wrote:
> But, I think that's one of the reasons people usually recommend
> putting WAL separate. Even if it's just another partition on the
> same (set of) disk(s), you get the benefit of not having to wait
> for all the dirty ext3 pages from your whole database FS to be
> flushed before the WAL write can complete on its own FS.

[slaps forehead] I've been puzzling about why we're getting timeouts on one of two apparently identical (large) servers. We forgot to move the pg_xlog directory to the separate mount point we created for it on the same RAID. I didn't think to check that until I saw your post.

-Kevin
> Now, with ext4 moving to full barrier/fsync support, we could get to the
> point where WAL in the main data FS can mimic the state where WAL is
> separate, namely that WAL writes can "jump the queue" and be written
> without waiting for the data pages to be flushed down to disk, but also
> that you'll get the big backlog of data pages to flush when the first
> fsyncs on big data files start coming from checkpoints...

Does postgres write something to the logfile whenever an fsync() takes a suspiciously long amount of time?
Pierre Frédéric Caillaud wrote:
> Does postgres write something to the logfile whenever an fsync() takes a
> suspiciously long amount of time?

Not specifically. If you're logging statements that take a while, you can see this indirectly, via commits that just take much longer than usual. If you turn on log_checkpoints, the "sync time" is broken out for you, so problems in this area can show up there too.

-- Greg Smith 2ndQuadrant Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.com
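[A minimal postgresql.conf sketch of the two settings mentioned above; the one-second threshold is just an illustration, not a recommendation.]

    # Log any statement, commits included, that runs longer than one second.
    log_min_duration_statement = 1000    # in milliseconds

    # Log every checkpoint, with the write/sync/total timings broken out.
    log_checkpoints = on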