Thread: BBU Cache vs. spindles
I'm weighing options for a new server. In addition to PostgreSQL, this machine will handle some modest Samba and Rsync load. I will have enough RAM so that virtually all disk-read activity will be cached. The average PostgreSQL read activity will be modest - a mix of single-record and fairly large (reporting) result-sets. Writes will be modest as well but will come in brief (1-5 second) bursts of individual inserts. The rate of insert requests will hit 100-200/second for those brief bursts. So... Am I likely to be better off putting $$$ toward battery-backup on the RAID or toward adding a second RAID-set and splitting off the WAL traffic? Or something else? Cheers, Steve
On Oct 7, 2010, at 4:38 PM, Steve Crawford wrote: > I'm weighing options for a new server. In addition to PostgreSQL, this machine will handle some modest Samba and Rsync load. > > I will have enough RAM so that virtually all disk-read activity will be cached. The average PostgreSQL read activity will be modest - a mix of single-record and fairly large (reporting) result-sets. Writes will be modest as well but will come in brief (1-5 second) bursts of individual inserts. The rate of insert requests will hit 100-200/second for those brief bursts. > > So... > > Am I likely to be better off putting $$$ toward battery-backup on the RAID or toward adding a second RAID-set and splitting off the WAL traffic? Or something else? A BBU is, what, $100 or so? Adding one seems a no-brainer to me. Dedicated WAL spindles are nice and all, but they're still spinning media. Raid card cache is waaaay faster, and while it's best at bursty writes, it sounds like bursty writes are precisely what you have.
Ben Chobot wrote: > On Oct 7, 2010, at 4:38 PM, Steve Crawford wrote: > > > I'm weighing options for a new server. In addition to PostgreSQL, this machine will handle some modest Samba and Rsync load. > > > > I will have enough RAM so that virtually all disk-read activity will be cached. The average PostgreSQL read activity will be modest - a mix of single-record and fairly large (reporting) result-sets. Writes will be modest as well but will come in brief (1-5 second) bursts of individual inserts. The rate of insert requests will hit 100-200/second for those brief bursts. > > > > So... > > > > Am I likely to be better off putting $$$ toward battery-backup on the RAID or toward adding a second RAID-set and splitting off the WAL traffic? Or something else? > > A BBU is, what, $100 or so? Adding one seems a no-brainer to me. > Dedicated WAL spindles are nice and all, but they're still spinning > media. Raid card cache is waaaay faster, and while it's best at bursty > writes, it sounds like bursty writes are precisely what you have. Totally agree! -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On Wed, 2010-10-20 at 22:13 -0400, Bruce Momjian wrote: > Ben Chobot wrote: > > On Oct 7, 2010, at 4:38 PM, Steve Crawford wrote: > > > > > I'm weighing options for a new server. In addition to PostgreSQL, this machine will handle some modest Samba and Rsync load. > > > > > > I will have enough RAM so that virtually all disk-read activity will be cached. The average PostgreSQL read activity will be modest - a mix of single-record and fairly large (reporting) result-sets. Writes will be modest as well but will come in brief (1-5 second) bursts of individual inserts. The rate of insert requests will hit 100-200/second for those brief bursts. > > > > > > So... > > > > > > Am I likely to be better off putting $$$ toward battery-backup on the RAID or toward adding a second RAID-set and splitting off the WAL traffic? Or something else? > > > > A BBU is, what, $100 or so? Adding one seems a no-brainer to me. > > Dedicated WAL spindles are nice and all, but they're still spinning > > media. Raid card cache is waaaay faster, and while it's best at bursty > > writes, it sounds like bursty writes are precisely what you have. > > Totally agree! BBU first, more spindles second. > > -- > Bruce Momjian <bruce@momjian.us> http://momjian.us > EnterpriseDB http://enterprisedb.com > > + It's impossible for everything to be true. + > -- PostgreSQL.org Major Contributor Command Prompt, Inc: http://www.commandprompt.com/ - 509.416.6579 Consulting, Training, Support, Custom Development, Engineering http://twitter.com/cmdpromptinc | http://identi.ca/commandprompt
On Wed, Oct 20, 2010 at 8:25 PM, Joshua D. Drake <jd@commandprompt.com> wrote: > On Wed, 2010-10-20 at 22:13 -0400, Bruce Momjian wrote: >> Ben Chobot wrote: >> > On Oct 7, 2010, at 4:38 PM, Steve Crawford wrote: >> > >> > > I'm weighing options for a new server. In addition to PostgreSQL, this machine will handle some modest Samba and Rsync load. >> > > >> > > I will have enough RAM so that virtually all disk-read activity will be cached. The average PostgreSQL read activity will be modest - a mix of single-record and fairly large (reporting) result-sets. Writes will be modest as well but will come in brief (1-5 second) bursts of individual inserts. The rate of insert requests will hit 100-200/second for those brief bursts. >> > > >> > > So... >> > > >> > > Am I likely to be better off putting $$$ toward battery-backup on the RAID or toward adding a second RAID-set and splitting off the WAL traffic? Or something else? >> > >> > A BBU is, what, $100 or so? Adding one seems a no-brainer to me. >> > Dedicated WAL spindles are nice and all, but they're still spinning >> > media. Raid card cache is waaaay faster, and while it's best at bursty >> > writes, it sounds like bursty writes are precisely what you have. >> >> Totally agree! > > BBU first, more spindles second. Agreed. Note that while you can get incredible burst performance from a battery backed cache, due to both caching and writing out of order, once the throughput begins to saturate at the speed of the disk array, the bbu cache is now only re-ordering really, as it will eventually fill up faster than the disks can take the writes, and you'll settle in at some percentage of your max tps you get for a short benchmark run. It's vitally important that once you put a BBU cache in place, you run a very long running transactional test (pgbench is a simple one to start with) that floods the io subsystem so you see what your average throughput is with the WAL and data store getting flooded. I know on my system pgbench runs of a few minutes can be 3 or 4 times faster than runs that last for the better part of an hour.
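For anyone wanting to reproduce the kind of sustained test Scott describes, a minimal pgbench sketch might look like the following; the database name, scale factor, client count, and run times are placeholders to adjust for the hardware under test:

    # Build a test database noticeably larger than the controller cache
    # (scale 100 is roughly 1.5GB of data; adjust to taste).
    createdb pgbench_test
    pgbench -i -s 100 pgbench_test

    # Short run: mostly measures how fast the BBU cache can absorb bursts.
    pgbench -c 16 -T 300 pgbench_test

    # Long run: keeps the cache full so throughput settles at what the
    # spindles can actually sustain with WAL and data writes competing.
    pgbench -c 16 -T 3600 pgbench_test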
On 10/20/2010 09:45 PM, Scott Marlowe wrote: > On Wed, Oct 20, 2010 at 8:25 PM, Joshua D. Drake<jd@commandprompt.com> wrote: > >> On Wed, 2010-10-20 at 22:13 -0400, Bruce Momjian wrote: >> >>> Ben Chobot wrote: >>> >>>> On Oct 7, 2010, at 4:38 PM, Steve Crawford wrote: >>>> >>>> >>>>> I'm weighing options for a new server. In addition to PostgreSQL, this machine will handle some modest Samba and Rsync load. >>>>> >>>>> I will have enough RAM so that virtually all disk-read activity will be cached. The average PostgreSQL read activity will be modest - a mix of single-record and fairly large (reporting) result-sets. Writes will be modest as well but will come in brief (1-5 second) bursts of individual inserts. The rate of insert requests will hit 100-200/second for those brief bursts. >>>>> >>>>> So... >>>>> >>>>> Am I likely to be better off putting $$$ toward battery-backup on the RAID or toward adding a second RAID-set and splitting off the WAL traffic? Or something else? >>>>> >>>> A BBU is, what, $100 or so? Adding one seems a no-brainer to me. >>>> Dedicated WAL spindles are nice and all, but they're still spinning >>>> media. Raid card cache is waaaay faster, and while it's best at bursty >>>> writes, it sounds like bursty writes are precisely what you have. >>>> >>> Totally agree! >>> >> BBU first, more spindles second. >> > Agreed. Note that while you can get incredible burst performance from > a battery backed cache, due to both caching and writing out of order, > once the throughput begins to saturate at the speed of the disk array, > the bbu cache is now only re-ordering really, as it will eventually > fill up faster than the disks can take the writes, and you'll settle > in at some percentage of your max tps you get for a short benchmark > run. It's vitally important that once you put a BBU cache in place, > you run a very long running transactional test (pgbench is a simple > one to start with) that floods the io subsystem so you see what your > average throughput is with the WAL and data store getting flooded. I > know on my system pgbench runs of a few minutes can be 3 or 4 times > faster than runs that last for the better part of an hour. > > Thanks for all the replies. This is what I suspected but since I can't just buy one of everything to try, I wanted a sanity-check before spending the $$$. I am not too worried about saturating the controller cache as the current much lower spec machine can handle the sustained load just fine and the bursts are typically only 1-3 seconds long spaced a minute or more apart. Cheers, Steve
Scott Marlowe wrote: > On Wed, Oct 20, 2010 at 8:25 PM, Joshua D. Drake <jd@commandprompt.com> wrote: > > On Wed, 2010-10-20 at 22:13 -0400, Bruce Momjian wrote: > >> Ben Chobot wrote: > >> > On Oct 7, 2010, at 4:38 PM, Steve Crawford wrote: > >> > > >> > > I'm weighing options for a new server. In addition to PostgreSQL, this machine will handle some modest Samba and Rsync load. > >> > > > >> > > I will have enough RAM so that virtually all disk-read activity will be cached. The average PostgreSQL read activity will be modest - a mix of single-record and fairly large (reporting) result-sets. Writes will be modest as well but will come in brief (1-5 second) bursts of individual inserts. The rate of insert requests will hit 100-200/second for those brief bursts. > >> > > > >> > > So... > >> > > > >> > > Am I likely to be better off putting $$$ toward battery-backup on the RAID or toward adding a second RAID-set and splitting off the WAL traffic? Or something else? > >> > > >> > A BBU is, what, $100 or so? Adding one seems a no-brainer to me. > >> > Dedicated WAL spindles are nice and all, but they're still spinning > >> > media. Raid card cache is waaaay faster, and while it's best at bursty > >> > writes, it sounds like bursty writes are precisely what you have. > >> > >> Totally agree! > > > > BBU first, more spindles second. > > Agreed. Note that while you can get incredible burst performance from > a battery backed cache, due to both caching and writing out of order, > once the throughput begins to saturate at the speed of the disk array, > the bbu cache is now only re-ordering really, as it will eventually > fill up faster than the disks can take the writes, and you'll settle > in at some percentage of your max tps you get for a short benchmark > run. It's vitally important that once you put a BBU cache in place, > you run a very long running transactional test (pgbench is a simple > one to start with) that floods the io subsystem so you see what your > average throughput is with the WAL and data store getting flooded. I > know on my system pgbench runs of a few minutes can be 3 or 4 times > faster than runs that last for the better part of an hour. With a BBU you can turn off full_page_writes, which should decrease the WAL traffic. However, I don't see this mentioned in our documentation. Should I add it? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
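For readers wondering where the setting Bruce mentions lives: full_page_writes is an ordinary postgresql.conf GUC that takes effect on a reload rather than a restart. A minimal sketch (the data directory path is a placeholder, and note that the rest of this thread questions whether turning it off is actually safe even with a BBU):

    # Check the current value.
    psql -c "SHOW full_page_writes;"

    # To change it, set the following line in postgresql.conf ...
    #   full_page_writes = off
    # ... then signal the server to re-read its configuration.
    pg_ctl reload -D /path/to/data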
Bruce Momjian <bruce@momjian.us> wrote: > With a BBU you can turn off full_page_writes My understanding is that that is not without risk. What happens if the WAL is written, there is a commit, but the data page has not yet been written to the controller? Don't we still have a torn page? -Kevin
Kevin Grittner wrote: > Bruce Momjian <bruce@momjian.us> wrote: > > > With a BBU you can turn off full_page_writes > > My understanding is that that is not without risk. What happens if > the WAL is written, there is a commit, but the data page has not yet > been written to the controller? Don't we still have a torn page? I don't see how full_page_writes affects pages that have not been written to the controller. full_page_writes is designed to guard against a partial write to a device. I don't think the raid cache can be partially written to, and the cache will not be cleared until the drive has fully written the data to disk. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Bruce Momjian wrote: > With a BBU you can turn off full_page_writes, which should decrease the > WAL traffic. > > However, I don't see this mentioned in our documentation. Should I add > it? > What I would like to do is beef up the documentation with some concrete examples of how to figure out if your cache and associated write path are working reliably or not. It should be possible to include "does this handle full page writes correctly?" in that test suite. Until we have something like that, I'm concerned that bugs in filesystem or controller handling may make full_page_writes unsafe even with a BBU, and we'd have no way for people to tell if that's true or not. -- Greg Smith, 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us
Bruce Momjian <bruce@momjian.us> wrote: > full_page_writes is designed to guard against a partial write to a > device. I don't think the raid cache can be partially written to So you're confident that an 8kB write to the controller will not be done as a series of smaller atomic writes by the OS file system? -Kevin
Kevin Grittner wrote: > Bruce Momjian <bruce@momjian.us> wrote: > > >> full_page_writes is designed to guard against a partial write to a >> device. I don't think the raid cache can be partially written to >> > > So you're confident that an 8kB write to the controller will not be > done as a series of smaller atomic writes by the OS file system? Sure, that happens. But if the BBU has gotten an fsync call after the 8K write, it shouldn't return success until after all 8K are in its cache. That's why full_page_writes should be safe on a system with BBU as Bruce is suggesting. But I'd like to see some independent proof of that fact, that includes some targeted tests users can run, before we start recommending that practice. -- Greg Smith, 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us
Greg Smith <greg@2ndquadrant.com> wrote: > Kevin Grittner wrote: >> So you're confident that an 8kB write to the controller will not >> be done as a series of smaller atomic writes by the OS file >> system? > > Sure, that happens. But if the BBU has gotten an fsync call after > the 8K write, it shouldn't return success until after all 8K are > in its cache. I'm not concerned about an fsync after the controller has it; I'm concerned about a system crash in the middle of writing an 8K page to the controller. Other than the expected *size* of the window of time during which you're vulnerable, what does a BBU caching controller buy you in this regard? Can't the OS rearrange the writes of disk sectors after the 8K page is written to the OS cache so that the window might occasionally be rather large? -Kevin
Kevin Grittner wrote: > Greg Smith <greg@2ndquadrant.com> wrote: > > Kevin Grittner wrote: > > >> So you're confident that an 8kB write to the controller will not > >> be done as a series of smaller atomic writes by the OS file > >> system? > > > > Sure, that happens. But if the BBU has gotten an fsync call after > > the 8K write, it shouldn't return success until after all 8K are > > in its cache. > > I'm not concerned about an fsync after the controller has it; I'm > concerned about a system crash in the middle of writing an 8K page > to the controller. Other than the expected *size* of the window of > time during which you're vulnerable, what does a BBU caching > controller buy you in this regard? Can't the OS rearrange the > writes of disk sectors after the 8K page is written to the OS cache > so that the window might occasionally be rather large? If the write fails to the controller, the page is not flushed and PG does not continue. If the write fails, the fsync never happens, and hence PG stops. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Bruce Momjian <bruce@momjian.us> wrote: > If the write fails to the controller, the page is not flushed and > PG does not continue. If the write fails, the fsync never > happens, and hence PG stops. PG stops? This case at issue is when the OS crashes or the plug is pulled in the middle of writing a page. I don't think PG will normally have the option of a graceful stop after that. To quote the Fine Manual: http://www.postgresql.org/docs/current/interactive/runtime-config-wal.html#GUC-FULL-PAGE-WRITES | a page write that is in process during an operating system crash | might be only partially completed, leading to an on-disk page that | contains a mix of old and new data. The row-level change data | normally stored in WAL will not be enough to completely restore | such a page during post-crash recovery. Storing the full page | image guarantees that the page can be correctly restored Like I said, the only difference between the page being written to platters and to a BBU cache that I can see is the average size of the window of time in which you're vulnerable, not whether there is a window. I don't think you've really addressed that concern. -Kevin
Kevin Grittner wrote: > Bruce Momjian <bruce@momjian.us> wrote: > > > If the write fails to the controller, the page is not flushed and > > PG does not continue. If the write fails, the fsync never > > happens, and hence PG stops. > > PG stops? This case at issue is when the OS crashes or the plug is > pulled in the middle of writing a page. I don't think PG will > normally have the option of a graceful stop after that. To quote > the Fine Manual: If the OS crashes during a write or fsync, we have not committed the transaction. > > http://www.postgresql.org/docs/current/interactive/runtime-config-wal.html#GUC-FULL-PAGE-WRITES > > | a page write that is in process during an operating system crash > | might be only partially completed, leading to an on-disk page that > | contains a mix of old and new data. The row-level change data > | normally stored in WAL will not be enough to completely restore > | such a page during post-crash recovery. Storing the full page > | image guarantees that the page can be correctly restored > > Like I said, the only difference between the page being written to > platters and to a BBU cache that I can see is the average size of > the window of time in which you're vulnerable, not whether there is > a window. I don't think you've really addressed that concern. I assume we send a full 8k to the controller, and a failure during that write is not registered as a write. A disk drive is modifying permanent storage so there is always the possibility of that failing. I assume the BBU just rewrites that after startup. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Bruce Momjian <bruce@momjian.us> wrote: > I assume we send a full 8k to the controller, and a failure during > that write is not registered as a write. On what do you base that assumption? I assume that we send a full 8K to the OS cache, and the file system writes disk sectors according to its own algorithm. With either platters or BBU cache, the data is persisted on fsync; why do you see a risk with one but not the other? -Kevin
Kevin Grittner wrote: > Bruce Momjian <bruce@momjian.us> wrote: > > > I assume we send a full 8k to the controller, and a failure during > > that write is not registered as a write. > > On what do you base that assumption? I assume that we send a full > 8K to the OS cache, and the file system writes disk sectors > according to its own algorithm. With either platters or BBU cache, > the data is persisted on fsync; why do you see a risk with one but > not the other? Now that is an interesting question. We write 8k to the kernel, but the kernel doesn't have to honor those write sizes, so while we probably can't get a partial 512-byte block written to disk with a BBU (one that isn't cleaned up by the BBU on reboot), we could get some 512-byte blocks of an 8k page written and others not. I agree you are right and a BBU does not mean you can safely turn off full_page_writes. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On Thursday 21 October 2010 21:42:06 Kevin Grittner wrote: > Bruce Momjian <bruce@momjian.us> wrote: > > I assume we send a full 8k to the controller, and a failure during > > that write is not registered as a write. > > On what do you base that assumption? I assume that we send a full > 8K to the OS cache, and the file system writes disk sectors > according to its own algorithm. With either platters or BBU cache, > the data is persisted on fsync; why do you see a risk with one but > not the other? At least on Linux, pages can certainly get written out in < 8kb batches if you're under memory pressure. Andres
Kevin Grittner wrote: > I assume that we send a full > 8K to the OS cache, and the file system writes disk sectors > according to its own algorithm. With either platters or BBU cache, > the data is persisted on fsync; why do you see a risk with one but > not the other I'd like a 10 minute argument please. I started to write something to refute this, only to clarify in my head the sequence of events that leads to the most questionable result, where I feel a bit less certain than I did before of the safety here. Here is the worst case I believe you're describing: 1) Transaction is written to the WAL and sync'd; client receives COMMIT. Since full_page_writes is off, the data in the WAL consists only of the delta of what changed on the page. 2) 8K database page is written to OS cache 3) PG calls fsync to force the database block out 4) OS writes first 4K block of the change to the BBU write cache. Worst case, this fills the cache, and it takes a moment for some random writes to process before it has space to buffer again (makes this more likely to happen, but it's not required to see the failure case here) 5) Sudden power interruption, second half of the page write is lost 6) Server restarts 7) That 4K write is now replayed from the battery's cache At this point, you now have a torn 8K page, with 1/2 old and 1/2 new data. Without a full page write in the WAL, is it always possible to restore its original state now? In theory, I think you can. Since the delta in the WAL should be overwriting all of the bytes that changed between the old and new version of the page, applying it on top of any of the four possible states here: 1) None of the data was written to the database page yet 2) The first 4K of data was written out 3) The second 4K of data was written out 4) All 8K was actually written out should lead to the same result: an 8K page that includes the change that was in the WAL but had not made it onto disk at the point when the crash happened. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
Greg Smith <greg@2ndquadrant.com> writes: > At this point, you now have a torn 8K page, with 1/2 old and 1/2 new > data. Right. > Without a full page write in the WAL, is it always possible to > restore its original state now? In theory, I think you do. Since the > delta in the WAL should be overwriting all of the bytes that changed > between the old and new version of the page, applying it on top of any > four possible states here: You've got entirely too simplistic a view of what the "delta" might be, I fear. In particular there are various sorts of changes that involve inserting the data carried in the WAL record and shifting pre-existing data around to make room, or removing an item and moving remaining data around. If you try to replay that type of action against a torn page, you'll get corrupted results. We could possibly redefine the WAL records so that they weren't just the minimum amount of data but carried every byte that'd changed on the page, and then I think what you're envisioning would work. But the records would be a lot bulkier. It's not clear to me that this would be a net savings over the current design, particularly not if there's a long interval between checkpoints. regards, tom lane
Kevin Grittner wrote: > With either platters or BBU cache, > the data is persisted on fsync; why do you see a risk with one but > not the other Forgot to address this part. The troublesome sequence if you don't have a BBU is: 1) WAL data is written to the OS cache 2) PG calls fsync 3) Data is transferred into the drive's volatile, non-battery-backed cache 4) Drive lies about data being on disk, says fsync is done 5) That 8K data page is written out to the OS cache, also with fsync, then onto the drive. It says it has that too. 6) Due to current disk head location, 4KB of the data page gets written out before it gets to the WAL data 7) System crashes Now you're dead. You've just torn a data page, but not written any of the data to the WAL necessary to reconstruct any valid version of that page. I think Kevin's point here may be that if your fsync isn't reliable, you're always in trouble. But if your fsync is good, even torn pages should be repairable by the deltas written to the WAL, as I described in the message I just sent before this one. That's true regardless of whether you achieved "non-lying fsync" with a BBU or by turning a drive's write cache off. There's nothing really special about the BBU beyond it being the most common form of reliable write cache that works. You get the same properties at a slower rate with a drive that's configured to never lie about writes. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
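As a concrete illustration of the "drive that's configured to never lie about writes" alternative Greg mentions, on Linux with a plain SATA disk the on-drive write cache can be toggled with hdparm; the device name here is a placeholder, and SAS/SCSI drives or RAID controllers have their own vendor tools for the same job:

    # Show whether the drive's volatile write cache is enabled.
    hdparm -W /dev/sda

    # Disable it so a completed fsync really means the data is on the
    # platters. Expect a large drop in write throughput without a BBU
    # RAID card in front of the disks.
    hdparm -W 0 /dev/sda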
Greg Smith <greg@2ndquadrant.com> wrote: > I think Kevin's point here may be that if your fsync isn't > reliable, you're always in trouble. But if your fsync is good, > even torn pages should be repairable by the deltas written to the > WAL I was actually just arguing that a BBU doesn't eliminate a risk here; if there is a risk with production-quality disk drives, there is a risk with a controller with a BBU cache. The BBU cache just tends to reduce the window of time in which corruption can occur. I wasn't too sure of *why* there was a risk, but Tom's post cleared that up. I wonder why we need to expose this GUC at all -- perhaps it should be off when fsync is off and on otherwise? Leaving it on without fsync is just harming performance for not much benefit, and turning it off with fsync seems to be saying that you are willing to tolerate a known risk of database corruption, just not quite so much as you have without fsync. In reality it seems most likely to be a mistake, either way. -Kevin
Tom Lane wrote: > You've got entirely too simplistic a view of what the "delta" might be, > I fear. In particular there are various sorts of changes that involve > inserting the data carried in the WAL record and shifting pre-existing > data around to make room, or removing an item and moving remaining data > around. If you try to replay that type of action against a torn page, > you'll get corrupted results. > I wasn't sure exactly how those were encoded, thanks for the clarification. Given that, it seems to me there are only two situations where full_page_writes is safe to turn off: 1) The operating system block size is exactly the same as the database block size, and all writes are guaranteed to be atomic to that block size. 2) You're using a form of journaled filesystem where data blocks are never updated, they're always written elsewhere and the filesystem is redirected to that new block once it's on disk. Looks to me like whether or not there's a non-volatile write cache sitting in the middle, like a BBU protected RAID card, doesn't really make any difference here then. I think that most people who have thought they were safe to turn off full_page_writes in the past did so because they believed they were in category (1) here. I've never advised anyone to do that, because it's so difficult to validate the truth of. Just given that, I'd be tempted to join in on suggesting this parameter just go away in the name of safety, except that I think category (2) here is growing now. ZFS is the most obvious example where the atomic write implementation seems to always make disabling full_page_writes safe. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
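For anyone wanting to at least check whether they might be in category (1), the database and filesystem block sizes can be compared directly; a sketch for Linux with ext2/3/4 (device names are placeholders, and matching sizes alone still does not prove that writes are atomic at that size):

    # PostgreSQL page size: 8192 unless built with a non-default --with-blocksize.
    psql -c "SHOW block_size;"

    # Filesystem block size for an ext2/3/4 volume.
    tune2fs -l /dev/sda1 | grep "Block size"

    # Block size the kernel uses for the underlying block device.
    blockdev --getbsz /dev/sda1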
On 2010-10-22 17:37, Greg Smith wrote:
> I think that most people who have thought they were safe to turn off
> full_page_writes in the past did so because they believed they were
> in category (1) here. I've never advised anyone to do that, because
> it's so difficult to validate the truth of. Just given that, I'd be
> tempted to join in on suggesting this parameter just go away in the
> name of safety, except that I think category (2) here is growing now.
> ZFS is the most obvious example where the atomic write implementation
> seems to always make disabling full_page_writes safe.
Can you point to some ZFS docs that say this is the case? I'd be surprised
if it doesn't copy away the old block and replace it with the new one in place. The
other behaviour would quite quickly lead to a hugely fragmented filesystem that
performs next to uselessly, and ZFS doesn't seem to be in that category...
... All given my total lack of insight into ZFS.
--
Jesper
On Fri, Oct 22, 2010 at 8:37 AM, Greg Smith <greg@2ndquadrant.com> wrote: > Tom Lane wrote: >> >> You've got entirely too simplistic a view of what the "delta" might be, >> I fear. In particular there are various sorts of changes that involve >> inserting the data carried in the WAL record and shifting pre-existing >> data around to make room, or removing an item and moving remaining data >> around. If you try to replay that type of action against a torn page, >> you'll get corrupted results. >> > > I wasn't sure exactly how those were encoded, thanks for the clarification. > Given that, it seems to me there are only two situations where > full_page_writes is safe to turn off: > > 1) The operating system block size is exactly the same as the database block size, > and all writes are guaranteed to be atomic to that block size. > 2) You're using a form of journaled filesystem where data blocks are never > updated, they're always written elsewhere and the filesystem is redirected > to that new block once it's on disk. > > Looks to me like whether or not there's a non-volatile write cache sitting > in the middle, like a BBU protected RAID card, doesn't really make any > difference here then. > > I think that most people who have thought they were safe to turn off > full_page_writes in the past did so because they believed they were in > category (1) here. I've never advised anyone to do that, because it's so > difficult to validate the truth of. Just given that, I'd be tempted to join > in on suggesting this parameter just go away in the name of safety, except > that I think category (2) here is growing now. ZFS is the most obvious > example where the atomic write implementation seems to always make disabling > full_page_writes safe. > For the sake of argument, has PG considered using a double write buffer similar to InnoDB? -- Rob Wultsch wultsch@gmail.com
Rob Wultsch <wultsch@gmail.com> wrote: > has PG considered using a double write buffer similar to InnodB? That seems inferior to the full_page_writes strategy, where you only write a page twice the first time it is written after a checkpoint. We're talking about when we might be able to write *less*, not more. -Kevin
On Fri, Oct 22, 2010 at 10:28 AM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: > Rob Wultsch <wultsch@gmail.com> wrote: > >> has PG considered using a double write buffer similar to InnodB? > > That seems inferior to the full_page_writes strategy, where you only > write a page twice the first time it is written after a checkpoint. > We're talking about when we might be able to write *less*, not more. > > -Kevin > By "write" do you mean number of writes, or the number of bytes of the writes? For number of writes, yes a double write buffer will lose. In terms of number of bytes, I would think full_page_writes=off + double write buffer should be far superior, particularly given that the WAL is shipped over the network to slaves. -- Rob Wultsch wultsch@gmail.com
Rob Wultsch <wultsch@gmail.com> wrote: > I would think full_page_writes=off + double write buffer should be > far superior, particularly given that the WAL is shipped over the > network to slaves. For a reasonably brief description of InnoDB double write buffers, I found this: http://www.mysqlperformanceblog.com/2006/08/04/innodb-double-write/ One big question before even considering this would be how to determine whether a potentially torn page "is inconsistent". Without a page CRC or some such mechanism, I don't see how this technique is possible. Even if it's possible, it's far from clear to me that it would be an improvement. The author estimates (apparently somewhat loosely) that it's a 5% to 10% performance hit in InnoDB; I'm far from certain that full_page_writes cost us that much. Does anyone have benchmark numbers handy? -Kevin
On Fri, Oct 22, 2010 at 12:05 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: > Rob Wultsch <wultsch@gmail.com> wrote: > >> I would think full_page_writes=off + double write buffer should be >> far superior, particularly given that the WAL is shipped over the >> network to slaves. > > For a reasonably brief description of InnoDB double write buffers, I > found this: > > http://www.mysqlperformanceblog.com/2006/08/04/innodb-double-write/ > > One big question before even considering this would by how to > determine whether a potentially torn page "is inconsistent". > Without a page CRC or some such mechanism, I don't see how this > technique is possible. > > Even if it's possible, it's far from clear to me that it would be an > improvement. The author estimates (apparently somewhat loosely) > that it's a 5% to 10% performance hit in InnoDB; I'm far from > certain that full_page_writes cost us that much. Does anyone have > benchmark numbers handy? > > -Kevin > Ignoring (briefly) the cost in terms of performance of the different system, not needing full_page_writes would make geographically dispersed replication possible for certain cases where it is not currently (or at least rather painful). -- Rob Wultsch wultsch@gmail.com
Rob Wultsch <wultsch@gmail.com> wrote: > not needing full_page_writes would make geographically dispersed > replication possible for certain cases where it is not currently > (or at least rather painful). Do you have any hard numbers on WAL file size impact? How much does pglesslog help in a file-based WAL transmission environment? Should we be considering similar filtering for streaming replication? -Kevin
On Fri, Oct 22, 2010 at 1:15 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: > Rob Wultsch <wultsch@gmail.com> wrote: > >> not needing full_page_writes would make geographically dispersed >> replication possible for certain cases where it is not currently >> (or at least rather painful). > > Do you have any hard numbers on WAL file size impact? How much does > pglesslog help in a file-based WAL transmission environment? Should > we be considering similar filtering for streaming replication? > > -Kevin > No, I am a DBA that mostly works on MySQL. I have had to deal with (handwaving...) tangential issues recently. I really would like to work with PG more and this seems like it would be a significant hindrance for certain usage patterns. Lots of replication does not take place over gig... -- Rob Wultsch wultsch@gmail.com
Rob Wultsch wrote: > I really would like to work with PG more and this seems like > [full_page_writes] would be a significant hindrance for certain > usage patterns. Lots of replication does not take place over gig... Certainly most of the Wisconsin State Courts replication takes place over WAN connections at a few Mbps. I haven't seen any evidence that having full_page_writes on has caused us problems, personally. In the PostgreSQL community you generally need to show some hard numbers from a repeatable test case for the community to believe that there's a problem which needs fixing, much less to buy in to some particular fix for the purported problem. On the other hand, if you can show that there actually *is* a problem, I've never seen a group which responds so quickly and effectively to solve it as the PostgreSQL community. Don't get too attached to a particular solution without proof that it's better than the alternatives, though.... -Kevin
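One repeatable way to produce such numbers would be to run the same pgbench workload with full_page_writes on and then off, comparing both the reported tps and the amount of WAL generated. A rough sketch using the tools of the 8.4/9.0 era discussed in this thread (pg_current_xlog_location() is the function name on those versions; turning the two positions into a byte count has to be done by hand there, and checkpoint settings should be held constant between runs since full-page images are only taken on the first touch of a page after a checkpoint):

    # Record the WAL position, run a fixed workload, record it again.
    psql -c "SELECT pg_current_xlog_location();"
    pgbench -c 16 -T 1800 pgbench_test
    psql -c "SELECT pg_current_xlog_location();"

    # Flip full_page_writes in postgresql.conf, pg_ctl reload, and repeat.
    # The difference between the positions gives the WAL volume for each
    # setting; pgbench's tps output gives the throughput cost.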
> Even if it's possible, it's far from clear to me that it would be an > improvement. The author estimates (apparently somewhat loosely) > that it's a 5% to 10% performance hit in InnoDB; I'm far from > certain that full_page_writes cost us that much. Does anyone have > benchmark numbers handy? It most certainly can, depending on your CPU saturation and I/O support. I've seen a 10% improvement in throughput from turning off full_page_writes on some machines, such as when we were doing the SpecJAppserver benchmarks on Solaris. -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
Kevin Grittner wrote: > On what do you base that assumption? I assume that we send a full > 8K to the OS cache, and the file system writes disk sectors > according to its own algorithm. With either platters or BBU cache, > the data is persisted on fsync; why do you see a risk with one but > not the other? > Surely 'the data is persisted sometime after our write and before the fsync returns, but may be written: - in small chunks - out of order - in an unpredictable way' When I looked at the internals of TokyoCabinet for example, the design was flawed but would be 'fairly robust' so long as mmap'd pages that were dirtied did not get persisted until msync, and were then persisted atomically.
James Mansion wrote: > When I looked at the internals of TokyoCabinet for example, the design > was flawed but > would be 'fairly robust' so long as mmap'd pages that were dirtied did > not get persisted > until msync, and were then persisted atomically. If TokyoCabinet presumes that's true and overwrites existing blocks with that assumption, it would land onto my list of databases I wouldn't trust to hold my TODO list. Flip off power to a server, and you have no idea what portion of the blocks sitting in the drive's cache actually made it to disk; that's not even guaranteed atomic to the byte level. Torn pages happen all the time unless you either a) put the entire write into a non-volatile cache before writing any of it, b) write and sync somewhere else first and then do a journaled filesystem pointer swap from the old page to the new one, or c) journal the whole write the way PostgreSQL does with full_page_writes and the WAL. The discussion here veered off over whether (a) was sufficiently satisfied just by having a RAID controller with battery backup, and what I concluded from the dive into the details is that it's definitely not true unless the filesystem block size exactly matches the database one. And even then, make sure you test heavily. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
Jesper Krogh wrote: > Can you point to some ZFS docs that say this is the case? I'd > be surprised > if it doesn't copy away the old block and replace it with the new one > in place. The > other behaviour would quite quickly lead to a hugely fragmented > filesystem that > performs next to uselessly, and ZFS doesn't seem to be in that category... http://all-unix.blogspot.com/2007/03/zfs-cow-and-relate-features.html "Blocks containing active data are never overwritten in place; instead, a new block is allocated, modified data is written to it, and then any metadata blocks referencing it are similarly read, reallocated, and written." http://opensolaris.org/jive/thread.jspa?messageID=19264 discusses how this interacts with the common types of hardware around: no guarantees with lying hard drives as always, but otherwise you're fine. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
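If someone does go the ZFS route, the dataset recordsize can at least be matched to the PostgreSQL page size, which is what most ZFS-for-databases tuning notes suggest; the pool and dataset names below are placeholders:

    # Check the current record size of the dataset holding the data directory.
    zfs get recordsize tank/pgdata

    # Match it to PostgreSQL's 8K page size (affects newly written blocks only).
    zfs set recordsize=8k tank/pgdata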
Greg Smith <greg@2ndquadrant.com> writes: > James Mansion wrote: >> When I looked at the internals of TokyoCabinet for example, the design >> was flawed but >> would be 'fairly robust' so long as mmap'd pages that were dirtied did >> not get persisted >> until msync, and were then persisted atomically. > If TokyoCabinet presumes that's true and overwrites existing blocks with > that assumption, it would land onto my list of databases I wouldn't > trust to hold my TODO list. Flip off power to a server, and you have no > idea what portion of the blocks sitting in the drive's cache actually > made it to disk; that's not even guaranteed atomic to the byte level. The other and probably worse problem is that there's no application control over how soon changes to mmap'd pages get to disk. An msync will flush them out, but the kernel is free to write dirty pages sooner. So if they're depending for consistency on writes not happening until msync, it's broken by design. (This is one of the big reasons we don't use mmap'd space for Postgres disk buffers.) regards, tom lane
On Oct 22, 2010, at 1:06 PM, Rob Wultsch wrote: > On Fri, Oct 22, 2010 at 12:05 PM, Kevin Grittner > <Kevin.Grittner@wicourts.gov> wrote: >> Rob Wultsch <wultsch@gmail.com> wrote: >> >>> I would think full_page_writes=off + double write buffer should be >>> far superior, particularly given that the WAL is shipped over the >>> network to slaves. >> >> For a reasonably brief description of InnoDB double write buffers, I >> found this: >> >> http://www.mysqlperformanceblog.com/2006/08/04/innodb-double-write/ >> >> One big question before even considering this would by how to >> determine whether a potentially torn page "is inconsistent". >> Without a page CRC or some such mechanism, I don't see how this >> technique is possible. >> >> Even if it's possible, it's far from clear to me that it would be an >> improvement. The author estimates (apparently somewhat loosely) >> that it's a 5% to 10% performance hit in InnoDB; I'm far from >> certain that full_page_writes cost us that much. Does anyone have >> benchmark numbers handy? >> >> -Kevin >> > > Ignoring (briefly) the cost in terms of performance of the different > system, not needing full_page_writes would make geographically > dispersed replication possible for certain cases where it is not > currently (or at least rather painful).. Am I missing something here? Can't the network replication traffic be partial pages, but the WAL log on the slave (and master) be full pages? In other words, the slave can transform a partial page update into a full page xlog entry. (optional) 1. Log partial pages received from master to disk. (not xlog, something else, useful to persist changes faster) 2. Read page from disk for update. 3. Log full page modification to xlog for local commit. 4. Update page in memory and write out to OS as usual. The lack of the full page from the master would mean that you have to do a read-modify-write rather than just overwrite, but I think that works fine if network bandwidth is your bottleneck. I don't know enough of the guts of Postgres to be certain, but it seems conceptually like this is possible. Also one could use lzo compression and get a likely factor of two space saving with small CPU cost. > > -- > Rob Wultsch > wultsch@gmail.com
On Fri, Oct 22, 2010 at 3:05 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: > Rob Wultsch <wultsch@gmail.com> wrote: > >> I would think full_page_writes=off + double write buffer should be >> far superior, particularly given that the WAL is shipped over the >> network to slaves. > > For a reasonably brief description of InnoDB double write buffers, I > found this: > > http://www.mysqlperformanceblog.com/2006/08/04/innodb-double-write/ > > One big question before even considering this would by how to > determine whether a potentially torn page "is inconsistent". > Without a page CRC or some such mechanism, I don't see how this > technique is possible. There are two sides to this problem: figuring out when to write a page to the double write buffer, and figuring out when to read it back from the double write buffer. The first seems easy: we just do it whenever we would XLOG a full page image. As to the second, when we write the page out to the double write buffer, we could also write to the double write buffer the LSN of the WAL record which depends on that full page image. Then, at the start of recovery, we scan the double write buffer and remember all those LSNs. When we reach one of them, we replay the full page image. The good thing about this is that it would reduce WAL volume; the bad thing about it is that it would probably mean doing two fsyncs where we only now do one. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Oct 26, 2010 at 5:41 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Oct 22, 2010 at 3:05 PM, Kevin Grittner > <Kevin.Grittner@wicourts.gov> wrote: >> Rob Wultsch <wultsch@gmail.com> wrote: >> >>> I would think full_page_writes=off + double write buffer should be >>> far superior, particularly given that the WAL is shipped over the >>> network to slaves. >> >> For a reasonably brief description of InnoDB double write buffers, I >> found this: >> >> http://www.mysqlperformanceblog.com/2006/08/04/innodb-double-write/ >> >> One big question before even considering this would by how to >> determine whether a potentially torn page "is inconsistent". >> Without a page CRC or some such mechanism, I don't see how this >> technique is possible. > > There are two sides to this problem: figuring out when to write a page > to the double write buffer, and figuring out when to read it back from > the double write buffer. The first seems easy: we just do it whenever > we would XLOG a full page image. As to the second, when we write the > page out to the double write buffer, we could also write to the double > write buffer the LSN of the WAL record which depends on that full page > image. Then, at the start of recovery, we scan the double write > buffer and remember all those LSNs. When we reach one of them, we > replay the full page image. > > The good thing about this is that it would reduce WAL volume; the bad > thing about it is that it would probably mean doing two fsyncs where > we only now do one. > The double write buffer is one of the few areas where InnoDB does more IO (in the form of fsynch's) than PG. InnoDB also has fuzzy checkpoints (which help to keep dirty pages in memory longer), buffering of writing out changes to secondary indexes, and recently tunable page level compression. Given that InnoDB is not shipping its logs across the wire, I don't think many users would really care if it used the double writer or full page writes approach to the redo log (other than the fact that the log files would be bigger). PG on the other hand *is* pushing its logs over the wire... -- Rob Wultsch wultsch@gmail.com
On Tue, Oct 26, 2010 at 10:13 AM, Rob Wultsch <wultsch@gmail.com> wrote: > The double write buffer is one of the few areas where InnoDB does more > IO (in the form of fsynch's) than PG. InnoDB also has fuzzy > checkpoints (which help to keep dirty pages in memory longer), > buffering of writing out changes to secondary indexes, and recently > tunable page level compression. Baron Schwartz was talking to me about this at Surge. I don't really understand how the fuzzy checkpoint stuff works, and I haven't been able to find a good description of it anywhere. How does it keep dirty pages in memory longer? Details on the other things you mention would be interesting to hear, too. > Given that InnoDB is not shipping its logs across the wire, I don't > think many users would really care if it used the double writer or > full page writes approach to the redo log (other than the fact that > the log files would be bigger). PG on the other hand *is* pushing its > logs over the wire... So how is InnoDB doing replication? Is there a second log just for that? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Oct 26, 2010 at 7:25 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Oct 26, 2010 at 10:13 AM, Rob Wultsch <wultsch@gmail.com> wrote: >> The double write buffer is one of the few areas where InnoDB does more >> IO (in the form of fsynch's) than PG. InnoDB also has fuzzy >> checkpoints (which help to keep dirty pages in memory longer), >> buffering of writing out changes to secondary indexes, and recently >> tunable page level compression. > > Baron Schwartz was talking to me about this at Surge. I don't really > understand how the fuzzy checkpoint stuff works, and I haven't been > able to find a good description of it anywhere. How does it keep > dirty pages in memory longer? Details on the other things you mention > would be interesting to hear, too. For checkpoint behavior: http://books.google.com/books?id=S_yHERPRZScC&pg=PA606&lpg=PA606&dq=fuzzy+checkpoint&source=bl&ots=JJrzRUKBGh&sig=UOMPsRy5E-YDgjAFkaSVn3dps_M&hl=en&ei=_k8yTOfeHYzZnAepyumLBA&sa=X&oi=book_result&ct=result&resnum=8&ved=0CEYQ6AEwBw#v=onepage&q=fuzzy%20checkpoint&f=false I would think that best case behavior "sharp" checkpoints with a large checkpoint_completion_target would have behavior similar to a fuzzy checkpoint. Insert (for innodb 1.1+ evidently it also does delete and purge) buffering: http://dev.mysql.com/doc/refman/5.5/en/innodb-insert-buffering.html For a recent ~800GB db I had to restore, the insert buffer saved 92% of io needed for secondary indexes. Compression: http://dev.mysql.com/doc/innodb-plugin/1.0/en/innodb-compression-internals.html For many workloads 50% compression results in negligible impact to performance. For certain workloads compression can help performance. Please note that InnoDB also has a non-tunable TOAST-like feature. >> Given that InnoDB is not shipping its logs across the wire, I don't >> think many users would really care if it used the double writer or >> full page writes approach to the redo log (other than the fact that >> the log files would be bigger). PG on the other hand *is* pushing its >> logs over the wire... > > So how is InnoDB doing replication? Is there a second log just for that? > The other log is the "binary log" and it is one of the biggest problems with MySQL. Running MySQL in such a way that the binary log stays in sync with the InnoDB redo has a very significant impact on performance. http://www.mysqlperformanceblog.com/2010/10/23/mysql-limitations-part-2-the-binary-log/ http://mysqlha.blogspot.com/2010/09/mysql-versus-mongodb-update-performance.html (check out the pretty graph) If you are going to West you should consider heading over to the Facebook office on Tuesday as the MySQL team is having something of an open house: http://www.facebook.com/event.php?eid=160712450628622 Mark Callaghan from the Facebook MySQL engineering team (and several members of their ops team, for that matter) understands InnoDB dramatically better than I do. -- Rob Wultsch wultsch@gmail.com
On Wed, Oct 27, 2010 at 12:41 AM, Rob Wultsch <wultsch@gmail.com> wrote: > On Tue, Oct 26, 2010 at 7:25 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Tue, Oct 26, 2010 at 10:13 AM, Rob Wultsch <wultsch@gmail.com> wrote: >>> The double write buffer is one of the few areas where InnoDB does more >>> IO (in the form of fsynch's) than PG. InnoDB also has fuzzy >>> checkpoints (which help to keep dirty pages in memory longer), >>> buffering of writing out changes to secondary indexes, and recently >>> tunable page level compression. >> >> Baron Schwartz was talking to me about this at Surge. I don't really >> understand how the fuzzy checkpoint stuff works, and I haven't been >> able to find a good description of it anywhere. How does it keep >> dirty pages in memory longer? Details on the other things you mention >> would be interesting to hear, too. > > For checkpoint behavior: > http://books.google.com/books?id=S_yHERPRZScC&pg=PA606&lpg=PA606&dq=fuzzy+checkpoint&source=bl&ots=JJrzRUKBGh&sig=UOMPsRy5E-YDgjAFkaSVn3dps_M&hl=en&ei=_k8yTOfeHYzZnAepyumLBA&sa=X&oi=book_result&ct=result&resnum=8&ved=0CEYQ6AEwBw#v=onepage&q=fuzzy%20checkpoint&f=false > > I would think that best case behavior "sharp" checkpoints with a large > checkpoint_completion_target would have behavior similar to a fuzzy > checkpoint. Well, under that definition of a fuzzy checkpoint, our checkpoints are fuzzy even with checkpoint_completion_target=0. What Baron seemed to be describing was a scheme whereby you could do what I might call partial checkpoints. IOW, you want to move the redo pointer without writing out ALL the dirty buffers in memory, so you write out the pages with the oldest LSNs and then move the redo pointer to the oldest LSN you have left. Except that doesn't quite work, because the page might have been dirtied at LSN X and then later updated again at LSN Y, and you still have to flush it to disk before moving the redo pointer to any value >X. So you work around that by maintaining a "first dirtied" LSN for each page as well as the current LSN. I'm not 100% sure that this is how it works or that it would work in PG, but even assuming that it is and does, I'm not sure what the benefit is over the checkpoint-spreading logic we have now. There might be some benefit in sorting the writes that we do, so that we can spread out the fsyncs. So, write all the blocks to a given file, fsync, and then repeat for each underlying data file that has at least one dirty block. But that's completely orthogonal to (and would actually be hindered by) the approach described in the preceding paragraph. > Insert (for innodb 1.1+ evidently it also does delete and purge) > buffering: > http://dev.mysql.com/doc/refman/5.5/en/innodb-insert-buffering.html We do something a bit like this for GIST indices. It would be interesting to see if it also has a benefit for btree indices. > For a recent ~800GB db I had to restore, the insert buffer saved 92% > of io needed for secondary indexes. > > Compression: > http://dev.mysql.com/doc/innodb-plugin/1.0/en/innodb-compression-internals.html > > For many workloads 50% compression results in negligible impact to > performance. For certain workloads compression can help performance. > Please note that InnoDB also has a non-tunable TOAST-like feature. Interesting. I am surprised this works well. It seems that this only works for pages that can be compressed by >=50%, which seems like it could result in a lot of CPU wasted on failed attempts to compress.
>>> Given that InnoDB is not shipping its logs across the wire, I don't >>> think many users would really care if it used the double writer or >>> full page writes approach to the redo log (other than the fact that >>> the log files would be bigger). PG on the other hand *is* pushing its >>> logs over the wire... >> >> So how is InnoDB doing replication? Is there a second log just for that? >> > > The other log is the "binary log" and it is one of the biggest > problems with MySQL. Running MySQL in such a way that the binary log > stays in sync with the InnoDB redo has a very significant impact on > performance. > http://www.mysqlperformanceblog.com/2010/10/23/mysql-limitations-part-2-the-binary-log/ > http://mysqlha.blogspot.com/2010/09/mysql-versus-mongodb-update-performance.html > (check out the pretty graph) Hmm. That seems kinda painful. Having to ship full page images over the wire doesn't seem so bad by comparison, though I'm not very happy about having to do that either. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
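To make the partial-checkpoint idea described above a bit more concrete, here is a toy model in C (purely illustrative: the BufferState structure, its fields, and the sample LSNs are invented for this sketch and bear no relation to PostgreSQL's actual buffer headers). Each dirty buffer remembers both the LSN that first dirtied it and the LSN of its latest update, and the redo pointer may only advance to the minimum first-dirtied LSN among the buffers that are still dirty.

/* Toy model of a "partial checkpoint": flush the buffers with the oldest
 * first-dirtied LSNs, then advance the redo pointer only as far as the
 * oldest first-dirtied LSN left among the remaining dirty buffers. */
#include <stdint.h>
#include <stdio.h>

typedef uint64_t LSN;

typedef struct
{
    int dirty;
    LSN first_dirtied;   /* LSN of the first change since the last flush */
    LSN last_updated;    /* LSN of the most recent change to the page    */
} BufferState;

/* The redo pointer must not pass the oldest first-dirtied LSN of any
 * buffer that has not been flushed yet, even if that buffer was updated
 * again much later (the "dirtied at X, updated again at Y" case above). */
static LSN
safe_redo_pointer(const BufferState *buf, int n, LSN current_redo)
{
    LSN limit = UINT64_MAX;

    for (int i = 0; i < n; i++)
        if (buf[i].dirty && buf[i].first_dirtied < limit)
            limit = buf[i].first_dirtied;

    return (limit == UINT64_MAX) ? current_redo : limit;
}

int
main(void)
{
    BufferState bufs[] = {
        {1, 100, 250},          /* dirtied at LSN 100, updated again at 250 */
        {1, 180, 180},
        {0, 0, 0},              /* clean buffer */
    };

    /* Prints 100; flushing the first buffer would let the pointer
     * advance to 180 on the next pass. */
    printf("redo pointer may advance to LSN %llu\n",
           (unsigned long long) safe_redo_pointer(bufs, 3, 90));
    return 0;
}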
On Wed, Oct 27, 2010 at 6:55 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Oct 27, 2010 at 12:41 AM, Rob Wultsch <wultsch@gmail.com> wrote: >> On Tue, Oct 26, 2010 at 7:25 AM, Robert Haas <robertmhaas@gmail.com> wrote: >>> On Tue, Oct 26, 2010 at 10:13 AM, Rob Wultsch <wultsch@gmail.com> wrote: >>>> The double write buffer is one of the few areas where InnoDB does more >>>> IO (in the form of fsynch's) than PG. InnoDB also has fuzzy >>>> checkpoints (which help to keep dirty pages in memory longer), >>>> buffering of writing out changes to secondary indexes, and recently >>>> tunable page level compression. >>> >>> Baron Schwartz was talking to me about this at Surge. I don't really >>> understand how the fuzzy checkpoint stuff works, and I haven't been >>> able to find a good description of it anywhere. How does it keep >>> dirty pages in memory longer? Details on the other things you mention >>> would be interesting to hear, too. >> >> For checkpoint behavior: >> http://books.google.com/books?id=S_yHERPRZScC&pg=PA606&lpg=PA606&dq=fuzzy+checkpoint&source=bl&ots=JJrzRUKBGh&sig=UOMPsRy5E-YDgjAFkaSVn3dps_M&hl=en&ei=_k8yTOfeHYzZnAepyumLBA&sa=X&oi=book_result&ct=result&resnum=8&ved=0CEYQ6AEwBw#v=onepage&q=fuzzy%20checkpoint&f=false >> >> I would think that best case behavior "sharp" checkpoints with a large >> checkpoint_completion_target would have behavior similar to a fuzzy >> checkpoint. > > Well, under that definition of a fuzzy checkpoint, our checkpoints are > fuzzy even with checkpoint_completion_target=0. > > What Baron seemed to be describing was a scheme whereby you could do > what I might call partial checkpoints. IOW, you want to move the redo > pointer without writing out ALL the dirty buffers in memory, so you > write out the pages with the oldest LSNs and then move the redo > pointer to the oldest LSN you have left. Except that doesn't quite > work, because the page might have been dirtied at LSN X and then later > updated again at LSN Y, and you still have to flush it to disk before > moving the redo pointer to any value >X. So you work around that by > maintaining a "first dirtied" LSN for each page as well as the current > LSN. > > I'm not 100% sure that this is how it works or that it would work in > PG, but even assuming that it is and does, I'm not sure what the > benefit is over the checkpoint-spreading logic we have now. There > might be some benefit in sorting the writes that we do, so that we can > spread out the fsyncs. So, write all the blocks to a give file, > fsync, and then repeat for each underlying data file that has at least > one dirty block. But that's completely orthogonal to (and would > actually be hindered by) the approach described in the preceding > paragraph. I wish I could answer your questions better. I am a power user that does not fully understand InnoDB internals. There are not all that many folks that have a very good understanding of InnoDB internals (given how well it works there is not all that much need). > >> Insert (for innodb 1.1+ evidently there is also does delete and purge) >> buffering: >> http://dev.mysql.com/doc/refman/5.5/en/innodb-insert-buffering.html > > We do something a bit like this for GIST indices. It would be > interesting to see if it also has a benefit for btree indices. > >> For a recent ~800GB db I had to restore, the insert buffer saved 92% >> of io needed for secondary indexes. 
>> >> Compression: >> http://dev.mysql.com/doc/innodb-plugin/1.0/en/innodb-compression-internals.html >> >> For many workloads 50% compression results in negligible impact to >> performance. For certain workloads compression can help performance. >> Please note that InnoDB also has non-tunable toast like feature. > > Interesting. I am surprised this works well. It seems that this only > works for pages that can be compressed by >=50%, which seems like it > could result in a lot of CPU wasted on failed attempts to compress. In my world, the spinning disk is almost always the bottleneck. Trading CPU for IO is almost always a good deal for me. > >>>> Given that InnoDB is not shipping its logs across the wire, I don't >>>> think many users would really care if it used the double writer or >>>> full page writes approach to the redo log (other than the fact that >>>> the log files would be bigger). PG on the other hand *is* pushing its >>>> logs over the wire... >>> >>> So how is InnoDB doing replication? Is there a second log just for that? >>> >> >> The other log is the "binary log" and it is one of the biggest >> problems with MySQL. Running MySQL in such a way that the binary log >> stays in sync with the InnoDB redo has a very significant impact on >> performance. >> http://www.mysqlperformanceblog.com/2010/10/23/mysql-limitations-part-2-the-binary-log/ >> http://mysqlha.blogspot.com/2010/09/mysql-versus-mongodb-update-performance.html >> (check out the pretty graph) > > Hmm. That seems kinda painful. Having to ship full page images over > the wire doesn't seems so bad by comparison, though I'm not very happy > about having to do that either. > The binary log is less than ideal, but with MySQL replication I can replicate to *many* servers that are *very* geographically distributed without all that many headaches. In addition, it is simple enough that I can have a junior DBA manage it. I have doubts that I could make PG do the same anywhere near as easily, particularly given how long and narrow some pipes are... -- Rob Wultsch wultsch@gmail.com
Tom Lane wrote: > The other and probably worse problem is that there's no application > control over how soon changes to mmap'd pages get to disk. An msync > will flush them out, but the kernel is free to write dirty pages sooner. > So if they're depending for consistency on writes not happening until > msync, it's broken by design. (This is one of the big reasons we don't > use mmap'd space for Postgres disk buffers.) > Well, I agree that it sucks for the reason you give - but you use write, and that's *exactly* the same in terms of when it gets written as when you update a byte on an mmap'd page. And you're quite happy to use write. The only difference is that it's a lot more explicit where the point of 'maybe it's written and maybe it isn't' occurs. There need be no real difference in the architecture for one over the other: there does seem to be evidence that write and read can have better read-ahead and write-behind behaviour, because read/write does allow you to initiate an IO with a hint of a size that exceeds a hardware page. And yes, after getting into the details while starting to port TC to Windows, I decided to bin it. Especially handy that SQLite3 has WAL now. (And one last dig - TC didn't even have a checksum that would let you tell when it had been broken; but it might all be fixed now, of course - I don't have time to check.) James
James Mansion <james@mansionfamily.plus.com> writes: > Tom Lane wrote: >> The other and probably worse problem is that there's no application >> control over how soon changes to mmap'd pages get to disk. An msync >> will flush them out, but the kernel is free to write dirty pages sooner. >> So if they're depending for consistency on writes not happening until >> msync, it's broken by design. (This is one of the big reasons we don't >> use mmap'd space for Postgres disk buffers.) > Well, I agree that it sucks for the reason you give - but you use > write and that's *exactly* the same in terms of when it gets written, > as when you update a byte on an mmap'd page. Uh, no, it is not. The difference is that we can update a byte in a shared buffer, and know that it *isn't* getting written out before we say so. If the buffer were mmap'd then we'd have no control over that, which makes it mighty hard to obey the WAL "write log before data" paradigm. It's true that we don't know whether write() causes an immediate or delayed disk write, but we generally don't care that much. What we do care about is being able to ensure that a WAL write happens before the data write, and with mmap we don't have control over that. regards, tom lane
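As a minimal sketch of the ordering described here (the file names, record format, and error handling are invented for illustration; this is nothing like the real XLOG code), the point is that with write() the dirty page stays in the program's own buffer until after the WAL record has been fsync'd, whereas a page in an mmap'd region could be flushed by the kernel at any moment after it is touched.

/* Sketch of "WAL before data" with explicit write()/fsync() calls.
 * The page is modified only in a private buffer, so nothing reaches
 * the kernel (or the disk) until we choose to write it - which we do
 * only after the WAL record is safely flushed.  If the page lived in
 * an mmap'd region, the kernel would be free to write the half-updated
 * page out at any time before the WAL fsync completed. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define PAGE_SIZE 8192

static void
write_all(int fd, const void *buf, size_t len, off_t off)
{
    if (pwrite(fd, buf, len, off) != (ssize_t) len)
    {
        perror("pwrite");
        exit(1);
    }
}

int
main(void)
{
    char page[PAGE_SIZE];               /* stand-in for a shared buffer */
    char walrec[128];                   /* stand-in for a WAL record    */
    int  wal_fd = open("wal.demo", O_CREAT | O_WRONLY, 0600);
    int  data_fd = open("data.demo", O_CREAT | O_WRONLY, 0600);

    if (wal_fd < 0 || data_fd < 0)
    {
        perror("open");
        return 1;
    }

    /* 1. Modify the page in our buffer only; the kernel sees nothing yet. */
    memset(page, 'x', sizeof(page));
    snprintf(walrec, sizeof(walrec), "page 0: filled with 'x'");

    /* 2. WAL first: write the log record and force it to stable storage. */
    write_all(wal_fd, walrec, sizeof(walrec), 0);
    if (fsync(wal_fd) != 0)
    {
        perror("fsync");
        return 1;
    }

    /* 3. Only now may the data page be handed to the kernel. */
    write_all(data_fd, page, sizeof(page), 0);

    close(wal_fd);
    close(data_fd);
    return 0;
}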
On Thu, Oct 28, 2010 at 5:26 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > James Mansion <james@mansionfamily.plus.com> writes: >> Tom Lane wrote: >>> The other and probably worse problem is that there's no application >>> control over how soon changes to mmap'd pages get to disk. An msync >>> will flush them out, but the kernel is free to write dirty pages sooner. >>> So if they're depending for consistency on writes not happening until >>> msync, it's broken by design. (This is one of the big reasons we don't >>> use mmap'd space for Postgres disk buffers.) > >> Well, I agree that it sucks for the reason you give - but you use >> write and that's *exactly* the same in terms of when it gets written, >> as when you update a byte on an mmap'd page. > > Uh, no, it is not. The difference is that we can update a byte in a > shared buffer, and know that it *isn't* getting written out before we > say so. If the buffer were mmap'd then we'd have no control over that, > which makes it mighty hard to obey the WAL "write log before data" > paradigm. > > It's true that we don't know whether write() causes an immediate or > delayed disk write, but we generally don't care that much. What we do > care about is being able to ensure that a WAL write happens before the > data write, and with mmap we don't have control over that. Well, we COULD keep the data in shared buffers, and then copy it into an mmap()'d region rather than calling write(), but I'm not sure there's any advantage to it. Managing address space mappings is a pain in the butt. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Oct 29, 2010 at 11:43 AM, Robert Haas <robertmhaas@gmail.com> wrote: > Well, we COULD keep the data in shared buffers, and then copy it into > an mmap()'d region rather than calling write(), but I'm not sure > there's any advantage to it. Managing address space mappings is a > pain in the butt. I could see this being a *theoretical* benefit in the case that the background writer gains the ability to write out all blocks associated with a file in order. In that case, you might get a win because you could get a single mmap of the entire file, and just wholesale memcpy blocks across, then sync/unmap it. This, of course, assumes a few things that must hold for it to be performant: 0) A list of blocks to be written, grouped by file, is readily available. 1) The pages you write to must be in the page cache, or your memcpy is going to fault them in. With a plain write, you don't need the over-written page in the cache. 2) Now, instead of the torn-page problem being FS block/sector size based, you can actually have a possibly arbitrary amount of the block memory written when the kernel writes out the page. You *really* need full-page writes. 3) The mmap overhead required for the kernel to set up the mappings is less than the repeated syscalls of a simple write(). All those things seem like something that somebody could synthetically benchmark to prove value before even trying to bolt into PostgreSQL. a. -- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
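Along those lines, a rough skeleton of such a synthetic test might look like the following (a sketch only: the file names and the 32 MB working size are arbitrary choices, it runs single-threaded, and it does not model points 0-2 above at all - it simply compares pwrite() against memcpy() into an mmap'd file, each followed by a flush).

/* Rough micro-benchmark skeleton: copy N 8 kB blocks into a file either
 * with pwrite() or by memcpy() into an mmap()'d mapping, and report the
 * elapsed time for each path.  Illustrative only, not a tuned benchmark. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

#define BLKSZ   8192
#define NBLOCKS 4096            /* 32 MB of data */

static double
elapsed(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int
main(void)
{
    static char blk[BLKSZ];
    struct timespec t0, t1;

    memset(blk, 'x', sizeof(blk));

    /* --- write() path ------------------------------------------------ */
    int fd = open("bench_write.dat", O_CREAT | O_RDWR | O_TRUNC, 0600);
    if (fd < 0) { perror("open"); return 1; }
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < NBLOCKS; i++)
        if (pwrite(fd, blk, BLKSZ, (off_t) i * BLKSZ) != BLKSZ)
            { perror("pwrite"); return 1; }
    fsync(fd);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("write():      %.3f s\n", elapsed(t0, t1));
    close(fd);

    /* --- mmap() + memcpy() path --------------------------------------- */
    fd = open("bench_mmap.dat", O_CREAT | O_RDWR | O_TRUNC, 0600);
    if (fd < 0 || ftruncate(fd, (off_t) NBLOCKS * BLKSZ) != 0)
        { perror("open/ftruncate"); return 1; }
    char *map = mmap(NULL, (size_t) NBLOCKS * BLKSZ,
                     PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) { perror("mmap"); return 1; }
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < NBLOCKS; i++)
        memcpy(map + (size_t) i * BLKSZ, blk, BLKSZ);
    msync(map, (size_t) NBLOCKS * BLKSZ, MS_SYNC);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("mmap+memcpy:  %.3f s\n", elapsed(t0, t1));
    munmap(map, (size_t) NBLOCKS * BLKSZ);
    close(fd);
    return 0;
}

Whether the mmap() path wins in such a test would depend heavily on whether the target pages are already in the page cache (point 1 above).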
Robert Haas <robertmhaas@gmail.com> writes: > On Thu, Oct 28, 2010 at 5:26 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> It's true that we don't know whether write() causes an immediate or >> delayed disk write, but we generally don't care that much. What we do >> care about is being able to ensure that a WAL write happens before the >> data write, and with mmap we don't have control over that. > Well, we COULD keep the data in shared buffers, and then copy it into > an mmap()'d region rather than calling write(), but I'm not sure > there's any advantage to it. Managing address space mappings is a > pain in the butt. In principle that ought to be right about the same speed as using write() to copy the data from shared buffers to kernel disk buffers, anyway. regards, tom lane
On Fri, Oct 29, 2010 at 11:56 AM, Aidan Van Dyk <aidan@highrise.ca> wrote: > 1) The pages you write to must be in the page cache, or your memcpy is > going to fault them in. With a plain write, you don't need the > over-written page in the cache. I seem to remember a time many years ago when I got bitten by this problem. The fact that our I/O is in 8K pages means this could be a pretty severe hit, I think. > 2) Now, instead of the torn-page problem being FS block/sector sized > base, you can now actually have a possibly arbitrary amount of the > block memory written when the kernel writes out the page. you > *really* need full-page-writes. Yeah. > 3) The mmap overhead required for the kernel to setup the mappings is > less than the repeated syscalls of a simple write(). You'd expect to save something from that; but on the other hand, at least on 32-bit systems, there's a very limited number of 1GB files that can be simultaneously mapped into one address space, and it's a lot smaller than the number of file descriptors that you can have open. Rumor has it that cutting down the number of fds that can stay open simultaneously is pretty bad for performance, so cutting it down to a number you can count on one hand (maybe one finger) would probably be bad. Maybe on 64-bit it would be OK but it seems like an awful lot of complexity for at most a minor savings (and a pretty bad anti-savings if point #1 kicks in). Anyway this is all totally off-topic... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Excerpts from Greg Smith's message of jue oct 21 14:04:17 -0300 2010: > What I would like to do is beef up the documentation with some concrete > examples of how to figure out if your cache and associated write path > are working reliably or not. It should be possible to include "does > this handle full page writes correctly?" in that test suite. Until we > have something like that, I'm concerned that bugs in filesystem or > controller handling may make full_page_writes unsafe even with a BBU, > and we'd have no way for people to tell if that's true or not. I think if you assume that there are bugs in the filesystem which you need to protect against, you are already hosed. I imagine there must be some filesystem bug that makes it safe to have full_page_writes=on, but unsafe to have full_page_writes=off; but I'd probably discard those as a rare minority and thus not worth worrying about. I agree it would be worth testing though. -- Álvaro Herrera <alvherre@commandprompt.com> The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Fri, 29 Oct 2010, Robert Haas wrote: > On Thu, Oct 28, 2010 at 5:26 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> James Mansion <james@mansionfamily.plus.com> writes: >>> Tom Lane wrote: >>>> The other and probably worse problem is that there's no application >>>> control over how soon changes to mmap'd pages get to disk. An msync >>>> will flush them out, but the kernel is free to write dirty pages sooner. >>>> So if they're depending for consistency on writes not happening until >>>> msync, it's broken by design. (This is one of the big reasons we don't >>>> use mmap'd space for Postgres disk buffers.) >> >>> Well, I agree that it sucks for the reason you give - but you use >>> write and that's *exactly* the same in terms of when it gets written, >>> as when you update a byte on an mmap'd page. >> >> Uh, no, it is not. The difference is that we can update a byte in a >> shared buffer, and know that it *isn't* getting written out before we >> say so. If the buffer were mmap'd then we'd have no control over that, >> which makes it mighty hard to obey the WAL "write log before data" >> paradigm. >> >> It's true that we don't know whether write() causes an immediate or >> delayed disk write, but we generally don't care that much. What we do >> care about is being able to ensure that a WAL write happens before the >> data write, and with mmap we don't have control over that. > > Well, we COULD keep the data in shared buffers, and then copy it into > an mmap()'d region rather than calling write(), but I'm not sure > there's any advantage to it. Managing address space mappings is a > pain in the butt. keep in mind that you have no way of knowing what order the data in the mmap region gets written out to disk. David Lang
Tom Lane wrote: > Uh, no, it is not. The difference is that we can update a byte in a > shared buffer, and know that it *isn't* getting written out before we > Well, I don't know where you got the idea I was referring to that sort of thing - it's the same as writing to a buffer before copying to the mmap'd area. > It's true that we don't know whether write() causes an immediate or > delayed disk write, but we generally don't care that much. What we do > Which is what I was referring to. > care about is being able to ensure that a WAL write happens before the > data write, and with mmap we don't have control over that. > > I think you have just the same control either way, because you can only force ordering with an appropriate explicit sync. In the absence of such a sync, all bets are off for whether/when each disk page is written out, and if you can't ensure that the controller and disk are write-through, you'd better do a hardware cache flush too, right? A shame that so many systems have relatively poor handling of that hardware flush.
On Fri, 29 Oct 2010, James Mansion wrote: > Tom Lane wrote: >> Uh, no, it is not. The difference is that we can update a byte in a >> shared buffer, and know that it *isn't* getting written out before we >> > Well, I don't know where yu got the idea I was refering to that sort of thing > - its > the same as writing to a buffer before copying to the mmap'd area. >> It's true that we don't know whether write() causes an immediate or >> delayed disk write, but we generally don't care that much. What we do >> > Which is what I was refering to. >> care about is being able to ensure that a WAL write happens before the >> data write, and with mmap we don't have control over that. >> >> > I think you have just the same control either way, because you can only force > ordering > with an appropriate explicit sync, and in the absence of such a sync all bets > are off for > whether/when each disk page is written out, and if you can't ensure that the > controller > and disk are write through you'd better do a hardware cache flush.too, right? > > A shame that so many systems have relatively poor handling of that hardware > flush. the issue is that when you update an mmap'd chunk of data, it could be written out immediately without you doing _anything_ (and thanks to multiple cores/threads, it could get written out while you are still in the middle of updating it). When you update an internal buffer and then write that, you know that nothing will hit the disk before you issue the write command. David Lang
Greg Smith wrote: > Kevin Grittner wrote: > > I assume that we send a full > > 8K to the OS cache, and the file system writes disk sectors > > according to its own algorithm. With either platters or BBU cache, > > the data is persisted on fsync; why do you see a risk with one but > > not the other > > I'd like a 10 minute argument please. I started to write something to > refute this, only to clarify in my head the sequence of events that > leads to the most questionable result, where I feel a bit less certain > than I did before of the safety here. Here is the worst case I believe > you're describing: > > 1) Transaction is written to the WAL and sync'd; client receives > COMMIT. Since full_page_writes is off, the data in the WAL consists > only of the delta of what changed on the page. > 2) 8K database page is written to OS cache > 3) PG calls fsync to force the database block out > 4) OS writes first 4K block of the change to the BBU write cache. Worst > case, this fills the cache, and it takes a moment for some random writes > to process before it has space to buffer again (makes this more likely > to happen, but it's not required to see the failure case here) > 5) Sudden power interruption, second half of the page write is lost > 6) Server restarts > 7) That 4K write is now replayed from the battery's cache > > At this point, you now have a torn 8K page, with 1/2 old and 1/2 new Based on this report, I think we need to update our documentation and backpatch removal of text that says that BBU users can safely turn off full-page writes. Patch attached. I think we have fallen into a trap I remember from the late 1990's where I was assuming that an 8k-block based file system would write to the disk atomically in 8k segments, which of course it cannot. My bet is that even if you write to the kernel in 8k pages, and have an 8k file system, the disk is still accessed via 512-byte blocks, even with a BBU. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +

diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index a2724fa..1e67bbd 100644
*** /tmp/pgrevert.14281/7sLqTb_wal.sgml Tue Nov 30 21:57:17 2010
--- doc/src/sgml/wal.sgml Tue Nov 30 21:56:49 2010
***************
*** 164,173 ****
      <productname>PostgreSQL</> periodically writes full page images to
      permanent WAL storage <emphasis>before</> modifying the actual page on
      disk. By doing this, during crash recovery <productname>PostgreSQL</> can
!     restore partially-written pages. If you have a battery-backed disk
!     controller or file-system software that prevents partial page writes
!     (e.g., ZFS), you can turn off this page imaging by turning off the
!     <xref linkend="guc-full-page-writes"> parameter.
    </para>
   </sect1>
--- 164,175 ----
      <productname>PostgreSQL</> periodically writes full page images to
      permanent WAL storage <emphasis>before</> modifying the actual page on
      disk. By doing this, during crash recovery <productname>PostgreSQL</> can
!     restore partially-written pages. If you have file-system software
!     that prevents partial page writes (e.g., ZFS), you can turn off
!     this page imaging by turning off the <xref
!     linkend="guc-full-page-writes"> parameter. Battery-Backed unit
!     (BBU) disk controllers do not prevent partial page writes unless
!     they guarantee that data is written to the BBU as full (8kB) pages.
    </para>
   </sect1>
Kevin Grittner wrote: > Greg Smith <greg@2ndquadrant.com> wrote: > > > I think Kevin's point here may be that if your fsync isn't > > reliable, you're always in trouble. But if your fsync is good, > > even torn pages should be repairable by the deltas written to the > > WAL > > I was actually just arguing that a BBU doesn't eliminate a risk > here; if there is a risk with production-quality disk drives, there > is a risk with a controller with a BBU cache. The BBU cache just > tends to reduce the window of time in which corruption can occur. I > wasn't too sure of *why* there was a risk, but Tom's post cleared > that up. > > I wonder why we need to expose this GUC at all -- perhaps it should > be off when fsync is off and on otherwise? Leaving it on without > fsync is just harming performance for not much benefit, and turning > it off with fsync seems to be saying that you are willing to > tolerate a known risk of database corruption, just not quite so much > as you have without fsync. In reality it seems most likely to be a > mistake, either way. According to our docs, and my submitted patch, if you are using ZFS then you can turn off full-page writes, so the full_page_writes parameter is still useful. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Greg Smith wrote: > Tom Lane wrote: > > You've got entirely too simplistic a view of what the "delta" might be, > > I fear. In particular there are various sorts of changes that involve > > inserting the data carried in the WAL record and shifting pre-existing > > data around to make room, or removing an item and moving remaining data > > around. If you try to replay that type of action against a torn page, > > you'll get corrupted results. > > > > I wasn't sure exactly how those were encoded, thanks for the > clarification. Given that, it seems to me there are only two situations > where full_page_writes is safe to turn off: > > 1) The operating system block size is exactly the same database block > size, and all writes are guaranteed to be atomic to that block size. Is that true? I have no idea. I thought everything was done at the 512-byte block level. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
> Is that true? I have no idea. I thought everything was done at the > 512-byte block level. Newer disks (2TB and up) can have 4k sectors, but this still means a page spans several sectors.
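For concreteness, the arithmetic is simple (a trivial illustration; 8 kB is PostgreSQL's default block size): even on a 4k-sector drive an 8 kB page is split across two sectors, and on a 512-byte-sector drive across sixteen, and any subset of those sector writes can survive a power failure.

/* How many drive sectors one 8 kB PostgreSQL page spans.  Any one of
 * those sector writes can complete while its neighbours are lost on
 * power failure - exactly the torn-page case full_page_writes covers. */
#include <stdio.h>

int
main(void)
{
    const int page_size = 8192;
    const int sector_sizes[] = {512, 4096};

    for (int i = 0; i < 2; i++)
        printf("%4d-byte sectors: an 8kB page spans %d sectors\n",
               sector_sizes[i], page_size / sector_sizes[i]);
    return 0;
}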
Pierre C wrote: > > > Is that true? I have no idea. I thought everything was done at the > > 512-byte block level. > > Newer disks (2TB and up) can have 4k sectors, but this still means a page > spans several sectors. Yes, I had heard about that. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Bruce Momjian wrote: > Greg Smith wrote: > > Kevin Grittner wrote: > > > I assume that we send a full > > > 8K to the OS cache, and the file system writes disk sectors > > > according to its own algorithm. With either platters or BBU cache, > > > the data is persisted on fsync; why do you see a risk with one but > > > not the other > > > > I'd like a 10 minute argument please. I started to write something to > > refute this, only to clarify in my head the sequence of events that > > leads to the most questionable result, where I feel a bit less certain > > than I did before of the safety here. Here is the worst case I believe > > you're describing: > > > > 1) Transaction is written to the WAL and sync'd; client receives > > COMMIT. Since full_page_writes is off, the data in the WAL consists > > only of the delta of what changed on the page. > > 2) 8K database page is written to OS cache > > 3) PG calls fsync to force the database block out > > 4) OS writes first 4K block of the change to the BBU write cache. Worst > > case, this fills the cache, and it takes a moment for some random writes > > to process before it has space to buffer again (makes this more likely > > to happen, but it's not required to see the failure case here) > > 5) Sudden power interruption, second half of the page write is lost > > 6) Server restarts > > 7) That 4K write is now replayed from the battery's cache > > > > At this point, you now have a torn 8K page, with 1/2 old and 1/2 new > > Based on this report, I think we need to update our documentation and > backpatch removal of text that says that BBU users can safely turn off > full-page writes. Patch attached. > > I think we have fallen into a trap I remember from the late 1990's where > I was assuming that an 8k-block based file system would write to the > disk atomically in 8k segments, which of course it cannot. My bet is > that even if you write to the kernel in 8k pages, and have an 8k file > system, the disk is still accessed via 512-byte blocks, even with a BBU. Doc patch applied. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +

diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index a2724fa..1e67bbd 100644
*** /tmp/pgrevert.14281/7sLqTb_wal.sgml Tue Nov 30 21:57:17 2010
--- doc/src/sgml/wal.sgml Tue Nov 30 21:56:49 2010
***************
*** 164,173 ****
      <productname>PostgreSQL</> periodically writes full page images to
      permanent WAL storage <emphasis>before</> modifying the actual page on
      disk. By doing this, during crash recovery <productname>PostgreSQL</> can
!     restore partially-written pages. If you have a battery-backed disk
!     controller or file-system software that prevents partial page writes
!     (e.g., ZFS), you can turn off this page imaging by turning off the
!     <xref linkend="guc-full-page-writes"> parameter.
    </para>
   </sect1>
--- 164,175 ----
      <productname>PostgreSQL</> periodically writes full page images to
      permanent WAL storage <emphasis>before</> modifying the actual page on
      disk. By doing this, during crash recovery <productname>PostgreSQL</> can
!     restore partially-written pages. If you have file-system software
!     that prevents partial page writes (e.g., ZFS), you can turn off
!     this page imaging by turning off the <xref
!     linkend="guc-full-page-writes"> parameter. Battery-Backed unit
!     (BBU) disk controllers do not prevent partial page writes unless
!     they guarantee that data is written to the BBU as full (8kB) pages.
    </para>
   </sect1>