Thread: BBU Cache vs. spindles

BBU Cache vs. spindles

From
Steve Crawford
Date:
I'm weighing options for a new server. In addition to PostgreSQL, this
machine will handle some modest Samba and Rsync load.

I will have enough RAM so that virtually all disk-read activity will be
cached. The average PostgreSQL read activity will be modest - a mix of
single-record and fairly large (reporting) result-sets. Writes will be
modest as well but will come in brief (1-5 second) bursts of individual
inserts. The rate of insert requests will hit 100-200/second for those
brief bursts.

So...

Am I likely to be better off putting $$$ toward battery-backup on the
RAID or toward adding a second RAID-set and splitting off the WAL
traffic? Or something else?

Cheers,
Steve


Re: BBU Cache vs. spindles

From
Ben Chobot
Date:
On Oct 7, 2010, at 4:38 PM, Steve Crawford wrote:

> I'm weighing options for a new server. In addition to PostgreSQL, this machine will handle some modest Samba and
> Rsync load.
>
> I will have enough RAM so that virtually all disk-read activity will be cached. The average PostgreSQL read activity
> will be modest - a mix of single-record and fairly large (reporting) result-sets. Writes will be modest as well but
> will come in brief (1-5 second) bursts of individual inserts. The rate of insert requests will hit 100-200/second for
> those brief bursts.
>
> So...
>
> Am I likely to be better off putting $$$ toward battery-backup on the RAID or toward adding a second RAID-set and
> splitting off the WAL traffic? Or something else?

A BBU is, what, $100 or so? Adding one seems a no-brainer to me. Dedicated WAL spindles are nice and all, but they're
still spinning media. Raid card cache is waaaay faster, and while it's best at bursty writes, it sounds like bursty
writes are precisely what you have.



Re: BBU Cache vs. spindles

From
Bruce Momjian
Date:
Ben Chobot wrote:
> On Oct 7, 2010, at 4:38 PM, Steve Crawford wrote:
>
> > I'm weighing options for a new server. In addition to PostgreSQL, this machine will handle some modest Samba and
> > Rsync load.
> >
> > I will have enough RAM so that virtually all disk-read activity will be cached. The average PostgreSQL read activity
> > will be modest - a mix of single-record and fairly large (reporting) result-sets. Writes will be modest as well but
> > will come in brief (1-5 second) bursts of individual inserts. The rate of insert requests will hit 100-200/second
> > for those brief bursts.
> >
> > So...
> >
> > Am I likely to be better off putting $$$ toward battery-backup on the RAID or toward adding a second RAID-set and
> > splitting off the WAL traffic? Or something else?
>
> A BBU is, what, $100 or so? Adding one seems a no-brainer to me.
> Dedicated WAL spindles are nice and all, but they're still spinning
> media. Raid card cache is waaaay faster, and while it's best at bursty
> writes, it sounds like bursty writes are precisely what you have.

Totally agree!

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +

Re: BBU Cache vs. spindles

From
"Joshua D. Drake"
Date:
On Wed, 2010-10-20 at 22:13 -0400, Bruce Momjian wrote:
> Ben Chobot wrote:
> > On Oct 7, 2010, at 4:38 PM, Steve Crawford wrote:
> >
> > > I'm weighing options for a new server. In addition to PostgreSQL, this machine will handle some modest Samba and
> > > Rsync load.
> > >
> > > I will have enough RAM so that virtually all disk-read activity will be cached. The average PostgreSQL read
> > > activity will be modest - a mix of single-record and fairly large (reporting) result-sets. Writes will be modest
> > > as well but will come in brief (1-5 second) bursts of individual inserts. The rate of insert requests will hit
> > > 100-200/second for those brief bursts.
> > >
> > > So...
> > >
> > > Am I likely to be better off putting $$$ toward battery-backup on the RAID or toward adding a second RAID-set and
> > > splitting off the WAL traffic? Or something else?
> >
> > A BBU is, what, $100 or so? Adding one seems a no-brainer to me.
> > Dedicated WAL spindles are nice and all, but they're still spinning
> > media. Raid card cache is waaaay faster, and while it's best at bursty
> > writes, it sounds like bursty writes are precisely what you have.
>
> Totally agree!

BBU first, more spindles second.

>
> --
>   Bruce Momjian  <bruce@momjian.us>        http://momjian.us
>   EnterpriseDB                             http://enterprisedb.com
>
>   + It's impossible for everything to be true. +
>

--
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 509.416.6579
Consulting, Training, Support, Custom Development, Engineering
http://twitter.com/cmdpromptinc | http://identi.ca/commandprompt


Re: BBU Cache vs. spindles

From
Scott Marlowe
Date:
On Wed, Oct 20, 2010 at 8:25 PM, Joshua D. Drake <jd@commandprompt.com> wrote:
> On Wed, 2010-10-20 at 22:13 -0400, Bruce Momjian wrote:
>> Ben Chobot wrote:
>> > On Oct 7, 2010, at 4:38 PM, Steve Crawford wrote:
>> >
>> > > I'm weighing options for a new server. In addition to PostgreSQL, this machine will handle some modest Samba and
>> > > Rsync load.
>> > >
>> > > I will have enough RAM so that virtually all disk-read activity will be cached. The average PostgreSQL read
>> > > activity will be modest - a mix of single-record and fairly large (reporting) result-sets. Writes will be modest
>> > > as well but will come in brief (1-5 second) bursts of individual inserts. The rate of insert requests will hit
>> > > 100-200/second for those brief bursts.
>> > >
>> > > So...
>> > >
>> > > Am I likely to be better off putting $$$ toward battery-backup on the RAID or toward adding a second RAID-set
>> > > and splitting off the WAL traffic? Or something else?
>> >
>> > A BBU is, what, $100 or so? Adding one seems a no-brainer to me.
>> > Dedicated WAL spindles are nice and all, but they're still spinning
>> > media. Raid card cache is waaaay faster, and while it's best at bursty
>> > writes, it sounds like bursty writes are precisely what you have.
>>
>> Totally agree!
>
> BBU first, more spindles second.

Agreed.  Note that while you can get incredible burst performance from
a battery-backed cache, due to both caching and writing out of order,
once throughput begins to saturate at the speed of the disk array the
BBU cache is really only re-ordering: it will eventually fill up faster
than the disks can absorb the writes, and you'll settle in at some
percentage of the max tps you get from a short benchmark run.  It's
vitally important that once you put a BBU cache in place, you run a
very long transactional test (pgbench is a simple one to start with)
that floods the I/O subsystem, so you can see what your average
throughput is with the WAL and data store getting flooded.  I know on
my system pgbench runs of a few minutes can be 3 or 4 times faster than
runs that last for the better part of an hour.
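
Something along these lines will show the difference (the flag values are
purely illustrative, and it assumes a scratch database named pgbench that
has been scaled well past RAM):

    createdb pgbench                        # scratch database
    pgbench -i -s 1000 pgbench              # initialize at a scale larger than RAM
    pgbench -c 16 -j 4 -T 300  pgbench      # 5 minutes: often flattered by the cache
    pgbench -c 16 -j 4 -T 3600 pgbench      # an hour: shows the sustained rate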

Re: BBU Cache vs. spindles

From
Steve Crawford
Date:
On 10/20/2010 09:45 PM, Scott Marlowe wrote:
> On Wed, Oct 20, 2010 at 8:25 PM, Joshua D. Drake<jd@commandprompt.com>  wrote:
>
>> On Wed, 2010-10-20 at 22:13 -0400, Bruce Momjian wrote:
>>
>>> Ben Chobot wrote:
>>>
>>>> On Oct 7, 2010, at 4:38 PM, Steve Crawford wrote:
>>>>
>>>>
>>>>> I'm weighing options for a new server. In addition to PostgreSQL, this machine will handle some modest Samba and
>>>>> Rsync load.
>>>>>
>>>>> I will have enough RAM so that virtually all disk-read activity will be cached. The average PostgreSQL read
>>>>> activity will be modest - a mix of single-record and fairly large (reporting) result-sets. Writes will be modest
>>>>> as well but will come in brief (1-5 second) bursts of individual inserts. The rate of insert requests will hit
>>>>> 100-200/second for those brief bursts.
>>>>>
>>>>> So...
>>>>>
>>>>> Am I likely to be better off putting $$$ toward battery-backup on the RAID or toward adding a second RAID-set and
>>>>> splitting off the WAL traffic? Or something else?
>>>>>
>>>> A BBU is, what, $100 or so? Adding one seems a no-brainer to me.
>>>> Dedicated WAL spindles are nice and all, but they're still spinning
>>>> media. Raid card cache is waaaay faster, and while it's best at bursty
>>>> writes, it sounds like bursty writes are precisely what you have.
>>>>
>>> Totally agree!
>>>
>> BBU first, more spindles second.
>>
> Agreed.  Note that while you can get incredible burst performance from
> a battery-backed cache, due to both caching and writing out of order,
> once throughput begins to saturate at the speed of the disk array the
> BBU cache is really only re-ordering: it will eventually fill up faster
> than the disks can absorb the writes, and you'll settle in at some
> percentage of the max tps you get from a short benchmark run.  It's
> vitally important that once you put a BBU cache in place, you run a
> very long transactional test (pgbench is a simple one to start with)
> that floods the I/O subsystem, so you can see what your average
> throughput is with the WAL and data store getting flooded.  I know on
> my system pgbench runs of a few minutes can be 3 or 4 times faster than
> runs that last for the better part of an hour.
>
>
Thanks for all the replies. This is what I suspected but since I can't
just buy one of everything to try, I wanted a sanity-check before
spending the $$$.

I am not too worried about saturating the controller cache as the
current much lower spec machine can handle the sustained load just fine
and the bursts are typically only 1-3 seconds long spaced a minute or
more apart.

Cheers,
Steve




Re: BBU Cache vs. spindles

From
Bruce Momjian
Date:
Scott Marlowe wrote:
> On Wed, Oct 20, 2010 at 8:25 PM, Joshua D. Drake <jd@commandprompt.com> wrote:
> > On Wed, 2010-10-20 at 22:13 -0400, Bruce Momjian wrote:
> >> Ben Chobot wrote:
> >> > On Oct 7, 2010, at 4:38 PM, Steve Crawford wrote:
> >> >
> >> > > I'm weighing options for a new server. In addition to PostgreSQL, this machine will handle some modest Samba
> >> > > and Rsync load.
> >> > >
> >> > > I will have enough RAM so that virtually all disk-read activity will be cached. The average PostgreSQL read
> >> > > activity will be modest - a mix of single-record and fairly large (reporting) result-sets. Writes will be
> >> > > modest as well but will come in brief (1-5 second) bursts of individual inserts. The rate of insert requests
> >> > > will hit 100-200/second for those brief bursts.
> >> > >
> >> > > So...
> >> > >
> >> > > Am I likely to be better off putting $$$ toward battery-backup on the RAID or toward adding a second RAID-set
> >> > > and splitting off the WAL traffic? Or something else?
> >> >
> >> > A BBU is, what, $100 or so? Adding one seems a no-brainer to me.
> >> > Dedicated WAL spindles are nice and all, but they're still spinning
> >> > media. Raid card cache is waaaay faster, and while it's best at bursty
> >> > writes, it sounds like bursty writes are precisely what you have.
> >>
> >> Totally agree!
> >
> > BBU first, more spindles second.
>
> Agreed.  Note that while you can get incredible burst performance from
> a battery-backed cache, due to both caching and writing out of order,
> once throughput begins to saturate at the speed of the disk array the
> BBU cache is really only re-ordering: it will eventually fill up faster
> than the disks can absorb the writes, and you'll settle in at some
> percentage of the max tps you get from a short benchmark run.  It's
> vitally important that once you put a BBU cache in place, you run a
> very long transactional test (pgbench is a simple one to start with)
> that floods the I/O subsystem, so you can see what your average
> throughput is with the WAL and data store getting flooded.  I know on
> my system pgbench runs of a few minutes can be 3 or 4 times faster than
> runs that last for the better part of an hour.

With a BBU you can turn off full_page_writes, which should decrease the
WAL traffic.
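
The setting itself is just a postgresql.conf GUC picked up on reload; shown
here only as a sketch, and only worth considering if your write path really
does make torn pages impossible:

    # postgresql.conf -- sketch only; verify the whole write path first
    full_page_writes = off        # default is on; takes effect on pg_ctl reload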

However, I don't see this mentioned in our documentation.  Should I add
it?

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +

Re: BBU Cache vs. spindles

From
"Kevin Grittner"
Date:
Bruce Momjian <bruce@momjian.us> wrote:

> With a BBU you can turn off full_page_writes

My understanding is that that is not without risk.  What happens if
the WAL is written, there is a commit, but the data page has not yet
been written to the controller?  Don't we still have a torn page?

-Kevin

Re: BBU Cache vs. spindles

From
Bruce Momjian
Date:
Kevin Grittner wrote:
> Bruce Momjian <bruce@momjian.us> wrote:
>
> > With a BBU you can turn off full_page_writes
>
> My understanding is that that is not without risk.  What happens if
> the WAL is written, there is a commit, but the data page has not yet
> been written to the controller?  Don't we still have a torn page?

I don't see how full_page_writes affects pages that have not yet been
written to the controller.

full_page_writes is designed to guard against a partial write to a
device.  I don't think the RAID cache can be partially written to, and
the cache will not be cleared until the drive has fully written the
data to disk.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +

Re: BBU Cache vs. spindles

From
Greg Smith
Date:
Bruce Momjian wrote:
> With a BBU you can turn off full_page_writes, which should decrease the
> WAL traffic.
>
> However, I don't see this mentioned in our documentation.  Should I add
> it?
>

What I would like to do is beef up the documentation with some concrete
examples of how to figure out if your cache and associated write path
are working reliably or not.  It should be possible to include "does
this handle full page writes correctly?" in that test suite.  Until we
have something like that, I'm concerned that bugs in filesystem or
controller handling may make full_page_writes unsafe even with a BBU,
and we'd have no way for people to tell if that's true or not.

--
Greg Smith, 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services and Support  www.2ndQuadrant.us



Re: BBU Cache vs. spindles

From
"Kevin Grittner"
Date:
Bruce Momjian <bruce@momjian.us> wrote:

> full_page_writes is designed to guard against a partial write to a
> device.  I don't think the raid cache can be partially written to

So you're confident that an 8kB write to the controller will not be
done as a series of smaller atomic writes by the OS file system?

-Kevin

Re: BBU Cache vs. spindles

From
Greg Smith
Date:
Kevin Grittner wrote:
> Bruce Momjian <bruce@momjian.us> wrote:
>
>
>> full_page_writes is designed to guard against a partial write to a
>> device.  I don't think the raid cache can be partially written to
>>
>
> So you're confident that an 8kB write to the controller will not be
> done as a series of smaller atomic writes by the OS file system?

Sure, that happens.  But if the BBU has gotten an fsync call after the
8K write, it shouldn't return success until after all 8K are in its
cache.  That's why full_page_writes should be safe on a system with BBU
as Bruce is suggesting.  But I'd like to see some independent proof of
that fact, that includes some targeted tests users can run, before we
start recommending that practice.

--
Greg Smith, 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services and Support  www.2ndQuadrant.us



Re: BBU Cache vs. spindles

From
"Kevin Grittner"
Date:
Greg Smith <greg@2ndquadrant.com> wrote:
> Kevin Grittner wrote:

>> So you're confident that an 8kB write to the controller will not
>> be done as a series of smaller atomic writes by the OS file
>> system?
>
> Sure, that happens.  But if the BBU has gotten an fsync call after
> the 8K write, it shouldn't return success until after all 8K are
> in its cache.

I'm not concerned about an fsync after the controller has it; I'm
concerned about a system crash in the middle of writing an 8K page
to the controller.  Other than the expected *size* of the window of
time during which you're vulnerable, what does a BBU caching
controller buy you in this regard?  Can't the OS rearrange the
writes of disk sectors after the 8K page is written to the OS cache
so that the window might occasionally be rather large?

-Kevin

Re: BBU Cache vs. spindles

From
Bruce Momjian
Date:
Kevin Grittner wrote:
> Greg Smith <greg@2ndquadrant.com> wrote:
> > Kevin Grittner wrote:
>
> >> So you're confident that an 8kB write to the controller will not
> >> be done as a series of smaller atomic writes by the OS file
> >> system?
> >
> > Sure, that happens.  But if the BBU has gotten an fsync call after
> > the 8K write, it shouldn't return success until after all 8K are
> > in its cache.
>
> I'm not concerned about an fsync after the controller has it; I'm
> concerned about a system crash in the middle of writing an 8K page
> to the controller.  Other than the expected *size* of the window of
> time during which you're vulnerable, what does a BBU caching
> controller buy you in this regard?  Can't the OS rearrange the
> writes of disk sectors after the 8K page is written to the OS cache
> so that the window might occasionally be rather large?

If the write to the controller fails, the page is not flushed and PG
does not continue.  The fsync never happens, and hence PG stops.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +

Re: BBU Cache vs. spindles

From
"Kevin Grittner"
Date:
Bruce Momjian <bruce@momjian.us> wrote:

> If the write fails to the controller, the page is not flushed and
> PG does not continue.  If the write fails, the fsync never
> happens, and hence PG stops.

PG stops?  The case at issue is when the OS crashes or the plug is
pulled in the middle of writing a page.  I don't think PG will
normally have the option of a graceful stop after that.  To quote
the Fine Manual:

http://www.postgresql.org/docs/current/interactive/runtime-config-wal.html#GUC-FULL-PAGE-WRITES

| a page write that is in process during an operating system crash
| might be only partially completed, leading to an on-disk page that
| contains a mix of old and new data. The row-level change data
| normally stored in WAL will not be enough to completely restore
| such a page during post-crash recovery. Storing the full page
| image guarantees that the page can be correctly restored

Like I said, the only difference between the page being written to
platters and to a BBU cache that I can see is the average size of
the window of time in which you're vulnerable, not whether there is
a window.  I don't think you've really addressed that concern.

-Kevin

Re: BBU Cache vs. spindles

From
Bruce Momjian
Date:
Kevin Grittner wrote:
> Bruce Momjian <bruce@momjian.us> wrote:
>
> > If the write fails to the controller, the page is not flushed and
> > PG does not continue.  If the write fails, the fsync never
> > happens, and hence PG stops.
>
> PG stops?  This case at issue is when the OS crashes or the plug is
> pulled in the middle of writing a page.  I don't think PG will
> normally have the option of a graceful stop after that.  To quote
> the Fine Manual:

If the OS crashes during a write or fsync, we have not committed the
transaction.

>
> http://www.postgresql.org/docs/current/interactive/runtime-config-wal.html#GUC-FULL-PAGE-WRITES
>
> | a page write that is in process during an operating system crash
> | might be only partially completed, leading to an on-disk page that
> | contains a mix of old and new data. The row-level change data
> | normally stored in WAL will not be enough to completely restore
> | such a page during post-crash recovery. Storing the full page
> | image guarantees that the page can be correctly restored
>
> Like I said, the only difference between the page being written to
> platters and to a BBU cache that I can see is the average size of
> the window of time in which you're vulnerable, not whether there is
> a window.  I don't think you've really addressed that concern.

I assume we send a full 8k to the controller, and a failure during that
write is not registered as a write.  A disk drive is modifying permanent
storage so there is always the possibility of that failing.  I assume
the BBU just rewrites that after startup.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +

Re: BBU Cache vs. spindles

From
"Kevin Grittner"
Date:
Bruce Momjian <bruce@momjian.us> wrote:

> I assume we send a full 8k to the controller, and a failure during
> that write is not registered as a write.

On what do you base that assumption?  I assume that we send a full
8K to the OS cache, and the file system writes disk sectors
according to its own algorithm.  With either platters or BBU cache,
the data is persisted on fsync; why do you see a risk with one but
not the other?

-Kevin

Re: BBU Cache vs. spindles

From
Bruce Momjian
Date:
Kevin Grittner wrote:
> Bruce Momjian <bruce@momjian.us> wrote:
>
> > I assume we send a full 8k to the controller, and a failure during
> > that write is not registered as a write.
>
> On what do you base that assumption?  I assume that we send a full
> 8K to the OS cache, and the file system writes disk sectors
> according to its own algorithm.  With either platters or BBU cache,
> the data is persisted on fsync; why do you see a risk with one but
> not the other?

Now that is an interesting question.  We write 8k to the kernel, but the
kernel doesn't have to honor those write sizes, so while we probably
can't get a partial 512-byte block written to disk with a BBU (one that
isn't cleaned up by the BBU on reboot), we could get some 512-byte
blocks of an 8k page written and others not.

I agree you are right and a BBU does not mean you can safely turn off
full_page_writes.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +

Re: BBU Cache vs. spindles

From
Andres Freund
Date:
On Thursday 21 October 2010 21:42:06 Kevin Grittner wrote:
> Bruce Momjian <bruce@momjian.us> wrote:
> > I assume we send a full 8k to the controller, and a failure during
> > that write is not registered as a write.
>
> On what do you base that assumption?  I assume that we send a full
> 8K to the OS cache, and the file system writes disk sectors
> according to its own algorithm.  With either platters or BBU cache,
> the data is persisted on fsync; why do you see a risk with one but
> not the other?
At least on Linux, pages can certainly get written out in < 8kB batches
if you're under memory pressure.

Andres



Re: BBU Cache vs. spindles

From
Greg Smith
Date:
Kevin Grittner wrote:
> I assume that we send a full
> 8K to the OS cache, and the file system writes disk sectors
> according to its own algorithm.  With either platters or BBU cache,
> the data is persisted on fsync; why do you see a risk with one but
> not the other

I'd like a 10 minute argument please.  I started to write something to
refute this, only to clarify in my head the sequence of events that
leads to the most questionable result, where I feel a bit less certain
than I did before of the safety here.  Here is the worst case I believe
you're describing:

1) Transaction is written to the WAL and sync'd; client receives
COMMIT.  Since full_page_writes is off, the data in the WAL consists
only of the delta of what changed on the page.
2) 8K database page is written to OS cache
3) PG calls fsync to force the database block out
4) OS writes the first 4K block of the change to the BBU write cache.  Worst
case, this fills the cache, and it takes a moment for some random writes
to process before it has space to buffer again (that makes this failure
more likely to happen, but it's not required to see it)
5) Sudden power interruption, second half of the page write is lost
6) Server restarts
7) That 4K write is now replayed from the battery's cache

At this point, you now have a torn 8K page, with 1/2 old and 1/2 new
data.  Without a full page write in the WAL, is it always possible to
restore its correct state now?  In theory, I think you can.  Since the
delta in the WAL should overwrite all of the bytes that changed
between the old and new version of the page, applying it on top of any
of the four possible states here:

1) None of the data was written to the database page yet
2) The first 4K of data was written out
3) The second 4K of data was written out
4) All 8K was actually written out

should lead to the same result: an 8K page that includes the change that
was in the WAL but not yet on disk at the point when the crash happened.
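
A toy sketch of that reasoning (purely an illustration, and it rests
entirely on the assumption above that the delta carries every changed
byte; this is not how WAL replay actually works):

    # Toy illustration only; assumes the WAL delta records every changed byte.
    old = bytearray(b"A" * 8192)
    new = bytearray(old)
    new[100:108] = b"NEWDATA!"                       # pretend one field changed

    delta = [(i, new[i]) for i in range(8192) if old[i] != new[i]]

    def replay(page, delta):
        page = bytearray(page)
        for offset, byte in delta:
            page[offset] = byte
        return page

    torn_states = [
        bytes(old),                                  # 1) nothing written yet
        bytes(new[:4096] + old[4096:]),              # 2) first 4K written
        bytes(old[:4096] + new[4096:]),              # 3) second 4K written
        bytes(new),                                  # 4) all 8K written
    ]
    assert all(replay(state, delta) == new for state in torn_states)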

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support        www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books


Re: BBU Cache vs. spindles

From
Tom Lane
Date:
Greg Smith <greg@2ndquadrant.com> writes:
> At this point, you now have a torn 8K page, with 1/2 old and 1/2 new
> data.

Right.

> Without a full page write in the WAL, is it always possible to
> restore its original state now?  In theory, I think you do.  Since the
> delta in the WAL should be overwriting all of the bytes that changed
> between the old and new version of the page, applying it on top of any
> four possible states here:

You've got entirely too simplistic a view of what the "delta" might be,
I fear.  In particular there are various sorts of changes that involve
inserting the data carried in the WAL record and shifting pre-existing
data around to make room, or removing an item and moving remaining data
around.  If you try to replay that type of action against a torn page,
you'll get corrupted results.

We could possibly redefine the WAL records so that they weren't just the
minimum amount of data but carried every byte that'd changed on the
page, and then I think what you're envisioning would work.  But the
records would be a lot bulkier.  It's not clear to me that this would be
a net savings over the current design, particularly not if there's
a long interval between checkpoints.

            regards, tom lane

Re: BBU Cache vs. spindles

From
Greg Smith
Date:
Kevin Grittner wrote:
> With either platters or BBU cache,
> the data is persisted on fsync; why do you see a risk with one but
> not the other

Forgot to address this part.  The troublesome sequence if you don't have
a BBU is:

1) WAL data is written to the OS cache
2) PG calls fsync
3) Data is transferred into the drive's volatile, non-battery-backed cache
4) Drive lies about data being on disk, says fsync is done
5) That 8K data page is written out to the OS cache, also with fsync,
then onto the drive.  It says it has that too.
6) Due to current disk head location, 4KB of the data page gets written
out before it gets to the WAL data
7) System crashes

Now you're dead.  You've just torn a data page, but not written any of
the data to the WAL necessary to reconstruct any valid version of that page.

I think Kevin's point here may be that if your fsync isn't reliable,
you're always in trouble.  But if your fsync is good, even torn pages
should be repairable by the deltas written to the WAL, as I described in
the message I just sent before this one.  That's true regardless of
whether you achieved "non-lying fsync" with a BBU or by turning a
drive's write cache off.  There's nothing really special about the BBU
beyond it being the most common form of reliable write cache that
works.  You get the same properties at a slower rate with a drive that's
configured to never lie about writes.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support        www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books


Re: BBU Cache vs. spindles

From
"Kevin Grittner"
Date:
Greg Smith <greg@2ndquadrant.com> wrote:

> I think Kevin's point here may be that if your fsync isn't
> reliable, you're always in trouble.  But if your fsync is good,
> even torn pages should be repairable by the deltas written to the
> WAL

I was actually just arguing that a BBU doesn't eliminate a risk
here; if there is a risk with production-quality disk drives, there
is a risk with a controller with a BBU cache.  The BBU cache just
tends to reduce the window of time in which corruption can occur.  I
wasn't too sure of *why* there was a risk, but Tom's post cleared
that up.

I wonder why we need to expose this GUC at all -- perhaps it should
be off when fsync is off and on otherwise?  Leaving it on without
fsync is just harming performance for not much benefit, and turning
it off with fsync on seems to be saying that you are willing to
tolerate a known risk of database corruption, just not quite so much
risk as you have without fsync.  In reality it seems most likely to be a
mistake, either way.

-Kevin

Re: BBU Cache vs. spindles

From
Greg Smith
Date:
Tom Lane wrote:
> You've got entirely too simplistic a view of what the "delta" might be,
> I fear.  In particular there are various sorts of changes that involve
> inserting the data carried in the WAL record and shifting pre-existing
> data around to make room, or removing an item and moving remaining data
> around.  If you try to replay that type of action against a torn page,
> you'll get corrupted results.
>

I wasn't sure exactly how those were encoded, thanks for the
clarification.  Given that, it seems to me there are only two situations
where full_page_writes is safe to turn off:

1) The operating system block size is exactly the same as the database
block size, and all writes are guaranteed to be atomic to that block size.

2) You're using a form of journaled filesystem where data blocks are
never updated, they're always written elsewhere and the filesystem is
redirected to that new block once it's on disk.

Looks to me like whether or not there's a non-volatile write cache
sitting in the middle, like a BBU protected RAID card, doesn't really
make any difference here then.

I think that most people who have thought they were safe to turn off
full_page_writes in the past did so because they believed they were in
category (1) here.  I've never advised anyone to do that, because it's
so difficult to validate the truth of.  Just given that, I'd be tempted
to join in on suggesting this parameter just go away in the name of
safety, except that I think category (2) here is growing now.  ZFS is
the most obvious example where the atomic write implementation seems to
always make disabling full_page_writes safe.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support        www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books


Re: BBU Cache vs. spindles

From
Jesper Krogh
Date:
On 2010-10-22 17:37, Greg Smith wrote:
> I think that most people who have thought they were safe to turn off
> full_page_writes in the past did so because they believed they were
> in category (1) here.  I've never advised anyone to do that, because
> it's so difficult to validate the truth of.  Just given that, I'd be
> tempted to join in on suggesting this parameter just go away in the
> name of safety, except that I think category (2) here is growing now.
> ZFS is the most obvious example where the atomic write implementation
> seems to always make disabling full_page_writes safe.

Can you point to some ZFS docs that say this is the case?  I'd be surprised
if it doesn't copy the old block away and replace it with the new one in-place. The
other behaviour would quite quickly lead to a hugely fragmented filesystem that
performs next to uselessly, and ZFS doesn't seem to be in that category...

  ...  All given my total lack of insight into ZFS.

--
Jesper



Re: BBU Cache vs. spindles

From
Rob Wultsch
Date:
On Fri, Oct 22, 2010 at 8:37 AM, Greg Smith <greg@2ndquadrant.com> wrote:
> Tom Lane wrote:
>>
>> You've got entirely too simplistic a view of what the "delta" might be,
>> I fear.  In particular there are various sorts of changes that involve
>> inserting the data carried in the WAL record and shifting pre-existing
>> data around to make room, or removing an item and moving remaining data
>> around.  If you try to replay that type of action against a torn page,
>> you'll get corrupted results.
>>
>
> I wasn't sure exactly how those were encoded, thanks for the clarification.
>  Given that, it seems to me there are only two situations where
> full_page_writes is safe to turn off:
>
> 1) The operating system block size is exactly the same database block size,
> and all writes are guaranteed to be atomic to that block size.
> 2) You're using a form of journaled filesystem where data blocks are never
> updated, they're always written elsewhere and the filesystem is redirected
> to that new block once it's on disk.
>
> Looks to me like whether or not there's a non-volatile write cache sitting
> in the middle, like a BBU protected RAID card, doesn't really make any
> difference here then.
>
> I think that most people who have thought they were safe to turn off
> full_page_writes in the past did so because they believed they were in
> category (1) here.  I've never advised anyone to do that, because it's so
> difficult to validate the truth of.  Just given that, I'd be tempted to join
> in on suggesting this parameter just go away in the name of safety, except
> that I think category (2) here is growing now.  ZFS is the most obvious
> example where the atomic write implementation seems to always make disabling
> full_page_writes safe.
>

For the sake of argument, has PG considered using a double write
buffer similar to InnoDB?


--
Rob Wultsch
wultsch@gmail.com

Re: BBU Cache vs. spindles

From
"Kevin Grittner"
Date:
Rob Wultsch <wultsch@gmail.com> wrote:

> has PG considered using a double write buffer similar to InnoDB?

That seems inferior to the full_page_writes strategy, where you only
write a page twice the first time it is written after a checkpoint.
We're talking about when we might be able to write *less*, not more.

-Kevin

Re: BBU Cache vs. spindles

From
Rob Wultsch
Date:
On Fri, Oct 22, 2010 at 10:28 AM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
> Rob Wultsch <wultsch@gmail.com> wrote:
>
>> has PG considered using a double write buffer similar to InnoDB?
>
> That seems inferior to the full_page_writes strategy, where you only
> write a page twice the first time it is written after a checkpoint.
> We're talking about when we might be able to write *less*, not more.
>
> -Kevin
>

By "write" do you mean number of writes, or the number of bytes of the
writes? For number of writes, yes a double write buffer will lose. In
terms of number of bytes, I would think full_page_writes=off + double
write buffer should be far superior, particularly given that the WAL
is shipped over the network to slaves.

--
Rob Wultsch
wultsch@gmail.com

Re: BBU Cache vs. spindles

From
"Kevin Grittner"
Date:
Rob Wultsch <wultsch@gmail.com> wrote:

> I would think full_page_writes=off + double write buffer should be
> far superior, particularly given that the WAL is shipped over the
> network to slaves.

For a reasonably brief description of InnoDB double write buffers, I
found this:

http://www.mysqlperformanceblog.com/2006/08/04/innodb-double-write/

One big question before even considering this would be how to
determine whether a potentially torn page "is inconsistent".
Without a page CRC or some such mechanism, I don't see how this
technique is possible.

Even if it's possible, it's far from clear to me that it would be an
improvement.  The author estimates (apparently somewhat loosely)
that it's a 5% to 10% performance hit in InnoDB; I'm far from
certain that full_page_writes cost us that much.  Does anyone have
benchmark numbers handy?
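
If anyone wants to generate some, a rough sketch of the comparison
(settings purely illustrative; the cost is largest right after a
checkpoint, since that's when full page images get written):

    # same load, full_page_writes on (the default) vs. off
    pgbench -c 8 -j 4 -T 1800 pgbench
    # edit postgresql.conf: full_page_writes = off, then: pg_ctl reload
    pgbench -c 8 -j 4 -T 1800 pgbench
    # compare the reported tps and the amount of WAL generated by each run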

-Kevin

Re: BBU Cache vs. spindles

From
Rob Wultsch
Date:
On Fri, Oct 22, 2010 at 12:05 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
> Rob Wultsch <wultsch@gmail.com> wrote:
>
>> I would think full_page_writes=off + double write buffer should be
>> far superior, particularly given that the WAL is shipped over the
>> network to slaves.
>
> For a reasonably brief description of InnoDB double write buffers, I
> found this:
>
> http://www.mysqlperformanceblog.com/2006/08/04/innodb-double-write/
>
> One big question before even considering this would be how to
> determine whether a potentially torn page "is inconsistent".
> Without a page CRC or some such mechanism, I don't see how this
> technique is possible.
>
> Even if it's possible, it's far from clear to me that it would be an
> improvement.  The author estimates (apparently somewhat loosely)
> that it's a 5% to 10% performance hit in InnoDB; I'm far from
> certain that full_page_writes cost us that much.  Does anyone have
> benchmark numbers handy?
>
> -Kevin
>

Ignoring (briefly) the cost in terms of performance of the different
system, not needing full_page_writes would make geographically
dispersed replication possible for certain cases where it is not
currently (or at least rather painful).

--
Rob Wultsch
wultsch@gmail.com

Re: BBU Cache vs. spindles

From
"Kevin Grittner"
Date:
Rob Wultsch <wultsch@gmail.com> wrote:

> not needing full_page_writes would make geographically dispersed
> replication possible for certain cases where it is not currently
> (or at least rather painful).

Do you have any hard numbers on WAL file size impact?  How much does
pglesslog help in a file-based WAL transmission environment?  Should
we be considering similar filtering for streaming replication?

-Kevin

Re: BBU Cache vs. spindles

From
Rob Wultsch
Date:
On Fri, Oct 22, 2010 at 1:15 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
> Rob Wultsch <wultsch@gmail.com> wrote:
>
>> not needing full_page_writes would make geographically dispersed
>> replication possible for certain cases where it is not currently
>> (or at least rather painful).
>
> Do you have any hard numbers on WAL file size impact?  How much does
> pglesslog help in a file-based WAL transmission environment?  Should
> we be considering similar filtering for streaming replication?
>
> -Kevin
>

No, I am a DBA who mostly works on MySQL. I have had to deal with
(handwaving...) tangential issues recently. I really would like to
work with PG more, and this seems like it would be a significant
hindrance for certain usage patterns. Lots of replication does not
take place over gig...


--
Rob Wultsch
wultsch@gmail.com

Re: BBU Cache vs. spindles

From
"Kevin Grittner"
Date:
Rob Wultsch  wrote:

> I really would like to work with PG more and this seems like
> [full_page_writes] would be a significant hindrance for certain
> usage patterns. Lots of replication does not take place over gig...

Certainly most of the Wisconsin State Courts replication takes place
over WAN connections at a few Mbps.  I haven't seen any evidence that
having full_page_writes on has caused us problems, personally.

In the PostgreSQL community you generally need to show some hard
numbers from a repeatable test case for the community to believe that
there's a problem which needs fixing, much less to buy in to some
particular fix for the purported problem.  On the other hand, if you
can show that there actually *is* a problem, I've never seen a group
which responds so quickly and effectively to solve it as the
PostgreSQL community.  Don't get too attached to a particular
solution without proof that it's better than the alternatives,
though....

-Kevin

Re: BBU Cache vs. spindles

From
Josh Berkus
Date:
> Even if it's possible, it's far from clear to me that it would be an
> improvement.  The author estimates (apparently somewhat loosely)
> that it's a 5% to 10% performance hit in InnoDB; I'm far from
> certain that full_page_writes cost us that much.  Does anyone have
> benchmark numbers handy?

It most certainly can, depending on your CPU saturation and I/O support.
   I've seen a 10% improvement in throughput from turning off
full_page_writes on some machines, such as when we were doing the
SpecJAppserver benchmarks on Solaris.

--
                                   -- Josh Berkus
                                      PostgreSQL Experts Inc.
                                      http://www.pgexperts.com

Re: BBU Cache vs. spindles

From
James Mansion
Date:
Kevin Grittner wrote:
> On what do you base that assumption?  I assume that we send a full
> 8K to the OS cache, and the file system writes disk sectors
> according to its own algorithm.  With either platters or BBU cache,
> the data is persisted on fsync; why do you see a risk with one but
> not the other?
>
Surely 'the data is persisted sometime after our write and before the
fsync returns, but may be written:
 - in small chunks
 - out of order
 - in an unpredictable way'

When I looked at the internals of TokyoCabinet, for example, the design
was flawed but would be 'fairly robust' so long as mmap'd pages that
were dirtied did not get persisted until msync, and were then persisted
atomically.


Re: BBU Cache vs. spindles

From
Greg Smith
Date:
James Mansion wrote:
> When I looked at the internals of TokyoCabinet for example, the design
> was flawed but
> would be 'fairly robust' so long as mmap'd pages that were dirtied did
> not get persisted
> until msync, and were then persisted atomically.

If TokyoCabinet presumes that's true and overwrites existing blocks with
that assumption, it would land onto my list of databases I wouldn't
trust to hold my TODO list.  Flip off power to a server, and you have no
idea what portion of the blocks sitting in the drive's cache actually
made it to disk; that's not even guaranteed atomic to the byte level.
Torn pages happen all the time unless you either a) put the entire write
into a non-volatile cache before writing any of it, b) write and sync
somewhere else first and then do a journaled filesystem pointer swap
from the old page to the new one, or c) journal the whole write the way
PostgreSQL does with full_page_writes and the WAL.  The discussion here
veered off over whether (a) was sufficiently satisfied just by having a
RAID controller with battery backup, and what I concluded from the dive
into the details is that it's definitely not true unless the filesystem
block size exactly matches the database one.  And even then, make sure
you test heavily.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support        www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books


Re: BBU Cache vs. spindles

From
Greg Smith
Date:
Jesper Krogh wrote:
> Can you point to some ZFS docs that say this is the case?  I'd
> be surprised
> if it doesn't copy the old block away and replace it with the new one
> in-place. The
> other behaviour would quite quickly lead to a hugely fragmented
> filesystem that
> performs next to uselessly, and ZFS doesn't seem to be in that category...

http://all-unix.blogspot.com/2007/03/zfs-cow-and-relate-features.html

"Blocks containing active data are never overwritten in place; instead,
a new block is allocated, modified data is written to it, and then any
metadata blocks referencing it are similarly read, reallocated, and
written."

http://opensolaris.org/jive/thread.jspa?messageID=19264 discusses how
this interacts with the common types of hardware around:  no guarantees
with lying hard drives as always, but otherwise you're fine.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support        www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books


Re: BBU Cache vs. spindles

From
Tom Lane
Date:
Greg Smith <greg@2ndquadrant.com> writes:
> James Mansion wrote:
>> When I looked at the internals of TokyoCabinet for example, the design
>> was flawed but
>> would be 'fairly robust' so long as mmap'd pages that were dirtied did
>> not get persisted
>> until msync, and were then persisted atomically.

> If TokyoCabinet presumes that's true and overwrites existing blocks with
> that assumption, it would land onto my list of databases I wouldn't
> trust to hold my TODO list.  Flip off power to a server, and you have no
> idea what portion of the blocks sitting in the drive's cache actually
> made it to disk; that's not even guaranteed atomic to the byte level.

The other and probably worse problem is that there's no application
control over how soon changes to mmap'd pages get to disk.  An msync
will flush them out, but the kernel is free to write dirty pages sooner.
So if they're depending for consistency on writes not happening until
msync, it's broken by design.  (This is one of the big reasons we don't
use mmap'd space for Postgres disk buffers.)

            regards, tom lane

Re: BBU Cache vs. spindles

From
Scott Carey
Date:
On Oct 22, 2010, at 1:06 PM, Rob Wultsch wrote:

> On Fri, Oct 22, 2010 at 12:05 PM, Kevin Grittner
> <Kevin.Grittner@wicourts.gov> wrote:
>> Rob Wultsch <wultsch@gmail.com> wrote:
>>
>>> I would think full_page_writes=off + double write buffer should be
>>> far superior, particularly given that the WAL is shipped over the
>>> network to slaves.
>>
>> For a reasonably brief description of InnoDB double write buffers, I
>> found this:
>>
>> http://www.mysqlperformanceblog.com/2006/08/04/innodb-double-write/
>>
>> One big question before even considering this would be how to
>> determine whether a potentially torn page "is inconsistent".
>> Without a page CRC or some such mechanism, I don't see how this
>> technique is possible.
>>
>> Even if it's possible, it's far from clear to me that it would be an
>> improvement.  The author estimates (apparently somewhat loosely)
>> that it's a 5% to 10% performance hit in InnoDB; I'm far from
>> certain that full_page_writes cost us that much.  Does anyone have
>> benchmark numbers handy?
>>
>> -Kevin
>>
>
> Ignoring (briefly) the cost in terms of performance of the different
> system, not needing full_page_writes would make geographically
> dispersed replication possible for certain cases where it is not
> currently (or at least rather painful)..

Am I missing something here?

Can't the network replication traffic be partial pages, but the WAL log on the slave (and master) be full pages?  In
other words, the slave can transform a partial page update into a full page xlog entry.


(optional) 1. Log partial pages received from master to disk. (not xlog, something else, useful to persist changes
faster)
2. Read page from disk for update.
3. Log full page modification to xlog for local commit.
4. Update page in memory and write out to OS as usual.

The lack of the full page from the master would mean that you have to do a read-modify-write rather than just
overwrite, but I think that works fine if network bandwidth is your bottleneck.
I don't know enough of the guts of Postgres to be certain, but it seems conceptually like this is possible.
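
A conceptual sketch of that slave-side transform (names are made up for
illustration; this is not actual PostgreSQL code):

    def apply_partial_update(pages, xlog, update):
        # update: (page_id, offset, new_bytes) received from the master
        page_id, offset, new_bytes = update
        page = bytearray(pages[page_id])                 # 2) read the page from disk
        page[offset:offset + len(new_bytes)] = new_bytes
        xlog.append((page_id, bytes(page)))              # 3) log the *full* page locally
        pages[page_id] = bytes(page)                     # 4) write the updated page back out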

Also one could use lzo compression and get a likely factor of two space saving with small CPU cost.

>
> --
> Rob Wultsch
> wultsch@gmail.com
>


Re: BBU Cache vs. spindles

From
Robert Haas
Date:
On Fri, Oct 22, 2010 at 3:05 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
> Rob Wultsch <wultsch@gmail.com> wrote:
>
>> I would think full_page_writes=off + double write buffer should be
>> far superior, particularly given that the WAL is shipped over the
>> network to slaves.
>
> For a reasonably brief description of InnoDB double write buffers, I
> found this:
>
> http://www.mysqlperformanceblog.com/2006/08/04/innodb-double-write/
>
> One big question before even considering this would be how to
> determine whether a potentially torn page "is inconsistent".
> Without a page CRC or some such mechanism, I don't see how this
> technique is possible.

There are two sides to this problem: figuring out when to write a page
to the double write buffer, and figuring out when to read it back from
the double write buffer.  The first seems easy: we just do it whenever
we would XLOG a full page image.  As to the second, when we write the
page out to the double write buffer, we could also write to the double
write buffer the LSN of the WAL record which depends on that full page
image.  Then, at the start of recovery, we scan the double write
buffer and remember all those LSNs.  When we reach one of them, we
replay the full page image.
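
A rough sketch of that recovery-side bookkeeping (names and structures
invented for illustration; not actual PostgreSQL code):

    from dataclasses import dataclass
    from typing import Callable, Dict, List, Tuple

    @dataclass
    class WalRecord:
        lsn: int
        page_id: int
        apply_delta: Callable[[bytes], bytes]   # replays just the row-level change

    def recover(dwb_entries: List[Tuple[int, int, bytes]],
                wal_records: List[WalRecord],
                pages: Dict[int, bytes]) -> Dict[int, bytes]:
        # dwb_entries come from scanning the double write buffer at startup:
        # (lsn of the dependent WAL record, page id, full page image)
        saved = {lsn: (pid, img) for (lsn, pid, img) in dwb_entries}
        for rec in sorted(wal_records, key=lambda r: r.lsn):
            if rec.lsn in saved:
                pid, img = saved[rec.lsn]
                pages[pid] = img                        # restore the full image first
            pages[rec.page_id] = rec.apply_delta(pages[rec.page_id])
        return pages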

The good thing about this is that it would reduce WAL volume; the bad
thing about it is that it would probably mean doing two fsyncs where
we only now do one.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: BBU Cache vs. spindles

From
Rob Wultsch
Date:
On Tue, Oct 26, 2010 at 5:41 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Oct 22, 2010 at 3:05 PM, Kevin Grittner
> <Kevin.Grittner@wicourts.gov> wrote:
>> Rob Wultsch <wultsch@gmail.com> wrote:
>>
>>> I would think full_page_writes=off + double write buffer should be
>>> far superior, particularly given that the WAL is shipped over the
>>> network to slaves.
>>
>> For a reasonably brief description of InnoDB double write buffers, I
>> found this:
>>
>> http://www.mysqlperformanceblog.com/2006/08/04/innodb-double-write/
>>
>> One big question before even considering this would be how to
>> determine whether a potentially torn page "is inconsistent".
>> Without a page CRC or some such mechanism, I don't see how this
>> technique is possible.
>
> There are two sides to this problem: figuring out when to write a page
> to the double write buffer, and figuring out when to read it back from
> the double write buffer.  The first seems easy: we just do it whenever
> we would XLOG a full page image.  As to the second, when we write the
> page out to the double write buffer, we could also write to the double
> write buffer the LSN of the WAL record which depends on that full page
> image.  Then, at the start of recovery, we scan the double write
> buffer and remember all those LSNs.  When we reach one of them, we
> replay the full page image.
>
> The good thing about this is that it would reduce WAL volume; the bad
> thing about it is that it would probably mean doing two fsyncs where
> we only now do one.
>

The double write buffer is one of the few areas where InnoDB does more
IO (in the form of fsyncs) than PG. InnoDB also has fuzzy
checkpoints (which help to keep dirty pages in memory longer),
buffering of writes to secondary indexes, and, recently, tunable
page-level compression.

Given that InnoDB is not shipping its logs across the wire, I don't
think many users would really care if it used the double writer or
full page writes approach to the redo log (other than the fact that
the log files would be bigger). PG on the other hand *is* pushing its
logs over the wire...

--
Rob Wultsch
wultsch@gmail.com

Re: BBU Cache vs. spindles

From
Robert Haas
Date:
On Tue, Oct 26, 2010 at 10:13 AM, Rob Wultsch <wultsch@gmail.com> wrote:
> The double write buffer is one of the few areas where InnoDB does more
> IO (in the form of fsynch's) than PG. InnoDB also has fuzzy
> checkpoints (which help to keep dirty pages in memory longer),
> buffering of writing out changes to secondary indexes, and recently
> tunable page level compression.

Baron Schwartz was talking to me about this at Surge.  I don't really
understand how the fuzzy checkpoint stuff works, and I haven't been
able to find a good description of it anywhere.  How does it keep
dirty pages in memory longer?  Details on the other things you mention
would be interesting to hear, too.

> Given that InnoDB is not shipping its logs across the wire, I don't
> think many users would really care if it used the double writer or
> full page writes approach to the redo log (other than the fact that
> the log files would be bigger). PG on the other hand *is* pushing its
> logs over the wire...

So how is InnoDB doing replication?  Is there a second log just for that?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: BBU Cache vs. spindles

From
Rob Wultsch
Date:
On Tue, Oct 26, 2010 at 7:25 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Oct 26, 2010 at 10:13 AM, Rob Wultsch <wultsch@gmail.com> wrote:
>> The double write buffer is one of the few areas where InnoDB does more
>> IO (in the form of fsynch's) than PG. InnoDB also has fuzzy
>> checkpoints (which help to keep dirty pages in memory longer),
>> buffering of writing out changes to secondary indexes, and recently
>> tunable page level compression.
>
> Baron Schwartz was talking to me about this at Surge.  I don't really
> understand how the fuzzy checkpoint stuff works, and I haven't been
> able to find a good description of it anywhere.  How does it keep
> dirty pages in memory longer?  Details on the other things you mention
> would be interesting to hear, too.

For checkpoint behavior:

http://books.google.com/books?id=S_yHERPRZScC&pg=PA606&lpg=PA606&dq=fuzzy+checkpoint&source=bl&ots=JJrzRUKBGh&sig=UOMPsRy5E-YDgjAFkaSVn3dps_M&hl=en&ei=_k8yTOfeHYzZnAepyumLBA&sa=X&oi=book_result&ct=result&resnum=8&ved=0CEYQ6AEwBw#v=onepage&q=fuzzy%20checkpoint&f=false

I would think that, in the best case, "sharp" checkpoints with a large
checkpoint_completion_target would behave similarly to a fuzzy
checkpoint.

Insert buffering (for InnoDB 1.1+ there is evidently also delete and
purge buffering):
http://dev.mysql.com/doc/refman/5.5/en/innodb-insert-buffering.html

For a recent ~800GB db I had to restore, the insert buffer saved 92%
of io needed for secondary indexes.

Compression:
http://dev.mysql.com/doc/innodb-plugin/1.0/en/innodb-compression-internals.html

For many workloads 50% compression results in negligible impact to
performance. For certain workloads compression can help performance.
Please note that InnoDB also has a non-tunable TOAST-like feature.


>> Given that InnoDB is not shipping its logs across the wire, I don't
>> think many users would really care if it used the double writer or
>> full page writes approach to the redo log (other than the fact that
>> the log files would be bigger). PG on the other hand *is* pushing its
>> logs over the wire...
>
> So how is InnoDB doing replication?  Is there a second log just for that?
>

The other log is the "binary log" and it is one of the biggest
problems with MySQL. Running MySQL in such a way that the binary log
stays in sync with the InnoDB redo has a very significant impact on
performance.
http://www.mysqlperformanceblog.com/2010/10/23/mysql-limitations-part-2-the-binary-log/
http://mysqlha.blogspot.com/2010/09/mysql-versus-mongodb-update-performance.html
(check out the pretty graph)

If you are going to West you should consider heading over to the
Facebook office on Tuesday as the MySQL team is having something of an
open house:
http://www.facebook.com/event.php?eid=160712450628622

Mark Callaghan from the Facebook MySQL engineering team (and several
members of their ops team, for that matter) understands InnoDB
dramatically better than I do.

--
Rob Wultsch
wultsch@gmail.com

Re: BBU Cache vs. spindles

From
Robert Haas
Date:
On Wed, Oct 27, 2010 at 12:41 AM, Rob Wultsch <wultsch@gmail.com> wrote:
> On Tue, Oct 26, 2010 at 7:25 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Tue, Oct 26, 2010 at 10:13 AM, Rob Wultsch <wultsch@gmail.com> wrote:
>>> The double write buffer is one of the few areas where InnoDB does more
>>> IO (in the form of fsynch's) than PG. InnoDB also has fuzzy
>>> checkpoints (which help to keep dirty pages in memory longer),
>>> buffering of writing out changes to secondary indexes, and recently
>>> tunable page level compression.
>>
>> Baron Schwartz was talking to me about this at Surge.  I don't really
>> understand how the fuzzy checkpoint stuff works, and I haven't been
>> able to find a good description of it anywhere.  How does it keep
>> dirty pages in memory longer?  Details on the other things you mention
>> would be interesting to hear, too.
>
> For checkpoint behavior:
>
> http://books.google.com/books?id=S_yHERPRZScC&pg=PA606&lpg=PA606&dq=fuzzy+checkpoint&source=bl&ots=JJrzRUKBGh&sig=UOMPsRy5E-YDgjAFkaSVn3dps_M&hl=en&ei=_k8yTOfeHYzZnAepyumLBA&sa=X&oi=book_result&ct=result&resnum=8&ved=0CEYQ6AEwBw#v=onepage&q=fuzzy%20checkpoint&f=false
>
> I would think that, in the best case, "sharp" checkpoints with a large
> checkpoint_completion_target would behave similarly to a fuzzy
> checkpoint.

Well, under that definition of a fuzzy checkpoint, our checkpoints are
fuzzy even with checkpoint_completion_target=0.

What Baron seemed to be describing was a scheme whereby you could do
what I might call partial checkpoints.  IOW, you want to move the redo
pointer without writing out ALL the dirty buffers in memory, so you
write out the pages with the oldest LSNs and then move the redo
pointer to the oldest LSN you have left.  Except that doesn't quite
work, because the page might have been dirtied at LSN X and then later
updated again at LSN Y, and you still have to flush it to disk before
moving the redo pointer to any value >X.  So you work around that by
maintaining a "first dirtied" LSN for each page as well as the current
LSN.
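
To make that concrete, here is a rough sketch of the scheme just
described (invented names and structures; not actual PG or InnoDB code):

#include <stdint.h>
#include <stdbool.h>

typedef uint64_t LSN;

typedef struct Buffer
{
    bool dirty;
    LSN  first_dirty_lsn;   /* LSN at which the page first became dirty */
    LSN  last_dirty_lsn;    /* LSN of the most recent modification */
} Buffer;

/*
 * Partial checkpoint: flush only the buffers whose *first* dirty LSN is
 * older than some target, then advance the redo pointer to the oldest
 * first_dirty_lsn still left dirty (or to the current insert position if
 * none remain).  Writing a buffer also covers its later modifications,
 * which is why tracking only the current LSN of each page is not enough.
 */
static LSN
partial_checkpoint(Buffer *buffers, int nbuffers,
                   LSN flush_older_than, LSN current_insert_lsn)
{
    LSN new_redo = current_insert_lsn;

    for (int i = 0; i < nbuffers; i++)
    {
        Buffer *b = &buffers[i];

        if (!b->dirty)
            continue;

        if (b->first_dirty_lsn < flush_older_than)
        {
            /* write_and_fsync_buffer(b);  -- hypothetical I/O call */
            b->dirty = false;
        }
        else if (b->first_dirty_lsn < new_redo)
            new_redo = b->first_dirty_lsn;   /* still dirty, limits redo */
    }

    return new_redo;    /* safe new position for the redo pointer */
}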

I'm not 100% sure that this is how it works or that it would work in
PG, but even assuming that it is and does, I'm not sure what the
benefit is over the checkpoint-spreading logic we have now.  There
might be some benefit in sorting the writes that we do, so that we can
spread out the fsyncs.  So, write all the blocks to a given file,
fsync, and then repeat for each underlying data file that has at least
one dirty block.  But that's completely orthogonal to (and would
actually be hindered by) the approach described in the preceding
paragraph.

> Insert buffering (for InnoDB 1.1+ there is evidently also delete and purge
> buffering):
> http://dev.mysql.com/doc/refman/5.5/en/innodb-insert-buffering.html

We do something a bit like this for GIST indices.  It would be
interesting to see if it also has a benefit for btree indices.
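
For anyone who hasn't seen the idea before, a much-simplified sketch of
change buffering for a secondary index (hypothetical names; nothing like
the real InnoDB or GiST implementations): instead of reading and dirtying
the target leaf page on every insert, the change is queued and merged
later, when the leaf page gets read anyway.

#include <stddef.h>

typedef struct IndexChange
{
    unsigned long leaf_page;    /* target leaf page number */
    long          key;          /* index key to insert */
    long          heap_tid;     /* pointer back to the heap row */
} IndexChange;

#define CHANGE_BUF_SIZE 1024

static IndexChange change_buf[CHANGE_BUF_SIZE];
static size_t      change_buf_used;

/* On insert: just remember the change; no random read of the leaf page,
 * no index page dirtied right away. */
void
buffered_index_insert(unsigned long leaf_page, long key, long heap_tid)
{
    if (change_buf_used < CHANGE_BUF_SIZE)
    {
        change_buf[change_buf_used++] =
            (IndexChange){ leaf_page, key, heap_tid };
        return;
    }
    /* Buffer full: apply the change directly instead
     * (apply_to_leaf() below is hypothetical). */
}

/* When the leaf page is read for any other reason, merge the queued
 * changes for it and drop them from the buffer. */
void
merge_buffered_changes(unsigned long leaf_page)
{
    size_t i = 0;

    while (i < change_buf_used)
    {
        if (change_buf[i].leaf_page == leaf_page)
        {
            /* apply_to_leaf(leaf_page, &change_buf[i]);  -- hypothetical */
            change_buf[i] = change_buf[--change_buf_used];
        }
        else
            i++;
    }
}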

> For a recent ~800GB db I had to restore, the insert buffer saved 92%
> of io needed for secondary indexes.
>
> Compression:
> http://dev.mysql.com/doc/innodb-plugin/1.0/en/innodb-compression-internals.html
>
> For many workloads, 50% compression has a negligible impact on
> performance; for certain workloads, compression can even help performance.
> Please note that InnoDB also has a non-tunable TOAST-like feature.

Interesting.  I am surprised this works well.  It seems that this only
works for pages that can be compressed by >=50%, which seems like it
could result in a lot of CPU wasted on failed attempts to compress.
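
Roughly, I'd imagine the decision looks something like this (a
hypothetical sketch using zlib for illustration; not the actual InnoDB
code or page format):

#include <zlib.h>
#include <string.h>

#define PAGE_SIZE 16384             /* InnoDB's default page size */

/*
 * Try to fit the page into half its size.  If it doesn't fit, the page
 * has to be stored uncompressed and the compression work was wasted CPU.
 * 'out' must be at least PAGE_SIZE bytes.
 */
int
compress_page(const unsigned char *page, unsigned char *out,
              unsigned long *out_len)
{
    uLongf dest_len = PAGE_SIZE / 2;

    if (compress((Bytef *) out, &dest_len,
                 (const Bytef *) page, PAGE_SIZE) == Z_OK)
    {
        *out_len = dest_len;
        return 1;                   /* fits: store the compressed copy */
    }

    /* Z_BUF_ERROR: didn't fit in half a page; store it uncompressed. */
    memcpy(out, page, PAGE_SIZE);
    *out_len = PAGE_SIZE;
    return 0;
}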

>>> Given that InnoDB is not shipping its logs across the wire, I don't
>>> think many users would really care if it used the double writer or
>>> full page writes approach to the redo log (other than the fact that
>>> the log files would be bigger). PG on the other hand *is* pushing its
>>> logs over the wire...
>>
>> So how is InnoDB doing replication?  Is there a second log just for that?
>>
>
> The other log is the "binary log" and it is one of the biggest
> problems with MySQL. Running MySQL in such a way that the binary log
> stays in sync with the InnoDB redo has a very significant impact on
> performance.
> http://www.mysqlperformanceblog.com/2010/10/23/mysql-limitations-part-2-the-binary-log/
> http://mysqlha.blogspot.com/2010/09/mysql-versus-mongodb-update-performance.html
> (check out the pretty graph)

Hmm.  That seems kinda painful.  Having to ship full page images over
the wire doesn't seem so bad by comparison, though I'm not very happy
about having to do that either.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: BBU Cache vs. spindles

From
Rob Wultsch
Date:
On Wed, Oct 27, 2010 at 6:55 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Oct 27, 2010 at 12:41 AM, Rob Wultsch <wultsch@gmail.com> wrote:
>> On Tue, Oct 26, 2010 at 7:25 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> On Tue, Oct 26, 2010 at 10:13 AM, Rob Wultsch <wultsch@gmail.com> wrote:
>>>> The double write buffer is one of the few areas where InnoDB does more
>>>> IO (in the form of fsynch's) than PG. InnoDB also has fuzzy
>>>> checkpoints (which help to keep dirty pages in memory longer),
>>>> buffering of writing out changes to secondary indexes, and recently
>>>> tunable page level compression.
>>>
>>> Baron Schwartz was talking to me about this at Surge.  I don't really
>>> understand how the fuzzy checkpoint stuff works, and I haven't been
>>> able to find a good description of it anywhere.  How does it keep
>>> dirty pages in memory longer?  Details on the other things you mention
>>> would be interesting to hear, too.
>>
>> For checkpoint behavior:
>>
http://books.google.com/books?id=S_yHERPRZScC&pg=PA606&lpg=PA606&dq=fuzzy+checkpoint&source=bl&ots=JJrzRUKBGh&sig=UOMPsRy5E-YDgjAFkaSVn3dps_M&hl=en&ei=_k8yTOfeHYzZnAepyumLBA&sa=X&oi=book_result&ct=result&resnum=8&ved=0CEYQ6AEwBw#v=onepage&q=fuzzy%20checkpoint&f=false
>>
>> I would think that, in the best case, "sharp" checkpoints with a large
>> checkpoint_completion_target would behave similarly to a fuzzy
>> checkpoint.
>
> Well, under that definition of a fuzzy checkpoint, our checkpoints are
> fuzzy even with checkpoint_completion_target=0.
>
> What Baron seemed to be describing was a scheme whereby you could do
> what I might call partial checkpoints.  IOW, you want to move the redo
> pointer without writing out ALL the dirty buffers in memory, so you
> write out the pages with the oldest LSNs and then move the redo
> pointer to the oldest LSN you have left.  Except that doesn't quite
> work, because the page might have been dirtied at LSN X and then later
> updated again at LSN Y, and you still have to flush it to disk before
> moving the redo pointer to any value >X.  So you work around that by
> maintaining a "first dirtied" LSN for each page as well as the current
> LSN.
>
> I'm not 100% sure that this is how it works or that it would work in
> PG, but even assuming that it is and does, I'm not sure what the
> benefit is over the checkpoint-spreading logic we have now.  There
> might be some benefit in sorting the writes that we do, so that we can
> spread out the fsyncs.  So, write all the blocks to a given file,
> fsync, and then repeat for each underlying data file that has at least
> one dirty block.  But that's completely orthogonal to (and would
> actually be hindered by) the approach described in the preceding
> paragraph.

I wish I could answer your questions better. I am a power user that
does not fully understand InnoDB internals. There are not all that
many folks that have a very good understanding of InnoDB internals
(given how well it works there is not all that much need).

>
>> Insert buffering (for InnoDB 1.1+ there is evidently also delete and purge
>> buffering):
>> http://dev.mysql.com/doc/refman/5.5/en/innodb-insert-buffering.html
>
> We do something a bit like this for GIST indices.  It would be
> interesting to see if it also has a benefit for btree indices.
>
>> For a recent ~800GB db I had to restore, the insert buffer saved 92%
>> of io needed for secondary indexes.
>>
>> Compression:
>> http://dev.mysql.com/doc/innodb-plugin/1.0/en/innodb-compression-internals.html
>>
>> For many workloads, 50% compression has a negligible impact on
>> performance; for certain workloads, compression can even help performance.
>> Please note that InnoDB also has a non-tunable TOAST-like feature.
>
> Interesting.  I am surprised this works well.  It seems that this only
> works for pages that can be compressed by >=50%, which seems like it
> could result in a lot of CPU wasted on failed attempts to compress.

In my world, the spinning disk is almost always the bottleneck.
Trading CPU for IO is almost always a good deal for me.

>
>>>> Given that InnoDB is not shipping its logs across the wire, I don't
>>>> think many users would really care if it used the double writer or
>>>> full page writes approach to the redo log (other than the fact that
>>>> the log files would be bigger). PG on the other hand *is* pushing its
>>>> logs over the wire...
>>>
>>> So how is InnoDB doing replication?  Is there a second log just for that?
>>>
>>
>> The other log is the "binary log" and it is one of the biggest
>> problems with MySQL. Running MySQL in such a way that the binary log
>> stays in sync with the InnoDB redo has a very significant impact on
>> performance.
>> http://www.mysqlperformanceblog.com/2010/10/23/mysql-limitations-part-2-the-binary-log/
>> http://mysqlha.blogspot.com/2010/09/mysql-versus-mongodb-update-performance.html
>> (check out the pretty graph)
>
> Hmm.  That seems kinda painful.  Having to ship full page images over
> the wire doesn't seem so bad by comparison, though I'm not very happy
> about having to do that either.
>

The binary log is less than ideal, but with MySQL replication I can
replicate to *many* servers that are *very* geographically distributed
without all that many headaches. In addition, it is simple enough that
I can have a junior DBA manage it.  I have doubts that I could make PG
do the same anywhere near as easily, particularly given how long and
narrow some pipes are...


--
Rob Wultsch
wultsch@gmail.com

Re: BBU Cache vs. spindles

From
James Mansion
Date:
Tom Lane wrote:
> The other and probably worse problem is that there's no application
> control over how soon changes to mmap'd pages get to disk.  An msync
> will flush them out, but the kernel is free to write dirty pages sooner.
> So if they're depending for consistency on writes not happening until
> msync, it's broken by design.  (This is one of the big reasons we don't
> use mmap'd space for Postgres disk buffers.)
>
Well, I agree that it sucks for the reason you give - but you use write,
and that's *exactly* the same in terms of when it gets written as when
you update a byte on an mmap'd page.

And you're quite happy to use write.

The only difference is that it's a lot more explicit where the point of
'maybe it's written and maybe it isn't' occurs.

There need be no real difference in the architecture for one over the
other: there does seem to be
evidence that write and read can have better forward-read and
write-behind behaviour, because
read/write does allow you to initiate an IO with a hint to a size that
exceeds a hardware page.

And yes, after getting into the details while starting to port TC to
Windows, I decided to bin
it.  Especially handy that SQLite3 has WAL now.  (And one last dig - TC
didn't even
have a checksum that would let you tell when it had been broken: but it
might all be fixed now
of course, I don't have time to check.)

James


Re: BBU Cache vs. spindles

From
Tom Lane
Date:
James Mansion <james@mansionfamily.plus.com> writes:
> Tom Lane wrote:
>> The other and probably worse problem is that there's no application
>> control over how soon changes to mmap'd pages get to disk.  An msync
>> will flush them out, but the kernel is free to write dirty pages sooner.
>> So if they're depending for consistency on writes not happening until
>> msync, it's broken by design.  (This is one of the big reasons we don't
>> use mmap'd space for Postgres disk buffers.)

> Well, I agree that it sucks for the reason you give - but you use
> write and that's *exactly* the same in terms of when it gets written,
> as when you update a byte on an mmap'd page.

Uh, no, it is not.  The difference is that we can update a byte in a
shared buffer, and know that it *isn't* getting written out before we
say so.  If the buffer were mmap'd then we'd have no control over that,
which makes it mighty hard to obey the WAL "write log before data"
paradigm.

It's true that we don't know whether write() causes an immediate or
delayed disk write, but we generally don't care that much.  What we do
care about is being able to ensure that a WAL write happens before the
data write, and with mmap we don't have control over that.
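
To spell out the ordering we rely on (a much-simplified sketch with
hypothetical helper names; not the real xlog/bufmgr code):

#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

typedef uint64_t LSN;

/* Stand-ins for the real WAL routines. */
extern LSN  wal_append(const void *record, size_t len);
extern void wal_flush_through(LSN lsn);

/*
 * The page lives in our own shared buffer, so nothing can reach disk
 * until the pwrite() below, and by then the WAL describing the change is
 * already durable.  With the page in an mmap'd region there is no point
 * at which we could make that guarantee.
 */
void
flush_dirty_page(int data_fd, off_t offset, const char *page8k,
                 const void *wal_record, size_t wal_len)
{
    LSN lsn = wal_append(wal_record, wal_len);   /* log the change */

    wal_flush_through(lsn);                      /* WAL reaches disk first */
    pwrite(data_fd, page8k, 8192, offset);       /* only now may the data
                                                    page go out */
}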

            regards, tom lane

Re: BBU Cache vs. spindles

From
Robert Haas
Date:
On Thu, Oct 28, 2010 at 5:26 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> James Mansion <james@mansionfamily.plus.com> writes:
>> Tom Lane wrote:
>>> The other and probably worse problem is that there's no application
>>> control over how soon changes to mmap'd pages get to disk.  An msync
>>> will flush them out, but the kernel is free to write dirty pages sooner.
>>> So if they're depending for consistency on writes not happening until
>>> msync, it's broken by design.  (This is one of the big reasons we don't
>>> use mmap'd space for Postgres disk buffers.)
>
>> Well, I agree that it sucks for the reason you give - but you use
>> write and that's *exactly* the same in terms of when it gets written,
>> as when you update a byte on an mmap'd page.
>
> Uh, no, it is not.  The difference is that we can update a byte in a
> shared buffer, and know that it *isn't* getting written out before we
> say so.  If the buffer were mmap'd then we'd have no control over that,
> which makes it mighty hard to obey the WAL "write log before data"
> paradigm.
>
> It's true that we don't know whether write() causes an immediate or
> delayed disk write, but we generally don't care that much.  What we do
> care about is being able to ensure that a WAL write happens before the
> data write, and with mmap we don't have control over that.

Well, we COULD keep the data in shared buffers, and then copy it into
an mmap()'d region rather than calling write(), but I'm not sure
there's any advantage to it.  Managing address space mappings is a
pain in the butt.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: BBU Cache vs. spindles

From
Aidan Van Dyk
Date:
On Fri, Oct 29, 2010 at 11:43 AM, Robert Haas <robertmhaas@gmail.com> wrote:

> Well, we COULD keep the data in shared buffers, and then copy it into
> an mmap()'d region rather than calling write(), but I'm not sure
> there's any advantage to it.  Managing address space mappings is a
> pain in the butt.

I could see this being a *theoretical* benefit in the case that the
background writer gains the ability to write out all blocks associated
with a file in order.  In that case, you might get a win because you
could get a single mmap of the entire file, and just wholesale memcpy
blocks across, then sync/unmap it.

This, of course, assumes a few things that must hold for it to be performant:
0) A list of blocks to be written, grouped by file, is readily available.
1) The pages you write to must be in the page cache, or your memcpy is
going to fault them in.  With a plain write, you don't need the
over-written page in the cache.
2) Now, instead of the torn-page problem being FS block/sector size
based, you can have a possibly arbitrary amount of the block memory
written when the kernel writes out the page.  You *really* need
full-page-writes.
3) The mmap overhead required for the kernel to setup the mappings is
less than the repeated syscalls of a simple write().

All those things seem like something that somebody could synthetically
benchmark to prove value before even trying to bolt it into PostgreSQL.
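
Something like this would probably do for a first cut (an untested
sketch; error handling omitted, and the file is assumed to already exist
at its full 1GB size).  Both paths push the same dirty 8k blocks out, one
via pwrite()/fsync(), the other via mmap()/memcpy()/msync():

#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define BLCKSZ   8192
#define SEGSIZE  (1024L * 1024L * 1024L)    /* one 1GB segment */

/* Path 1: plain pwrite() of each dirty block, one fsync at the end. */
static void
write_path(int fd, const char *block, const off_t *dirty, int ndirty)
{
    for (int i = 0; i < ndirty; i++)
        pwrite(fd, block, BLCKSZ, dirty[i] * BLCKSZ);
    fsync(fd);
}

/* Path 2: map the whole segment once, memcpy the dirty blocks, msync. */
static void
mmap_path(int fd, const char *block, const off_t *dirty, int ndirty)
{
    char *map = mmap(NULL, SEGSIZE, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);

    for (int i = 0; i < ndirty; i++)
        memcpy(map + dirty[i] * BLCKSZ, block, BLCKSZ);   /* may fault the
                                                             target page in
                                                             (point 1 above) */
    msync(map, SEGSIZE, MS_SYNC);
    munmap(map, SEGSIZE);
}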

a.

--
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

Re: BBU Cache vs. spindles

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Oct 28, 2010 at 5:26 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> It's true that we don't know whether write() causes an immediate or
>> delayed disk write, but we generally don't care that much.  What we do
>> care about is being able to ensure that a WAL write happens before the
>> data write, and with mmap we don't have control over that.

> Well, we COULD keep the data in shared buffers, and then copy it into
> an mmap()'d region rather than calling write(), but I'm not sure
> there's any advantage to it.  Managing address space mappings is a
> pain in the butt.

In principle that ought to be right about the same speed as using
write() to copy the data from shared buffers to kernel disk buffers,
anyway.

            regards, tom lane

Re: BBU Cache vs. spindles

From
Robert Haas
Date:
On Fri, Oct 29, 2010 at 11:56 AM, Aidan Van Dyk <aidan@highrise.ca> wrote:
> 1) The pages you write to must be in the page cache, or your memcpy is
> going to fault them in.  With a plain write, you don't need the
> over-written page in the cache.

I seem to remember a time many years ago when I got bitten by this
problem.  The fact that our I/O is in 8K pages means this could be a
pretty severe hit, I think.

> 2) Now, instead of the torn-page problem being FS block/sector size
> based, you can have a possibly arbitrary amount of the block memory
> written when the kernel writes out the page.  You *really* need
> full-page-writes.

Yeah.

> 3) The mmap overhead required for the kernel to setup the mappings is
> less than the repeated syscalls of a simple write().

You'd expect to save something from that; but on the other hand, at
least on 32-bit systems, there's a very limited number of 1GB files
that can be simultaneously mapped into one address space, and it's a
lot smaller than the number of file descriptors that you can have
open.   Rumor has it that cutting down the number of fds that can stay
open simultaneously is pretty bad for performance, so cutting it down
to a number you can count on one hand (maybe one finger) would
probably be bad.  Maybe on 64-bit it would be OK but it seems like an
awful lot of complexity for at most a minor savings (and a pretty bad
anti-savings if point #1 kicks in).

Anyway this is all totally off-topic...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: BBU Cache vs. spindles

From
Alvaro Herrera
Date:
Excerpts from Greg Smith's message of Thu Oct 21 14:04:17 -0300 2010:

> What I would like to do is beef up the documentation with some concrete
> examples of how to figure out if your cache and associated write path
> are working reliably or not.  It should be possible to include "does
> this handle full page writes correctly?" in that test suite.  Until we
> have something like that, I'm concerned that bugs in filesystem or
> controller handling may make full_page_writes unsafe even with a BBU,
> and we'd have no way for people to tell if that's true or not.

I think if you assume that there are bugs in the filesystem which you
need to protect against, you are already hosed.  I imagine there must be
some filesystem bug that makes it safe to have full_page_writes=on, but
unsafe to have full_page_writes=off; but I'd probably discard those as a
rare minority and thus not worth worrying about.

I agree it would be worth testing though.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: BBU Cache vs. spindles

From
david@lang.hm
Date:
On Fri, 29 Oct 2010, Robert Haas wrote:

> On Thu, Oct 28, 2010 at 5:26 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> James Mansion <james@mansionfamily.plus.com> writes:
>>> Tom Lane wrote:
>>>> The other and probably worse problem is that there's no application
>>>> control over how soon changes to mmap'd pages get to disk.  An msync
>>>> will flush them out, but the kernel is free to write dirty pages sooner.
>>>> So if they're depending for consistency on writes not happening until
>>>> msync, it's broken by design.  (This is one of the big reasons we don't
>>>> use mmap'd space for Postgres disk buffers.)
>>
>>> Well, I agree that it sucks for the reason you give - but you use
>>> write and that's *exactly* the same in terms of when it gets written,
>>> as when you update a byte on an mmap'd page.
>>
>> Uh, no, it is not.  The difference is that we can update a byte in a
>> shared buffer, and know that it *isn't* getting written out before we
>> say so.  If the buffer were mmap'd then we'd have no control over that,
>> which makes it mighty hard to obey the WAL "write log before data"
>> paradigm.
>>
>> It's true that we don't know whether write() causes an immediate or
>> delayed disk write, but we generally don't care that much.  What we do
>> care about is being able to ensure that a WAL write happens before the
>> data write, and with mmap we don't have control over that.
>
> Well, we COULD keep the data in shared buffers, and then copy it into
> an mmap()'d region rather than calling write(), but I'm not sure
> there's any advantage to it.  Managing address space mappings is a
> pain in the butt.

keep in mind that you have no way of knowing what order the data in the
mmap region gets written out to disk.

David Lang

Re: BBU Cache vs. spindles

From
James Mansion
Date:
Tom Lane wrote:
> Uh, no, it is not.  The difference is that we can update a byte in a
> shared buffer, and know that it *isn't* getting written out before we
>
Well, I don't know where you got the idea I was referring to that sort
of thing - it's the same as writing to a buffer before copying to the
mmap'd area.
> It's true that we don't know whether write() causes an immediate or
> delayed disk write, but we generally don't care that much.  What we do
>
Which is what I was referring to.
> care about is being able to ensure that a WAL write happens before the
> data write, and with mmap we don't have control over that.
>
>
I think you have just the same control either way, because you can only
force ordering with an appropriate explicit sync, and in the absence of
such a sync all bets are off for whether/when each disk page is written
out - and if you can't ensure that the controller and disk are
write-through, you'd better do a hardware cache flush too, right?

A shame that so many systems have relatively poor handling of that
hardware flush.


Re: BBU Cache vs. spindles

From
david@lang.hm
Date:
On Fri, 29 Oct 2010, James Mansion wrote:

> Tom Lane wrote:
>> Uh, no, it is not.  The difference is that we can update a byte in a
>> shared buffer, and know that it *isn't* getting written out before we
>>
> Well, I don't know where you got the idea I was referring to that sort
> of thing - it's the same as writing to a buffer before copying to the
> mmap'd area.
>> It's true that we don't know whether write() causes an immediate or
>> delayed disk write, but we generally don't care that much.  What we do
>>
> Which is what I was referring to.
>> care about is being able to ensure that a WAL write happens before the
>> data write, and with mmap we don't have control over that.
>>
>>
> I think you have just the same control either way, because you can only
> force ordering with an appropriate explicit sync, and in the absence of
> such a sync all bets are off for whether/when each disk page is written
> out - and if you can't ensure that the controller and disk are
> write-through, you'd better do a hardware cache flush too, right?
>
> A shame that so many systems have relatively poor handling of that hardware
> flush.

the issue is that when you update an mmap'd chunk of data, it could be
written out immediately without you doing _anything_ (and thanks to
multiple cores/threads, it could get written out while you are still in
the middle of updating it). When you update an internal buffer and then
write that, you know that nothing will hit the disk before you issue the
write command.

David Lang

Re: BBU Cache vs. spindles

From
Bruce Momjian
Date:
Greg Smith wrote:
> Kevin Grittner wrote:
> > I assume that we send a full
> > 8K to the OS cache, and the file system writes disk sectors
> > according to its own algorithm.  With either platters or BBU cache,
> > the data is persisted on fsync; why do you see a risk with one but
> > not the other
>
> I'd like a 10 minute argument please.  I started to write something to
> refute this, only to clarify in my head the sequence of events that
> leads to the most questionable result, where I feel a bit less certain
> than I did before of the safety here.  Here is the worst case I believe
> you're describing:
>
> 1) Transaction is written to the WAL and sync'd; client receives
> COMMIT.  Since full_page_writes is off, the data in the WAL consists
> only of the delta of what changed on the page.
> 2) 8K database page is written to OS cache
> 3) PG calls fsync to force the database block out
> 4) OS writes first 4K block of the change to the BBU write cache.  Worst
> case, this fills the cache, and it takes a moment for some random writes
> to process before it has space to buffer again (makes this more likely
> to happen, but it's not required to see the failure case here)
> 5) Sudden power interruption, second half of the page write is lost
> 6) Server restarts
> 7) That 4K write is now replayed from the battery's cache
>
> At this point, you now have a torn 8K page, with 1/2 old and 1/2 new

Based on this report, I think we need to update our documentation and
backpatch removal of text that says that BBU users can safely turn off
full-page writes.  Patch attached.

I think we have fallen into a trap I remember from the late 1990's where
I was assuming that an 8k-block based file system would write to the
disk atomically in 8k segments, which of course it cannot.  My bet is
that even if you write to the kernel in 8k pages, and have an 8k file
system, the disk is still accessed via 512-byte blocks, even with a BBU.
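
To put rough numbers on it: an 8192-byte page is 16 of those 512-byte
sectors (or 2 sectors on a newer 4kB-sector drive), so a power failure
that arrives after, say, 9 of the 16 sectors have been accepted leaves a
page that is part new and part old; with full_page_writes off, the WAL
holds only the delta, not an image from which to rebuild that page.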

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index a2724fa..1e67bbd 100644
*** /tmp/pgrevert.14281/7sLqTb_wal.sgml    Tue Nov 30 21:57:17 2010
--- doc/src/sgml/wal.sgml    Tue Nov 30 21:56:49 2010
***************
*** 164,173 ****
     <productname>PostgreSQL</> periodically writes full page images to
     permanent WAL storage <emphasis>before</> modifying the actual page on
     disk. By doing this, during crash recovery <productname>PostgreSQL</> can
!    restore partially-written pages.  If you have a battery-backed disk
!    controller or file-system software that prevents partial page writes
!    (e.g., ZFS),  you can turn off this page imaging by turning off the
!    <xref linkend="guc-full-page-writes"> parameter.
    </para>
   </sect1>

--- 164,175 ----
     <productname>PostgreSQL</> periodically writes full page images to
     permanent WAL storage <emphasis>before</> modifying the actual page on
     disk. By doing this, during crash recovery <productname>PostgreSQL</> can
!    restore partially-written pages.  If you have file-system software
!    that prevents partial page writes (e.g., ZFS),  you can turn off
!    this page imaging by turning off the <xref
!    linkend="guc-full-page-writes"> parameter. Battery-Backed unit
!    (BBU) disk controllers do not prevent partial page writes unless
!    they guarantee that data is written to the BBU as full (8kB) pages.
    </para>
   </sect1>


Re: BBU Cache vs. spindles

From
Bruce Momjian
Date:
Kevin Grittner wrote:
> Greg Smith <greg@2ndquadrant.com> wrote:
>
> > I think Kevin's point here may be that if your fsync isn't
> > reliable, you're always in trouble.  But if your fsync is good,
> > even torn pages should be repairable by the deltas written to the
> > WAL
>
> I was actually just arguing that a BBU doesn't eliminate a risk
> here; if there is a risk with production-quality disk drives, there
> is a risk with a controller with a BBU cache.  The BBU cache just
> tends to reduce the window of time in which corruption can occur.  I
> wasn't too sure of *why* there was a risk, but Tom's post cleared
> that up.
>
> I wonder why we need to expose this GUC at all -- perhaps it should
> be off when fsync is off and on otherwise?  Leaving it on without
> fsync is just harming performance for not much benefit, and turning
> it off with fsync seems to be saying that you are willing to
> tolerate a known risk of database corruption, just not quite so much
> as you have without fsync.  In reality it seems most likely to be a
> mistake, either way.

According to our docs, and my submitted patch, if you are using ZFS then
you can turn off full-page writes, so the ability to turn them off is
still useful.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +

Re: BBU Cache vs. spindles

From
Bruce Momjian
Date:
Greg Smith wrote:
> Tom Lane wrote:
> > You've got entirely too simplistic a view of what the "delta" might be,
> > I fear.  In particular there are various sorts of changes that involve
> > inserting the data carried in the WAL record and shifting pre-existing
> > data around to make room, or removing an item and moving remaining data
> > around.  If you try to replay that type of action against a torn page,
> > you'll get corrupted results.
> >
>
> I wasn't sure exactly how those were encoded, thanks for the
> clarification.  Given that, it seems to me there are only two situations
> where full_page_writes is safe to turn off:
>
> 1) The operating system block size is exactly the same database block
> size, and all writes are guaranteed to be atomic to that block size.

Is that true?  I have no idea.  I thought everything was done at the
512-byte block level.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +

Re: BBU Cache vs. spindles

From
"Pierre C"
Date:
> Is that true?  I have no idea.  I thought everything was done at the
> 512-byte block level.

Newer disks (2TB and up) can have 4k sectors, but this still means a page
spans several sectors.

Re: BBU Cache vs. spindles

From
Bruce Momjian
Date:
Pierre C wrote:
>
> > Is that true?  I have no idea.  I thought everything was done at the
> > 512-byte block level.
>
> Newer disks (2TB and up) can have 4k sectors, but this still means a page
> spans several sectors.

Yes, I had heard about that.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +

Re: BBU Cache vs. spindles

From
Bruce Momjian
Date:
Bruce Momjian wrote:
> Greg Smith wrote:
> > Kevin Grittner wrote:
> > > I assume that we send a full
> > > 8K to the OS cache, and the file system writes disk sectors
> > > according to its own algorithm.  With either platters or BBU cache,
> > > the data is persisted on fsync; why do you see a risk with one but
> > > not the other
> >
> > I'd like a 10 minute argument please.  I started to write something to
> > refute this, only to clarify in my head the sequence of events that
> > leads to the most questionable result, where I feel a bit less certain
> > than I did before of the safety here.  Here is the worst case I believe
> > you're describing:
> >
> > 1) Transaction is written to the WAL and sync'd; client receives
> > COMMIT.  Since full_page_writes is off, the data in the WAL consists
> > only of the delta of what changed on the page.
> > 2) 8K database page is written to OS cache
> > 3) PG calls fsync to force the database block out
> > 4) OS writes first 4K block of the change to the BBU write cache.  Worst
> > case, this fills the cache, and it takes a moment for some random writes
> > to process before it has space to buffer again (makes this more likely
> > to happen, but it's not required to see the failure case here)
> > 5) Sudden power interruption, second half of the page write is lost
> > 6) Server restarts
> > 7) That 4K write is now replayed from the battery's cache
> >
> > At this point, you now have a torn 8K page, with 1/2 old and 1/2 new
>
> Based on this report, I think we need to update our documentation and
> backpatch removal of text that says that BBU users can safely turn off
> full-page writes.  Patch attached.
>
> I think we have fallen into a trap I remember from the late 1990's where
> I was assuming that an 8k-block based file system would write to the
> disk atomically in 8k segments, which of course it cannot.  My bet is
> that even if you write to the kernel in 8k pages, and have an 8k file
> system, the disk is still accessed via 512-byte blocks, even with a BBU.

Doc patch applied.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index a2724fa..1e67bbd 100644
*** /tmp/pgrevert.14281/7sLqTb_wal.sgml    Tue Nov 30 21:57:17 2010
--- doc/src/sgml/wal.sgml    Tue Nov 30 21:56:49 2010
***************
*** 164,173 ****
     <productname>PostgreSQL</> periodically writes full page images to
     permanent WAL storage <emphasis>before</> modifying the actual page on
     disk. By doing this, during crash recovery <productname>PostgreSQL</> can
!    restore partially-written pages.  If you have a battery-backed disk
!    controller or file-system software that prevents partial page writes
!    (e.g., ZFS),  you can turn off this page imaging by turning off the
!    <xref linkend="guc-full-page-writes"> parameter.
    </para>
   </sect1>

--- 164,175 ----
     <productname>PostgreSQL</> periodically writes full page images to
     permanent WAL storage <emphasis>before</> modifying the actual page on
     disk. By doing this, during crash recovery <productname>PostgreSQL</> can
!    restore partially-written pages.  If you have file-system software
!    that prevents partial page writes (e.g., ZFS),  you can turn off
!    this page imaging by turning off the <xref
!    linkend="guc-full-page-writes"> parameter. Battery-Backed unit
!    (BBU) disk controllers do not prevent partial page writes unless
!    they guarantee that data is written to the BBU as full (8kB) pages.
    </para>
   </sect1>