Thread: SSD + RAID

From:
Laszlo Nagy
Date:

Hello,

I'm about to buy SSD drive(s) for a database. For decision making, I
used this tech report:

http://techreport.com/articles.x/16255/9
http://techreport.com/articles.x/16255/10

Here are my concerns:

    * I need at least 32GB disk space. So DRAM based SSD is not a real
      option. I would have to buy 8x4GB memory, costs a fortune. And
      then it would still not have redundancy.
    * I could buy two X25-E drives and have 32GB disk space, and some
      redundancy. This would cost about $1600, not counting the RAID
      controller. It is on the edge.
    * I could also buy many cheaper MLC SSD drives. They cost about
      $140. So even with 10 drives, I'm at $1400. I could put them in
      RAID6, have much more disk space (256GB), high redundancy and
      POSSIBLY good read/write speed. Of course then I need to buy a
      good RAID controller.

My question is about the last option. Are there any good RAID cards that
are optimized (or can be optimized) for SSD drives? Do any of you have
experience in using many cheaper SSD drives? Is it a bad idea?

Thank you,

   Laszlo


From:
Karl Denninger
Date:

Laszlo Nagy wrote:
> Hello,
>
> I'm about to buy SSD drive(s) for a database. For decision making, I
> used this tech report:
>
> http://techreport.com/articles.x/16255/9
> http://techreport.com/articles.x/16255/10
>
> Here are my concerns:
>
>    * I need at least 32GB disk space. So DRAM based SSD is not a real
>      option. I would have to buy 8x4GB memory, costs a fortune. And
>      then it would still not have redundancy.
>    * I could buy two X25-E drives and have 32GB disk space, and some
>      redundancy. This would cost about $1600, not counting the RAID
>      controller. It is on the edge.
>    * I could also buy many cheaper MLC SSD drives. They cost about
>      $140. So even with 10 drives, I'm at $1400. I could put them in
>      RAID6, have much more disk space (256GB), high redundancy and
>      POSSIBLY good read/write speed. Of course then I need to buy a
>      good RAID controller.
>
> My question is about the last option. Are there any good RAID cards
> that are optimized (or can be optimized) for SSD drives? Do any of you
> have experience in using many cheaper SSD drives? Is it a bad idea?
>
> Thank you,
>
>   Laszlo
>
Note that some RAID controllers (3Ware in particular) refuse to
recognize the MLC drives, in particular, they act as if the OCZ Vertex
series do not exist when connected.

I don't know what they're looking for (perhaps some indication that
actual rotation is happening?) but this is a potential problem.... make
sure your adapter can talk to these things!

BTW I have done some benchmarking with Postgresql against these drives
and they are SMOKING fast.

-- Karl

From:
Laszlo Nagy
Date:

> Note that some RAID controllers (3Ware in particular) refuse to
> recognize the MLC drives, in particular, they act as if the OCZ Vertex
> series do not exist when connected.
>
> I don't know what they're looking for (perhaps some indication that
> actual rotation is happening?) but this is a potential problem.... make
> sure your adapter can talk to these things!
>
> BTW I have done some benchmarking with Postgresql against these drives
> and they are SMOKING fast.
>
I was thinking about ARECA 1320 with 2GB memory + BBU. Unfortunately, I
cannot find information about using ARECA cards with SSD drives. I'm
also not sure how they would work together. I guess the RAID cards are
optimized for conventional disks. They read/write data in bigger blocks
and they optimize the order of reading/writing for physical cylinders. I
know for sure that this particular areca card has an Intel dual core IO
processor and its own embedded operating system. I guess it could be
tuned for SSD drives, but I don't know how.

I was hoping that with a RAID 6 setup, write speed (which is slower for
cheaper flash based SSD drives) would dramatically increase, because
information is written simultaneously to 10 drives. With a very small block
size, it would probably be true. But... what if the RAID card uses
bigger block sizes, and - say - I want to update much smaller blocks in
the database?
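A back-of-the-envelope model of my worry (the chunk size and the read-modify-write behavior here are my assumptions, not anything I know about this particular card):

```python
import math

# Rough model of the physical I/O for one small logical update on RAID 6,
# assuming a read-modify-write implementation. All numbers are illustrative.

def raid6_small_update_io(update_bytes, chunk_bytes):
    """Return (chunk_reads, chunk_writes) for a sub-stripe update."""
    chunks = max(1, math.ceil(update_bytes / chunk_bytes))
    # Read old data + P and Q parity, then write new data + P and Q parity.
    return chunks + 2, chunks + 2

# An 8 KB page update against 64 KB chunks costs 3 chunk reads and
# 3 chunk writes: the array's chunk size, not the database block size,
# sets the unit of I/O.
print(raid6_small_update_io(8 * 1024, 64 * 1024))
```

So if the card's stripe chunk is much bigger than the database block, small updates pay for whole chunks.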

My other option is to buy two SLC SSD drives and use RAID 1. It would
cost about the same, but with less redundancy and less capacity. Which is
faster: 8-10 MLC disks in RAID 6 with a good caching controller, or
two SLC disks in RAID 1?

Thanks,

   Laszlo


From:
Marcos Ortiz Valmaseda
Date:

This is very fast.
There are many whitepapers about this on IT Toolbox, specifically
in the ERP and DataCenter sections.

It would be good if we could share all the tests we run on the
Project Wiki.

Regards

On Nov 13, 2009, at 7:02 AM, Karl Denninger wrote:

> Laszlo Nagy wrote:
>> Hello,
>>
>> I'm about to buy SSD drive(s) for a database. For decision making, I
>> used this tech report:
>>
>> http://techreport.com/articles.x/16255/9
>> http://techreport.com/articles.x/16255/10
>>
>> Here are my concerns:
>>
>>   * I need at least 32GB disk space. So DRAM based SSD is not a real
>>     option. I would have to buy 8x4GB memory, costs a fortune. And
>>     then it would still not have redundancy.
>>   * I could buy two X25-E drives and have 32GB disk space, and some
>>     redundancy. This would cost about $1600, not counting the RAID
>>     controller. It is on the edge.
>>   * I could also buy many cheaper MLC SSD drives. They cost about
>>     $140. So even with 10 drives, I'm at $1400. I could put them in
>>     RAID6, have much more disk space (256GB), high redundancy and
>>     POSSIBLY good read/write speed. Of course then I need to buy a
>>     good RAID controller.
>>
>> My question is about the last option. Are there any good RAID cards
>> that are optimized (or can be optimized) for SSD drives? Do any of
>> you
>> have experience in using many cheaper SSD drives? Is it a bad idea?
>>
>> Thank you,
>>
>>  Laszlo
>>
> Note that some RAID controllers (3Ware in particular) refuse to
> recognize the MLC drives, in particular, they act as if the OCZ Vertex
> series do not exist when connected.
>
> I don't know what they're looking for (perhaps some indication that
> actual rotation is happening?) but this is a potential problem....
> make
> sure your adapter can talk to these things!
>
> BTW I have done some benchmarking with Postgresql against these drives
> and they are SMOKING fast.
>
> -- Karl
> <karl.vcf>
> --
> Sent via pgsql-performance mailing list ()
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance


From:
Scott Marlowe
Date:

2009/11/13 Laszlo Nagy <>:
> Hello,
>
> I'm about to buy SSD drive(s) for a database. For decision making, I used
> this tech report:
>
> http://techreport.com/articles.x/16255/9
> http://techreport.com/articles.x/16255/10
>
> Here are my concerns:
>
>   * I need at least 32GB disk space. So DRAM based SSD is not a real
>     option. I would have to buy 8x4GB memory, costs a fortune. And
>     then it would still not have redundancy.
>   * I could buy two X25-E drives and have 32GB disk space, and some
>     redundancy. This would cost about $1600, not counting the RAID
>     controller. It is on the edge.

I'm not sure a RAID controller brings much of anything to the table with SSDs.

>   * I could also buy many cheaper MLC SSD drives. They cost about
>     $140. So even with 10 drives, I'm at $1400. I could put them in
>     RAID6, have much more disk space (256GB), high redundancy and

I think RAID6 is gonna reduce the throughput due to overhead to
something far less than what a software RAID-10 would achieve.

>     POSSIBLY good read/write speed. Of course then I need to buy a
>     good RAID controller.

I'm guessing that if you spent whatever money you were gonna spend on
more SSDs you'd come out ahead, assuming you had somewhere to put
them.

> My question is about the last option. Are there any good RAID cards that are
> optimized (or can be optimized) for SSD drives? Do any of you have
> experience in using many cheaper SSD drives? Is it a bad idea?

This I don't know.  Some quick googling shows the Areca 1680ix and
Adaptec 5 Series to be able to handle Samsung SSDs.

From:
Merlin Moncure
Date:

On Fri, Nov 13, 2009 at 9:48 AM, Scott Marlowe <> wrote:
> I think RAID6 is gonna reduce the throughput due to overhead to
> something far less than what a software RAID-10 would achieve.

I was wondering about this.  I think raid 5/6 might be a better fit
for SSD than traditional drives arrays.  Here's my thinking:

*) flash SSD reads are cheaper than writes.  With 6 or more drives,
less total data has to be written in Raid 5 than Raid 10.  The main
component of raid 5 performance penalty is that for each written
block, it has to be read first and then written...incurring rotational
latency, etc.   SSD does not have this problem.

*) flash is much more expensive in terms of storage/$.

*) flash (at least the intel stuff) is so fast relative to what we are
used to, that the point of using flash in raid is more for fault
tolerance than performance enhancement.  I don't have data to support
this, but I suspect that even with relatively small amount of the
slower MLC drives in raid, postgres will become cpu bound for most
applications.

merlin

From:
Heikki Linnakangas
Date:

Laszlo Nagy wrote:
>    * I need at least 32GB disk space. So DRAM based SSD is not a real
>      option. I would have to buy 8x4GB memory, costs a fortune. And
>      then it would still not have redundancy.

At 32GB database size, I'd seriously consider just buying a server with
a regular hard drive or a small RAID array for redundancy, and stuffing
16 or 32 GB of RAM into it to ensure everything is cached. That's tried
and tested technology.

I don't know how you came to the 32 GB figure, but keep in mind that
administration is a lot easier if you have plenty of extra disk space
for things like backups, dumps+restore, temporary files, upgrades etc.
So if you think you'd need 32 GB of disk space, I'm guessing that 16 GB
of RAM would be enough to hold all the hot data in cache. And if you
choose a server with enough DIMM slots, you can expand easily if needed.

Just my 2 cents, I'm not really an expert on hardware..

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

From:
Merlin Moncure
Date:

2009/11/13 Heikki Linnakangas <>:
> Laszlo Nagy wrote:
>>    * I need at least 32GB disk space. So DRAM based SSD is not a real
>>      option. I would have to buy 8x4GB memory, costs a fortune. And
>>      then it would still not have redundancy.
>
> At 32GB database size, I'd seriously consider just buying a server with
> a regular hard drive or a small RAID array for redundancy, and stuffing
> 16 or 32 GB of RAM into it to ensure everything is cached. That's tried
> and tested technology.

lots of ram doesn't help you if:
*) your database gets written to a lot and you have high performance
requirements
*) your data is important

(if either of the above is not true or even partially true, then your
advice is spot on)

merlin

From:
Greg Smith
Date:

In order for a drive to work reliably for database use such as for
PostgreSQL, it cannot have a volatile write cache.  You either need a
write cache with a battery backup (and a UPS doesn't count), or to turn
the cache off.  The SSD performance figures you've been looking at are
with the drive's write cache turned on, which means they're completely
fictitious and exaggerated upwards for your purposes.  In the real
world, that will result in database corruption after a crash one day.
No one on the drive benchmarking side of the industry seems to have
picked up on this, so you can't use any of those figures.  I'm not even
sure right now whether drives like Intel's will even meet their lifetime
expectations if they aren't allowed to use their internal volatile write
cache.
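One way to see whether a cache in the path is absorbing fsyncs is to time a burst of small synchronous writes; rates far above what one disk rotation per commit allows mean something volatile is acknowledging them. A minimal sketch of the idea (not a real benchmark tool, just the principle):

```python
# Minimal fsync-rate probe: time a burst of small synchronous writes.
# If the reported rate is far above one disk rotation per commit
# (roughly 120/s for 7200 RPM media), some volatile cache in the path
# is acknowledging writes before they reach the platter or flash.
import os
import tempfile
import time

def fsync_rate(n=200, size=8192):
    fd, path = tempfile.mkstemp()
    try:
        payload = b"x" * size
        start = time.perf_counter()
        for _ in range(n):
            os.write(fd, payload)
            os.fsync(fd)  # request that the write reach stable storage
        return n / (time.perf_counter() - start)
    finally:
        os.close(fd)
        os.unlink(path)

print("fsyncs/sec: %.0f" % fsync_rate())
```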

Here's two links you should read and then reconsider your whole design:

http://www.mysqlperformanceblog.com/2009/03/02/ssd-xfs-lvm-fsync-write-cache-barrier-and-lost-transactions/
http://petereisentraut.blogspot.com/2009/07/solid-state-drive-benchmarks-and-write.html

I can't even imagine how bad the situation would be if you decide to
wander down the "use a bunch of really cheap SSD drives" path; these
things are barely usable for databases with Intel's hardware.  The needs
of people who want to throw SSD in a laptop and those of the enterprise
database market are really different, and if you believe doom
forecasting like the comments at
http://blogs.sun.com/BestPerf/entry/oracle_peoplesoft_payroll_sun_sparc
that gap is widening, not shrinking.

--
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
  www.2ndQuadrant.com


From:
Scott Carey
Date:



On 11/13/09 7:29 AM, "Merlin Moncure" <> wrote:

> On Fri, Nov 13, 2009 at 9:48 AM, Scott Marlowe <>
> wrote:
>> I think RAID6 is gonna reduce the throughput due to overhead to
>> something far less than what a software RAID-10 would achieve.
>
> I was wondering about this.  I think raid 5/6 might be a better fit
> for SSD than traditional drives arrays.  Here's my thinking:
>
> *) flash SSD reads are cheaper than writes.  With 6 or more drives,
> less total data has to be written in Raid 5 than Raid 10.  The main
> component of raid 5 performance penalty is that for each written
> block, it has to be read first and then written...incurring rotational
> latency, etc.   SSD does not have this problem.
>

For random writes, RAID 5 writes as much as RAID 10 (parity + data), and
more if the raid block size is larger than 8k.  With RAID 6 it writes 50%
more than RAID 10.
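The counts behind those statements, spelled out (assuming a read-modify-write implementation; illustrative only):

```python
# Physical block writes per single random logical write, assuming a
# read-modify-write implementation (illustrative, not vendor-specific).

def writes_per_random_write(level):
    return {
        "raid10": 2,  # the block plus its mirror copy
        "raid5": 2,   # the block plus one updated parity block
        "raid6": 3,   # the block plus two updated parity blocks (P and Q)
    }[level]

# RAID 6 writes 50% more than RAID 10 for a single random block:
assert writes_per_random_write("raid6") == 1.5 * writes_per_random_write("raid10")
# RAID 5 ties RAID 10 on writes, but also needs extra reads
# (old data + old parity) before it can compute the new parity.
```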

For streaming writes RAID 5 / 6 has an advantage however.

For SLC drives, there is really not much of a write performance penalty.


From:
Karl Denninger
Date:

Greg Smith wrote:
> In order for a drive to work reliably for database use such as for
> PostgreSQL, it cannot have a volatile write cache.  You either need a
> write cache with a battery backup (and a UPS doesn't count), or to
> turn the cache off.  The SSD performance figures you've been looking
> at are with the drive's write cache turned on, which means they're
> completely fictitious and exaggerated upwards for your purposes.  In
> the real world, that will result in database corruption after a crash
> one day.
If power is "unexpectedly" removed from the system, this is true.  But
the caches on the SSD controllers are BUFFERS.  An operating system
crash does not disrupt the data in them or cause corruption.  An
unexpected disconnection of the power source from the drive (due to
unplugging it or a power supply failure for whatever reason) is a
different matter.
>   No one on the drive benchmarking side of the industry seems to have
> picked up on this, so you can't use any of those figures.  I'm not
> even sure right now whether drives like Intel's will even meet their
> lifetime expectations if they aren't allowed to use their internal
> volatile write cache.
>
> Here's two links you should read and then reconsider your whole design:
> http://www.mysqlperformanceblog.com/2009/03/02/ssd-xfs-lvm-fsync-write-cache-barrier-and-lost-transactions/
>
> http://petereisentraut.blogspot.com/2009/07/solid-state-drive-benchmarks-and-write.html
>
>
> I can't even imagine how bad the situation would be if you decide to
> wander down the "use a bunch of really cheap SSD drives" path; these
> things are barely usable for databases with Intel's hardware.  The
> needs of people who want to throw SSD in a laptop and those of the
> enterprise database market are really different, and if you believe
> doom forecasting like the comments at
> http://blogs.sun.com/BestPerf/entry/oracle_peoplesoft_payroll_sun_sparc
> that gap is widening, not shrinking.
Again, it depends.

With the write cache off on these disks they still are huge wins for
very-heavy-read applications, which many are.  The issue is (as always)
operation mix - if you do a lot of inserts and updates then you suffer,
but a lot of database applications are in the high 90%+ SELECTs both in
frequency and data flow volume.  The lack of rotational and seek latency
in those applications is HUGE.

-- Karl Denninger

From:
Greg Smith
Date:

Karl Denninger wrote:
> If power is "unexpectedly" removed from the system, this is true.  But
> the caches on the SSD controllers are BUFFERS.  An operating system
> crash does not disrupt the data in them or cause corruption.  An
> unexpected disconnection of the power source from the drive (due to
> unplugging it or a power supply failure for whatever reason) is a
> different matter.
>
As standard operating procedure, I regularly get something writing heavily
to the database on hardware I'm suspicious of and power the box off
hard.  If at any time I suffer database corruption from this, the
hardware is unsuitable for database use; that should never happen.  This
is what I mean when I say something meets the mythical "enterprise"
quality.  Companies whose data is worth something can't operate in a
situation where money has been exchanged because a database commit was
recorded, only to lose that commit just because somebody tripped over
the power cord and it was in the buffer rather than on permanent disk.
That's just not acceptable, and the even bigger danger of the database
perhaps not coming back up at all after such a tiny disaster is also
very real with a volatile write cache.
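The pass/fail condition of that test can be stated mechanically: after the crash, the data read back must be an unbroken prefix of what was written, covering every acknowledged commit. A sketch of the verifier side (the record format is made up):

```python
# Sketch of the pull-the-plug check. Imagine a writer that appends
# numbered records 1, 2, 3, ... to a file, fsyncs after each one, and
# notes the highest sequence number it acknowledged as committed.
# After a hard power-off, the surviving file must hold an unbroken
# prefix 1..N with N >= that acknowledged number.

def check_acknowledged(records, last_acked):
    """records: sequence numbers read back from disk after the crash."""
    # No holes allowed: the file must be exactly the prefix 1..len(records).
    if records != list(range(1, len(records) + 1)):
        return False
    # Every commit acknowledged before the crash must still be there.
    return len(records) >= last_acked

assert check_acknowledged([1, 2, 3], last_acked=3)     # nothing lost: pass
assert not check_acknowledged([1, 2], last_acked=3)    # lost a commit: fail
assert check_acknowledged([1, 2, 3, 4], last_acked=3)  # unacked tail is fine
```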

> With the write cache off on these disks they still are huge wins for
> very-heavy-read applications, which many are.
Very read-heavy applications would do better to buy a ton of RAM instead
and just make sure they populate from permanent media (say by reading
everything in early at sequential rates to prime the cache).  There is
an extremely narrow use-case where SSDs are the right technology, and
it's only in a subset even of read-heavy apps where they make sense.

--
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
  www.2ndQuadrant.com


From:
Karl Denninger
Date:

Greg Smith wrote:
> Karl Denninger wrote:
>> If power is "unexpectedly" removed from the system, this is true.  But
>> the caches on the SSD controllers are BUFFERS.  An operating system
>> crash does not disrupt the data in them or cause corruption.  An
>> unexpected disconnection of the power source from the drive (due to
>> unplugging it or a power supply failure for whatever reason) is a
>> different matter.
>>
> As standard operating procedure, I regularly get something writing
> heavy to the database on hardware I'm suspicious of and power the box
> off hard.  If at any time I suffer database corruption from this, the
> hardware is unsuitable for database use; that should never happen.
> This is what I mean when I say something meets the mythical
> "enterprise" quality.  Companies whose data is worth something can't
> operate in a situation where money has been exchanged because a
> database commit was recorded, only to lose that commit just because
> somebody tripped over the power cord and it was in the buffer rather
> than on permanent disk.  That's just not acceptable, and the even
> bigger danger of the database perhaps not coming up altogether even
> after such a tiny disaster is also very real with a volatile write cache.
Yep.  The "plug test" is part of my standard "is this stable enough for
something I care about" checkout.
>> With the write cache off on these disks they still are huge wins for
>> very-heavy-read applications, which many are.
> Very read-heavy applications would do better to buy a ton of RAM
> instead and just make sure they populate from permanent media (say by
> reading everything in early at sequential rates to prime the cache).
> There is an extremely narrow use-case where SSDs are the right
> technology, and it's only in a subset even of read-heavy apps where
> they make sense.
I don't know about that in the general case - I'd say "it depends."

250GB of SSD for read-nearly-always applications is a LOT cheaper than
250GB of ECC'd DRAM.  The write performance issues can be handled by
clever use of controller technology as well (that is, turn off the
drive's "write cache" and use the BBU on the RAID adapter.)

I have a couple of applications where two 250GB SSD disks in a Raid 1
array with a BBU'd controller, with the disk drive cache off, is all-in
a fraction of the cost of sticking 250GB of volatile storage in a server
and reading in the data set (plus managing the occasional updates) from
"stable storage."  It is not as fast as stuffing the 250GB of RAM in a
machine but it's a hell of a lot faster than a big array of small
conventional drives in a setup designed for maximum IO-Ops.

One caution for those thinking of doing this - the incremental
improvement of this setup on PostgreSQL in a WRITE SIGNIFICANT environment
isn't NEARLY as impressive.  Indeed the performance in THAT case for
many workloads may only be 20 or 30% faster than even "reasonably
pedestrian" rotating media in a high-performance (lots of spindles and
thus stripes) configuration and it's more expensive (by a lot.)  If you
step up to the fast SAS drives on the rotating side there's little
argument for the SSD at all (again, assuming you don't intend to "cheat"
and risk data loss.)

Know your application and benchmark it.

-- Karl

From:
Merlin Moncure
Date:

On Fri, Nov 13, 2009 at 12:22 PM, Scott Carey <> wrote:
> On 11/13/09 7:29 AM, "Merlin Moncure" <> wrote:
>
>> On Fri, Nov 13, 2009 at 9:48 AM, Scott Marlowe <>
>> wrote:
>>> I think RAID6 is gonna reduce the throughput due to overhead to
>>> something far less than what a software RAID-10 would achieve.
>>
>> I was wondering about this.  I think raid 5/6 might be a better fit
>> for SSD than traditional drives arrays.  Here's my thinking:
>>
>> *) flash SSD reads are cheaper than writes.  With 6 or more drives,
>> less total data has to be written in Raid 5 than Raid 10.  The main
>> component of raid 5 performance penalty is that for each written
>> block, it has to be read first and then written...incurring rotational
>> latency, etc.   SSD does not have this problem.
>>
>
> For random writes, RAID 5 writes as much as RAID 10 (parity + data), and
> more if the raid block size is larger than 8k.  With RAID 6 it writes 50%
> more than RAID 10.

how does raid 5 write more if the block size is > 8k? raid 10 is also
striped, so has the same problem, right?  IOW, if the block size is 8k
and you need to write 16k sequentially the raid 5 might write out 24k
(two blocks + parity).  raid 10 always writes out 2x your data in
terms of blocks (raid 5 does only in the worst case).  For a SINGLE
block, it's always 2x your data for both raid 5 and raid 10, so what i
said above was not quite correct.

raid 6 is not going to outperform raid 10 ever IMO.  It's just a
slightly safer raid 5.  I was just wondering out loud if raid 5 might
give similar performance to raid 10 on flash based disks since there
is no rotational latency.  even if it did, I probably still wouldn't
use it...
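The arithmetic above, as a sketch (8k blocks; treating the 16k write as a best-case single-stripe RAID 5 update is an assumption):

```python
# Blocks physically written for a 16k sequential write with 8k blocks,
# comparing best-case RAID 5 (both data blocks share one parity update)
# against RAID 10 (every block is mirrored). Illustrative numbers only.
BLOCK = 8 * 1024

def raid5_best_case(data_bytes):
    blocks = data_bytes // BLOCK
    return blocks + 1  # data blocks plus one shared parity block

def raid10(data_bytes):
    blocks = data_bytes // BLOCK
    return blocks * 2  # each block is written twice

assert raid5_best_case(16 * 1024) * BLOCK == 24 * 1024  # the 24k above
assert raid10(16 * 1024) * BLOCK == 32 * 1024           # always 2x the data
```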

merlin

From:
Merlin Moncure
Date:

2009/11/13 Greg Smith <>:
> In order for a drive to work reliably for database use such as for
> PostgreSQL, it cannot have a volatile write cache.  You either need a write
> cache with a battery backup (and a UPS doesn't count), or to turn the cache
> off.  The SSD performance figures you've been looking at are with the
> drive's write cache turned on, which means they're completely fictitious and
> exaggerated upwards for your purposes.  In the real world, that will result
> in database corruption after a crash one day.  No one on the drive
> benchmarking side of the industry seems to have picked up on this, so you
> can't use any of those figures.  I'm not even sure right now whether drives
> like Intel's will even meet their lifetime expectations if they aren't
> allowed to use their internal volatile write cache.

hm.  I never understood why Peter was only able to turn up 400 iops
when others were turning up 4000+ (measured from bonnie).  This would
explain it.

Is it authoritatively known that the Intel drives true random write
ops is not what they are claiming?  If so,  then you are right..flash
doesn't make sense, at least not without a NV cache on the device.

merlin

From:
Brad Nicholson
Date:

Greg Smith wrote:
> Karl Denninger wrote:
>> With the write cache off on these disks they still are huge wins for
>> very-heavy-read applications, which many are.
> Very read-heavy applications would do better to buy a ton of RAM
> instead and just make sure they populate from permanent media (say by
> reading everything in early at sequential rates to prime the cache).
> There is an extremely narrow use-case where SSDs are the right
> technology, and it's only in a subset even of read-heavy apps where
> they make sense.

Out of curiosity, what are those narrow use cases where you think SSDs
are the correct technology?

--
Brad Nicholson  416-673-4106
Database Administrator, Afilias Canada Corp.


From:
Dave Crooke
Date:

Itching to jump in here :-)

There are a lot of things to trade off when choosing storage for a
database: performance for different parts of the workload,
reliability, performance in degraded mode (when a disk dies), backup
methodologies, etc. ... the mistake many people make is to overlook
the sub-optimal operating conditions, failure modes and recovery
paths.

Some thoughts:

- RAID-5 and RAID-6 have poor write performance, and terrible
performance in degraded mode - there are a few edge cases, but in
almost all cases you should be using RAID-10 for a database.

- Like most apps, the ultimate way to make a database perform is to
have most of it (or at least the working set) in RAM, preferably the
DB server buffer cache. This is why big banks run Oracle on an HP
Superdome with 1TB of RAM ... the $15m Hitachi data array is just
backing store :-)

- Personally, I'm an SSD skeptic ... the technology just isn't mature
enough for the data center. If you apply a typical OLTP workload, they
are going to die early deaths. The only case in which they will
materially improve performance is where you have a large data set with
lots of **totally random** reads, i.e. where buffer cache is
ineffective. In the words of TurboTax, "this is not common".

- If you're going to use synchronous write with a significant amount
of small transactions, then you need some reliable RAM (not SSD) to
commit log files into, which means a proper battery-backed RAID
controller / external SAN with write-back cache. For many apps though,
a synchronous commit simply isn't necessary: losing a few rows of data
during a crash is relatively harmless. For these apps, turning off
synchronous writes is an often overlooked performance tweak.


In summary, don't get distracted by shiny new objects like SSD and RAID-6 :-)


2009/11/13 Brad Nicholson <>:
> Greg Smith wrote:
>>
>> Karl Denninger wrote:
>>>
>>> With the write cache off on these disks they still are huge wins for
>>> very-heavy-read applications, which many are.
>>
>> Very read-heavy applications would do better to buy a ton of RAM instead
>> and just make sure they populate from permanent media (say by reading
>> everything in early at sequential rates to prime the cache).  There is an
>> extremely narrow use-case where SSDs are the right technology, and it's only
>> in a subset even of read-heavy apps where they make sense.
>
> Out of curiosity, what are those narrow use cases where you think SSD's are
> the correct technology?
>
> --
> Brad Nicholson  416-673-4106
> Database Administrator, Afilias Canada Corp.
>
>

From:
"Fernando Hevia"
Date:


> -----Mensaje original-----
> Laszlo Nagy
>
> My question is about the last option. Are there any good RAID
> cards that are optimized (or can be optimized) for SSD
> drives? Do any of you have experience in using many cheaper
> SSD drives? Is it a bad idea?
>
> Thank you,
>
>    Laszlo
>

I've never had an SSD to try yet, but I wonder: could software RAID +
fsync on SSD drives be regarded as a sound solution?
Shouldn't their write performance more than make up for the cost of fsync?

You could benchmark this setup yourself before purchasing a RAID card.


From:
Greg Smith
Date:

Brad Nicholson wrote:
> Out of curiosity, what are those narrow use cases where you think
> SSD's are the correct technology?
Dave Crooke did a good summary already, I see things like this:

 * You need to have a read-heavy app that's bigger than RAM, but not too
big so it can still fit on SSD.
 * You need reads to be dominated by random-access and uncached lookups,
so that system RAM used as a buffer cache doesn't help you much.
 * Writes have to be low to moderate, as the true write speed is much
lower for database use than you'd expect from benchmarks derived from
other apps.  And it's better if writes are biased toward adding data
rather than changing existing pages.

As far as what real-world apps have that profile, I like SSDs for small
to medium web applications that have to be responsive, where the user
shows up and wants their randomly distributed and uncached data with
minimal latency.

SSDs can also be used effectively as second-tier targeted storage for
things that have a performance-critical but small and random bit as part
of a larger design that doesn't have those characteristics; putting
indexes on SSD can work out well for example (and there the write
durability stuff isn't quite as critical, as you can always drop an
index and rebuild if it gets corrupted).

--
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
  www.2ndQuadrant.com


From:
Merlin Moncure
Date:

2009/11/13 Greg Smith <>:
> As far as what real-world apps have that profile, I like SSDs for small to
> medium web applications that have to be responsive, where the user shows up
> and wants their randomly distributed and uncached data with minimal latency.
> SSDs can also be used effectively as second-tier targeted storage for things
> that have a performance-critical but small and random bit as part of a
> larger design that doesn't have those characteristics; putting indexes on
> SSD can work out well for example (and there the write durability stuff
> isn't quite as critical, as you can always drop an index and rebuild if it
> gets corrupted).


Here's a bonnie++ result for Intel showing 14k seeks:
http://www.wlug.org.nz/HarddiskBenchmarks

bonnie++ only writes data back 10% of the time.  Why is Peter's
benchmark showing only 400 seeks? Is this all attributable to write
barrier? I'm not sure I'm buying that...

merlin

From:
Greg Smith
Date:

Fernando Hevia wrote:
> Shouldn't their write performance be more than a trade-off for fsync?
>
Not if you have sequential writes that are regularly fsync'd--which is
exactly how the WAL writes things out in PostgreSQL.  I think there's a
potential for SSD to reach a point where they can give good performance
even with their write caches turned off.  But it will require a more
robust software stack, like filesystems that really implement the write
barrier concept effectively for this use-case, for that to happen.

--
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
  www.2ndQuadrant.com


From:
"Kenny Gorman"
Date:

The FusionIO products are a little different.  They are card based vs
trying to emulate a traditional disk.  In terms of volatility, they have
an on-board capacitor that allows power to be supplied until all writes
drain.  They do not have a cache in front of them like a disk-type SSD
might.

I don't sell these things, I am just a fan.  I verified all this with
the Fusion IO techs before I replied.  Perhaps older versions didn't
have this functionality?  I am not sure.  I have already done some cold
power-off tests w/o problems, but I could up the workload a bit and
retest.  I will do a couple of 'pull the cable' tests on Monday or
Tuesday and report back how it goes.

Re the performance #'s...  Here is my post:

http://www.kennygorman.com/wordpress/?p=398

-kg


>In order for a drive to work reliably for database use such as for
>PostgreSQL, it cannot have a volatile write cache.  You either need a
>write cache with a battery backup (and a UPS doesn't count), or to turn
>the cache off.  The SSD performance figures you've been looking at are
>with the drive's write cache turned on, which means they're completely
>fictitious and exaggerated upwards for your purposes.  In the real
>world, that will result in database corruption after a crash one day. 
>No one on the drive benchmarking side of the industry seems to have
>picked up on this, so you can't use any of those figures.  I'm not even
>sure right now whether drives like Intel's will even meet their lifetime
>expectations if they aren't allowed to use their internal volatile write
>cache.
>
>Here's two links you should read and then reconsider your whole design:
>
>http://www.mysqlperformanceblog.com/2009/03/02/ssd-xfs-lvm-fsync-write-cache-barrier-and-lost-transactions/
>http://petereisentraut.blogspot.com/2009/07/solid-state-drive-benchmarks-and-write.html
>
>I can't even imagine how bad the situation would be if you decide to
>wander down the "use a bunch of really cheap SSD drives" path; these
>things are barely usable for databases with Intel's hardware.  The needs
>of people who want to throw SSD in a laptop and those of the enterprise
>database market are really different, and if you believe doom
>forecasting like the comments at
>http://blogs.sun.com/BestPerf/entry/oracle_peoplesoft_payroll_sun_sparc
>that gap is widening, not shrinking.

From:
Lists
Date:

Laszlo Nagy wrote:
> Hello,
>
> I'm about to buy SSD drive(s) for a database. For decision making, I
> used this tech report:
>
> http://techreport.com/articles.x/16255/9
> http://techreport.com/articles.x/16255/10
>
> Here are my concerns:
>
>    * I need at least 32GB disk space. So DRAM based SSD is not a real
>      option. I would have to buy 8x4GB memory, costs a fortune. And
>      then it would still not have redundancy.
>    * I could buy two X25-E drives and have 32GB disk space, and some
>      redundancy. This would cost about $1600, not counting the RAID
>      controller. It is on the edge.
This was the solution I went with (4 drives in a raid 10 actually). Not
a cheap solution, but the performance is amazing.

>    * I could also buy many cheaper MLC SSD drives. They cost about
>      $140. So even with 10 drives, I'm at $1400. I could put them in
>      RAID6, have much more disk space (256GB), high redundancy and
>      POSSIBLY good read/write speed. Of course then I need to buy a
>      good RAID controller.
>
> My question is about the last option. Are there any good RAID cards
> that are optimized (or can be optimized) for SSD drives? Do any of you
> have experience in using many cheaper SSD drives? Is it a bad idea?
>
> Thank you,
>
>   Laszlo
>
>


From:
Ivan Voras
Date:

Lists wrote:
> Laszlo Nagy wrote:
>> Hello,
>>
>> I'm about to buy SSD drive(s) for a database. For decision making, I
>> used this tech report:
>>
>> http://techreport.com/articles.x/16255/9
>> http://techreport.com/articles.x/16255/10
>>
>> Here are my concerns:
>>
>>    * I need at least 32GB disk space. So DRAM based SSD is not a real
>>      option. I would have to buy 8x4GB memory, costs a fortune. And
>>      then it would still not have redundancy.
>>    * I could buy two X25-E drives and have 32GB disk space, and some
>>      redundancy. This would cost about $1600, not counting the RAID
>>      controller. It is on the edge.
> This was the solution I went with (4 drives in a raid 10 actually). Not
> a cheap solution, but the performance is amazing.

I came across this article:

http://www.mysqlperformanceblog.com/2009/03/02/ssd-xfs-lvm-fsync-write-cache-barrier-and-lost-transactions/

It's from a Linux MySQL user so it's a bit confusing but it looks like
he has some reservations about performance vs reliability of the Intel
drives - apparently they have their own write cache and when it's
disabled performance drops sharply.

From:
Heikki Linnakangas
Date:

Merlin Moncure wrote:
> 2009/11/13 Heikki Linnakangas <>:
>> Laszlo Nagy wrote:
>>>    * I need at least 32GB disk space. So DRAM based SSD is not a real
>>>      option. I would have to buy 8x4GB memory, costs a fortune. And
>>>      then it would still not have redundancy.
>> At 32GB database size, I'd seriously consider just buying a server with
>> a regular hard drive or a small RAID array for redundancy, and stuffing
>> 16 or 32 GB of RAM into it to ensure everything is cached. That's tried
>> and tested technology.
>
> lots of ram doesn't help you if:
> *) your database gets written to a lot and you have high performance
> requirements

When all the (hot) data is cached, all writes are sequential writes to
the WAL, with the occasional flushing of the data pages at checkpoint.
The sequential write bandwidth of SSDs and HDDs is roughly the same.

I presume the fsync latency is a lot higher with HDDs, so if you're
running a lot of small write transactions, and don't want to risk losing
any recently committed transactions by setting synchronous_commit=off,
the usual solution is to get a RAID controller with a battery-backed up
cache. With a BBU cache, the fsync latency should be in the same
ballpark as with SSDs.
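The fsync latency difference described here is easy to measure directly. Here is a minimal sketch (the file location and iteration count are arbitrary) that times repeated small write+fsync cycles, which is the basic shape of a WAL commit:

```python
import os
import time
import tempfile

def fsync_latency(iterations=100, block=b"x" * 8192):
    """Time small write+fsync cycles, the basic shape of a WAL commit.
    Returns average seconds per fsync'd write."""
    fd, path = tempfile.mkstemp()
    try:
        start = time.time()
        for _ in range(iterations):
            os.write(fd, block)
            os.fsync(fd)          # ask the OS (and drive) to commit to disk
        elapsed = time.time() - start
    finally:
        os.close(fd)
        os.unlink(path)
    return elapsed / iterations

ms = fsync_latency() * 1000
print(f"~{ms:.2f} ms per fsync'd 8 kB write")
```

On a bare 7200 rpm disk this tends toward a platter rotation (several ms); behind a BBU cache, or a volatile drive write cache that cheats, it drops well under a millisecond, which is exactly why cheated fsyncs make benchmark numbers look so good.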

> *) your data is important

Huh? The data is safely on the hard disk in case of a crash. The RAM is
just for caching.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

From:
Merlin Moncure
Date:

On Sat, Nov 14, 2009 at 6:17 AM, Heikki Linnakangas
<> wrote:
>> lots of ram doesn't help you if:
>> *) your database gets written to a lot and you have high performance
>> requirements
>
> When all the (hot) data is cached, all writes are sequential writes to
> the WAL, with the occasional flushing of the data pages at checkpoint.
> The sequential write bandwidth of SSDs and HDDs is roughly the same.
>
> I presume the fsync latency is a lot higher with HDDs, so if you're
> running a lot of small write transactions, and don't want to risk losing
> any recently committed transactions by setting synchronous_commit=off,
> the usual solution is to get a RAID controller with a battery-backed up
> cache. With a BBU cache, the fsync latency should be in the same
> ballpark as with SSDs.

BBU raid controllers might only give better burst performance.  If you
are writing data randomly all over the volume, the cache will overflow
and performance will degrade.  Raid controllers degrade in different
fashions; at least one (the Perc 5) halted ALL access to the volume
while it spun out the cache (a bug, IMO).

>> *) your data is important
>
> Huh? The data is safely on the hard disk in case of a crash. The RAM is
> just for caching.

I was alluding to not being able to lose any transactions... in that
case you have to run with fsync on, synchronously.  You are then bound
by the volume's write capabilities; RAM only buffers reads.

merlin

From:
Heikki Linnakangas
Date:

Merlin Moncure wrote:
> On Sat, Nov 14, 2009 at 6:17 AM, Heikki Linnakangas
> <> wrote:
>>> lots of ram doesn't help you if:
>>> *) your database gets written to a lot and you have high performance
>>> requirements
>> When all the (hot) data is cached, all writes are sequential writes to
>> the WAL, with the occasional flushing of the data pages at checkpoint.
>> The sequential write bandwidth of SSDs and HDDs is roughly the same.
>>
>> I presume the fsync latency is a lot higher with HDDs, so if you're
>> running a lot of small write transactions, and don't want to risk losing
>> any recently committed transactions by setting synchronous_commit=off,
>> the usual solution is to get a RAID controller with a battery-backed up
>> cache. With a BBU cache, the fsync latency should be in the same
>> ballpark as with SSDs.
>
> BBU raid controllers might only give better burst performance.  If you
> are writing data randomly all over the volume, the cache will overflow
> and performance will degrade.

We're discussing a scenario where all the data fits in RAM. That's what
the large amount of RAM is for. The only thing that's being written to
disk is the WAL, which is sequential, and the occasional flush of data
pages from the buffer cache at checkpoints, which doesn't happen often
and will be spread over a period of time.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

From:
Laszlo Nagy
Date:

Heikki Linnakangas wrote:
> Laszlo Nagy wrote:
>
>>    * I need at least 32GB disk space. So DRAM based SSD is not a real
>>      option. I would have to buy 8x4GB memory, costs a fortune. And
>>      then it would still not have redundancy.
>>
>
> At 32GB database size, I'd seriously consider just buying a server with
> a regular hard drive or a small RAID array for redundancy, and stuffing
> 16 or 32 GB of RAM into it to ensure everything is cached. That's tried
> and tested technology.
>
32GB is for one table only. This server runs other applications, and you
need to leave space for sort memory, shared buffers etc. Buying 128GB
memory would solve the problem, maybe... but it is too expensive. And it
is not safe. Power out -> data loss.
> I don't know how you came to the 32 GB figure, but keep in mind that
> administration is a lot easier if you have plenty of extra disk space
> for things like backups, dumps+restore, temporary files, upgrades etc.
>
This disk space would be dedicated for a smaller tablespace, holding one
or two bigger tables with index scans. Of course I would never use an
SSD disk for storing database backups. It would be a waste of money.


  L


From:
Robert Haas
Date:

2009/11/14 Laszlo Nagy <>:
> 32GB is for one table only. This server runs other applications, and you
> need to leave space for sort memory, shared buffers etc. Buying 128GB memory
> would solve the problem, maybe... but it is too expensive. And it is not
> safe. Power out -> data loss.

Huh?

...Robert

From:
Merlin Moncure
Date:

On Sat, Nov 14, 2009 at 8:47 AM, Heikki Linnakangas
<> wrote:
> Merlin Moncure wrote:
>> On Sat, Nov 14, 2009 at 6:17 AM, Heikki Linnakangas
>> <> wrote:
>>>> lots of ram doesn't help you if:
>>>> *) your database gets written to a lot and you have high performance
>>>> requirements
>>> When all the (hot) data is cached, all writes are sequential writes to
>>> the WAL, with the occasional flushing of the data pages at checkpoint.
>>> The sequential write bandwidth of SSDs and HDDs is roughly the same.
>>>
>>> I presume the fsync latency is a lot higher with HDDs, so if you're
>>> running a lot of small write transactions, and don't want to risk losing
>>> any recently committed transactions by setting synchronous_commit=off,
>>> the usual solution is to get a RAID controller with a battery-backed up
>>> cache. With a BBU cache, the fsync latency should be in the same
>>> ballpark as with SSDs.
>>
>> BBU raid controllers might only give better burst performance.  If you
>> are writing data randomly all over the volume, the cache will overflow
>> and performance will degrade.
>
> We're discussing a scenario where all the data fits in RAM. That's what
> the large amount of RAM is for. The only thing that's being written to
> disk is the WAL, which is sequential, and the occasional flush of data
> pages from the buffer cache at checkpoints, which doesn't happen often
> and will be spread over a period of time.

We are basically in agreement, but regardless of the effectiveness of
your WAL implementation, raid controller, etc., if you have to write
data to what approximates random locations on a disk-based volume in a
sustained manner, you must eventually degrade to whatever the drive
can handle, plus whatever efficiency the checkpointer and the OS can
gain by grouping writes together.  Extra RAM mainly helps because it
can shave precious iops off the read side so you can use them for writing.

merlin

From:
Laszlo Nagy
Date:

Robert Haas wrote:
> 2009/11/14 Laszlo Nagy <>:
>
>> 32GB is for one table only. This server runs other applications, and you
>> need to leave space for sort memory, shared buffers etc. Buying 128GB memory
>> would solve the problem, maybe... but it is too expensive. And it is not
>> safe. Power out -> data loss.
>>
I'm sorry, I thought he was talking about keeping the database in memory
with fsync=off. Now I see he was only talking about the OS disk cache.

My server has 24GB RAM, and I cannot easily expand it unless I throw out
some 2GB modules, and buy more 4GB or 8GB modules. But... buying 4x8GB
ECC RAM (+throwing out 4x2GB RAM) is a lot more expensive than buying
some 64GB SSD drives. 95% of the table in question is not modified. Only
read (mostly with index scan). Only 5% is actively updated.

This is why I think, using SSD in my case would be effective.

Sorry for the confusion.

  L


From:
Laszlo Nagy
Date:

>>>
>>>    * I could buy two X25-E drives and have 32GB disk space, and some
>>>      redundancy. This would cost about $1600, not counting the RAID
>>>      controller. It is on the edge.
>> This was the solution I went with (4 drives in a raid 10 actually).
>> Not a cheap solution, but the performance is amazing.
>
> I've came across this article:
>
> http://www.mysqlperformanceblog.com/2009/03/02/ssd-xfs-lvm-fsync-write-cache-barrier-and-lost-transactions/
>
>
> It's from a Linux MySQL user so it's a bit confusing but it looks like
> he has some reservations about performance vs reliability of the Intel
> drives - apparently they have their own write cache and when it's
> disabled performance drops sharply.
Ok, I'm getting confused here. There is the WAL, which is written
sequentially. If the WAL is not corrupted, then it can be replayed on
next database startup. Please somebody enlighten me! In my mind, fsync
is only needed for the WAL. If I could configure postgresql to put the
WAL on a real hard drive that has BBU and write cache, then I cannot
lose data. Meanwhile, product table data could be placed on the SSD
drive, and I should be able to turn on write cache safely. Am I wrong?

  L


From:
Craig Ringer
Date:

On 15/11/2009 11:57 AM, Laszlo Nagy wrote:

> Ok, I'm getting confused here. There is the WAL, which is written
> sequentially. If the WAL is not corrupted, then it can be replayed on
> next database startup. Please somebody enlighten me! In my mind, fsync
> is only needed for the WAL. If I could configure postgresql to put the
> WAL on a real hard drive that has BBU and write cache, then I cannot
> lose data. Meanwhile, product table data could be placed on the SSD
> drive, and I should be able to turn on write cache safely. Am I wrong?

A change has been written to the WAL and fsync()'d, so Pg knows it's hit
disk. It can now safely apply the change to the tables themselves, and
does so, calling fsync() to tell the drive containing the tables to
commit those changes to disk.

The drive lies, returning success for the fsync when it's just cached
the data in volatile memory. Pg carries on, shortly deleting the WAL
archive the changes were recorded in or recycling it and overwriting it
with new change data. The SSD is still merrily buffering data to write
cache, and hasn't got around to writing your particular change yet.

The machine loses power.

Oops! A hole just appeared in history. A WAL replay won't re-apply the
changes that the database guaranteed had hit disk, but the changes never
made it onto the main database storage.

Possible fixes for this are:

- Don't let the drive lie about cache flush operations, ie disable write
buffering.

- Give Pg some way to find out, from the drive, when particular write
operations have actually hit disk. AFAIK there's no such mechanism at
present, and I don't think the drives are even capable of reporting this
data. If they were, Pg would have to be capable of applying entries from
the WAL "sparsely" to account for the way the drive's write cache
commits changes out-of-order, and Pg would have to maintain a map of
committed / uncommitted WAL records. Pg would need another map of
tablespace blocks to WAL records to know, when a drive write cache
commit notice came in, what record in what WAL archive was affected.
It'd also require Pg to keep WAL archives for unbounded and possibly
long periods of time, making disk space management for WAL much harder.
So - "not easy" is a bit of an understatement here.

You still need to turn off write caching.

--
Craig Ringer


From:
Laszlo Nagy
Date:

> A change has been written to the WAL and fsync()'d, so Pg knows it's hit
> disk. It can now safely apply the change to the tables themselves, and
> does so, calling fsync() to tell the drive containing the tables to
> commit those changes to disk.
>
> The drive lies, returning success for the fsync when it's just cached
> the data in volatile memory. Pg carries on, shortly deleting the WAL
> archive the changes were recorded in or recycling it and overwriting it
> with new change data. The SSD is still merrily buffering data to write
> cache, and hasn't got around to writing your particular change yet.
>
All right. I believe you. In the current Pg implementation, I need to
turn off the disk cache.

But.... I would like to ask some theoretical questions. It is just an
idea from me, and probably I'm wrong.
Here is a scenario:

#1. user wants to change something, resulting in a write_to_disk(data) call
#2. data is written into the WAL and fsync()-ed
#3. at this point the write_to_disk(data) call CAN RETURN, the user can
continue his work (the WAL is already written, changes cannot be lost)
#4. Pg can continue writing data onto the disk, and fsync() it.
#5. Then WAL archive data can be deleted.

Now maybe I'm wrong, but between #3 and #5, the data to be written is
kept in memory. This is basically a write cache, implemented in OS
memory. We could really handle it like a write cache. E.g. everything
would remain the same, except that we add some latency. We can wait some
time after the last modification of a given block, and then write it out.

Is it possible to do? If so, then we could turn off the write cache for
all drives except the one holding the WAL, and write speed would still
remain the same. I don't think that any SSD drive has more than some
megabytes of write cache. The same amount of write cache could easily be
implemented in OS memory, and then Pg would always know what hit the disk.
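The idea sketched above amounts to a small delayed write-back layer held in OS memory. A toy sketch of the mechanism (class name, block interface, and delay are all illustrative assumptions, not anything Pg actually provides) might look like:

```python
import time

class DelayedWriteCache:
    """Toy write-back cache: hold dirty blocks in memory and only write
    a block out once it has been quiet for `delay` seconds.  This is the
    easy half of the idea; the hard half (knowing when the *drive* has
    really committed a block) is what the rest of the thread shows is
    missing."""

    def __init__(self, backing, delay=0.5):
        self.backing = backing          # object with write_block(blockno, data)
        self.delay = delay
        self.dirty = {}                 # blockno -> (data, last_modified)

    def write(self, blockno, data):
        # Overwriting a dirty block just refreshes its timestamp.
        self.dirty[blockno] = (data, time.time())

    def flush_quiet_blocks(self):
        """Write out blocks untouched for at least `delay` seconds."""
        now = time.time()
        for blockno, (data, stamp) in list(self.dirty.items()):
            if now - stamp >= self.delay:
                self.backing.write_block(blockno, data)
                del self.dirty[blockno]
```

As the replies below point out, this can't match the SSD's own cache: the OS doesn't know the erase-block geometry, and the cache only helps if it can grow very large.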

Thanks,

   Laci


From:
Craig Ringer
Date:

On 15/11/2009 2:05 PM, Laszlo Nagy wrote:
>
>> A change has been written to the WAL and fsync()'d, so Pg knows it's hit
>> disk. It can now safely apply the change to the tables themselves, and
>> does so, calling fsync() to tell the drive containing the tables to
>> commit those changes to disk.
>>
>> The drive lies, returning success for the fsync when it's just cached
>> the data in volatile memory. Pg carries on, shortly deleting the WAL
>> archive the changes were recorded in or recycling it and overwriting it
>> with new change data. The SSD is still merrily buffering data to write
>> cache, and hasn't got around to writing your particular change yet.
>>
> All right. I believe you. In the current Pg implementation, I need to
> turn off the disk cache.

That's certainly my understanding. I've been wrong many times before :S

> #1. user wants to change something, resulting in a write_to_disk(data) call
> #2. data is written into the WAL and fsync()-ed
> #3. at this point the write_to_disk(data) call CAN RETURN, the user can
> continue his work (the WAL is already written, changes cannot be lost)
> #4. Pg can continue writing data onto the disk, and fsync() it.
> #5. Then WAL archive data can be deleted.
>
> Now maybe I'm wrong, but between #3 and #5, the data to be written is
> kept in memory. This is basically a write cache, implemented in OS
> memory. We could really handle it like a write cache. E.g. everything
> would remain the same, except that we add some latency. We can wait some
> time after the last modification of a given block, and then write it out.

I don't know enough about the whole affair to give you a good
explanation (I tried, and it just showed me how much I didn't know)
but here are a few issues:

- Pg doesn't know the erase block sizes or positions. It can't group
writes up by erase block except by hoping that, within a given file,
writing in page order will get the blocks to the disk in roughly
erase-block order. So your write caching isn't going to do anywhere near
as good a job as the SSD's can.

- The only way to make this help the SSD out much would be to use a LOT
of RAM for write cache and maintain a LOT of WAL archives. That's RAM
not being used for caching read data. The large number of WAL archives
means incredibly long WAL replay times after a crash.

- You still need a reliable way to tell the SSD "really flush your cache
now" after you've flushed the changes from your huge chunks of WAL files
and are getting ready to recycle them.

I was thinking that write ordering would be an issue too, as some
changes in the WAL would hit main disk before others that were earlier
in the WAL. However, I don't think that matters if full_page_writes are
on. If you replay from the start, you'll reapply some changes with older
versions, but they'll be corrected again by a later WAL record. So
ordering during WAL replay shouldn't be a problem. On the other hand,
the INCREDIBLY long WAL replay times during recovery would be a nightmare.

> I don't think that any SSD drive has more than some
> megabytes of write cache.

The big, lots-of-$$ ones have HUGE battery backed caches for exactly
this reason.

> The same amount of write cache could easily be
> implemented in OS memory, and then Pg would always know what hit the disk.

Really? How does Pg know what order the SSD writes things out from its
cache?

--
Craig Ringer

From:
Laszlo Nagy
Date:

> - Pg doesn't know the erase block sizes or positions. It can't group
> writes up by erase block except by hoping that, within a given file,
> writing in page order will get the blocks to the disk in roughly
> erase-block order. So your write caching isn't going to do anywhere near
> as good a job as the SSD's can.
>
Okay, I see. We cannot query erase block size from an SSD drive. :-(
>> I don't think that any SSD drive has more than some
>> megabytes of write cache.
>>
>
> The big, lots-of-$$ ones have HUGE battery backed caches for exactly
> this reason.
>
Heh, this is why they are so expensive. :-)
>> The same amount of write cache could easily be
>> implemented in OS memory, and then Pg would always know what hit the disk.
>>
>
> Really? How does Pg know what order the SSD writes things out from its
> cache?
>
I got the point. We cannot implement an efficient write cache without
much more knowledge about how that particular drive works.

So... the only solution that works well is to have much more RAM for
read cache, and much more RAM for write cache inside the RAID controller
(with BBU).

Thank you,

   Laszlo


From:
Craig James
Date:

I've wondered whether this would work for a read-mostly application: Buy a big RAM machine, like 64GB, with a crappy
little single disk.  Build the database, then make a really big RAM disk, big enough to hold the DB and the WAL.  Then
build a duplicate DB on another machine with a decent disk (maybe a 4-disk RAID10), and turn on WAL logging.

The system would be blazingly fast, and you'd just have to be sure before you shut it off to shut down Postgres and
copy the RAM files back to the regular disk.  And if you didn't, you could always recover from the backup.  Since it's a
read-mostly system, the WAL logging bandwidth wouldn't be too high, so even a modest machine would be able to keep up.

Any thoughts?

Craig

From:
Heikki Linnakangas
Date:

Craig James wrote:
> I've wondered whether this would work for a read-mostly application: Buy
> a big RAM machine, like 64GB, with a crappy little single disk.  Build
> the database, then make a really big RAM disk, big enough to hold the DB
> and the WAL.  Then build a duplicate DB on another machine with a decent
> disk (maybe a 4-disk RAID10), and turn on WAL logging.
>
> The system would be blazingly fast, and you'd just have to be sure
> before you shut it off to shut down Postgres and copy the RAM files back
> to the regular disk.  And if you didn't, you could always recover from
> the backup.  Since it's a read-mostly system, the WAL logging bandwidth
> wouldn't be too high, so even a modest machine would be able to keep up.

Should work, but I don't see any advantage over attaching the RAID array
directly to the 1st machine with the RAM and turning synchronous_commit=off.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

From:
Merlin Moncure
Date:

2009/11/13 Greg Smith <>:
> As far as what real-world apps have that profile, I like SSDs for small to
> medium web applications that have to be responsive, where the user shows up
> and wants their randomly distributed and uncached data with minimal latency.
> SSDs can also be used effectively as second-tier targeted storage for things
> that have a performance-critical but small and random bit as part of a
> larger design that doesn't have those characteristics; putting indexes on
> SSD can work out well for example (and there the write durability stuff
> isn't quite as critical, as you can always drop an index and rebuild if it
> gets corrupted).

I am right now talking to someone on postgresql irc who is measuring
15k iops from x25-e and no data loss following power plug test.  I am
becoming increasingly suspicious that peter's results are not
representative: given that 90% of bonnie++ seeks are read only, the
math doesn't add up, and they contradict broadly published tests on
the internet.  Has anybody independently verified the results?

merlin

From:
Brad Nicholson
Date:

On Tue, 2009-11-17 at 11:36 -0500, Merlin Moncure wrote:
> 2009/11/13 Greg Smith <>:
> > As far as what real-world apps have that profile, I like SSDs for small to
> > medium web applications that have to be responsive, where the user shows up
> > and wants their randomly distributed and uncached data with minimal latency.
> > SSDs can also be used effectively as second-tier targeted storage for things
> > that have a performance-critical but small and random bit as part of a
> > larger design that doesn't have those characteristics; putting indexes on
> > SSD can work out well for example (and there the write durability stuff
> > isn't quite as critical, as you can always drop an index and rebuild if it
> > gets corrupted).
>
> I am right now talking to someone on postgresql irc who is measuring
> 15k iops from x25-e and no data loss following power plug test.  I am
> becoming increasingly suspicious that peter's results are not
> representative: given that 90% of bonnie++ seeks are read only, the
> math doesn't add up, and they contradict broadly published tests on
> the internet.  Has anybody independently verified the results?

How many times have they run the plug test?  I've read other reports of
people (not on Postgres) losing data on this drive with the write cache
on.

--
Brad Nicholson  416-673-4106
Database Administrator, Afilias Canada Corp.



From:
Scott Marlowe
Date:

On Tue, Nov 17, 2009 at 9:54 AM, Brad Nicholson
<> wrote:
> On Tue, 2009-11-17 at 11:36 -0500, Merlin Moncure wrote:
>> 2009/11/13 Greg Smith <>:
>> > As far as what real-world apps have that profile, I like SSDs for small to
>> > medium web applications that have to be responsive, where the user shows up
>> > and wants their randomly distributed and uncached data with minimal latency.
>> > SSDs can also be used effectively as second-tier targeted storage for things
>> > that have a performance-critical but small and random bit as part of a
>> > larger design that doesn't have those characteristics; putting indexes on
>> > SSD can work out well for example (and there the write durability stuff
>> > isn't quite as critical, as you can always drop an index and rebuild if it
>> > gets corrupted).
>>
>> I am right now talking to someone on postgresql irc who is measuring
>> 15k iops from x25-e and no data loss following power plug test.  I am
>> becoming increasingly suspicious that peter's results are not
>> representative: given that 90% of bonnie++ seeks are read only, the
>> math doesn't add up, and they contradict broadly published tests on
>> the internet.  Has anybody independently verified the results?
>
> How many times have the run the plug test?  I've read other reports of
> people (not on Postgres) losing data on this drive with the write cache
> on.

When I run the plug test, it's on a pgbench database that's as big as
possible (scale factor ~4000), and I remove memory if there's a lot in
the server so that RAM is smaller than the db.  I run 100+ concurrent
clients, set checkpoint timeout to 30 minutes, make lots of checkpoint
segments (100 or so), and set completion target to 0.  Then after
about half the checkpoint timeout has passed, I issue a checkpoint from
the command line, take a deep breath, and pull the cord.
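That recipe maps onto a handful of postgresql.conf settings. A sketch of the configuration (values taken from the procedure above; `checkpoint_segments` is the 8.x-era knob under discussion, and the database name is a placeholder):

```ini
# postgresql.conf fragment for a pull-the-plug test (values from the
# procedure above; keep the dataset bigger than RAM)
checkpoint_timeout = 30min            # long window between automatic checkpoints
checkpoint_segments = 100             # plenty of WAL between checkpoints
checkpoint_completion_target = 0.0    # write checkpoint data out immediately

# shell side, sketched:
#   pgbench -i -s 4000 testdb          # scale 4000, tens of GB of data
#   pgbench -c 100 -T 3600 testdb      # 100+ concurrent clients
#   psql -c "CHECKPOINT;" testdb       # ~15 min in, force a checkpoint...
#   ...then pull the power cord and check for corruption on restart.
```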

From:
Peter Eisentraut
Date:

On tis, 2009-11-17 at 11:36 -0500, Merlin Moncure wrote:
> I am right now talking to someone on postgresql irc who is measuring
> 15k iops from x25-e and no data loss following power plug test.  I am
> becoming increasingly suspicious that peter's results are not
> representative: given that 90% of bonnie++ seeks are read only, the
> math doesn't add up, and they contradict broadly published tests on
> the internet.  Has anybody independently verified the results?

Notably, between my two blog posts and this email thread, there have
been claims of

400
1800
4000
7000
14000
15000
35000

IOPS (of some kind).

That alone should be cause for concern.


From:
Greg Smith
Date:

Merlin Moncure wrote:
> I am right now talking to someone on postgresql irc who is measuring
> 15k iops from x25-e and no data loss following power plug test.
The funny thing about Murphy is that he doesn't visit when things are
quiet.  It's quite possible the window for data loss on the drive is
very small.  Maybe you only see it one out of 10 pulls with a very
aggressive database-oriented write test.  Whatever the odd conditions
are, you can be sure you'll see them when there's a bad outage in actual
production though.

A good test program that is a bit better at introducing and detecting
the write cache issue is described at
http://brad.livejournal.com/2116715.html
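The core trick in that test program (Brad Fitzpatrick's diskchecker) is simple enough to sketch: write sequence-numbered, checksummed blocks, fsync each one, and record which writes were acknowledged; after the plug pull, any acknowledged-but-damaged block proves the drive lied about fsync. A minimal single-machine sketch of the write/verify halves (function names and block format are my own; the real tool also ships the acks to a second host so the record survives the power loss):

```python
import os
import struct
import zlib

BLOCK = 512

def write_acknowledged_blocks(path, count):
    """Write sequence-numbered, checksummed blocks, fsync'ing each one.
    Returns the sequence numbers whose fsync was acknowledged."""
    acked = []
    fd = os.open(path, os.O_WRONLY | os.O_CREAT)
    try:
        for seq in range(count):
            payload = struct.pack("<I", seq).ljust(BLOCK - 4, b"\0")
            rec = payload + struct.pack("<I", zlib.crc32(payload))
            os.pwrite(fd, rec, seq * BLOCK)
            os.fsync(fd)          # the drive now claims this is durable
            acked.append(seq)     # diskchecker sends this ack to another host
    finally:
        os.close(fd)
    return acked

def verify(path, acked):
    """After the crash: every acknowledged block must still be intact."""
    damaged = []
    with open(path, "rb") as f:
        for seq in acked:
            f.seek(seq * BLOCK)
            rec = f.read(BLOCK)
            payload, crc = rec[:-4], rec[-4:]
            if len(rec) != BLOCK or struct.pack("<I", zlib.crc32(payload)) != crc:
                damaged.append(seq)
    return damaged   # non-empty => the drive lied about fsync
```

On an intact filesystem `verify` returns an empty list; after a power pull, anything it reports was acknowledged durable but never actually written.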

--
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
  www.2ndQuadrant.com


From:
Merlin Moncure
Date:

On Tue, Nov 17, 2009 at 1:51 PM, Greg Smith <> wrote:
> Merlin Moncure wrote:
>>
>> I am right now talking to someone on postgresql irc who is measuring
>> 15k iops from x25-e and no data loss following power plug test.
>
> The funny thing about Murphy is that he doesn't visit when things are quiet.
>  It's quite possible the window for data loss on the drive is very small.
>  Maybe you only see it one out of 10 pulls with a very aggressive
> database-oriented write test.  Whatever the odd conditions are, you can be
> sure you'll see them when there's a bad outage in actual production though.
>
> A good test program that is a bit better at introducing and detecting the
> write cache issue is described at http://brad.livejournal.com/2116715.html

Sure, not disputing that...I don't have one to test myself, so I can't
vouch for the data being safe.  But what's up with the 400 iops
measured from bonnie++?  That's an order of magnitude slower than any
other published benchmark on the 'net, and I'm dying to get a little
clarification here.

merlin

From:
Mark Mielke
Date:

On 11/17/2009 01:51 PM, Greg Smith wrote:
> Merlin Moncure wrote:
>> I am right now talking to someone on postgresql irc who is measuring
>> 15k iops from x25-e and no data loss following power plug test.
> The funny thing about Murphy is that he doesn't visit when things are
> quiet.  It's quite possible the window for data loss on the drive is
> very small.  Maybe you only see it one out of 10 pulls with a very
> aggressive database-oriented write test.  Whatever the odd conditions
> are, you can be sure you'll see them when there's a bad outage in
> actual production though.
>
> A good test program that is a bit better at introducing and detecting
> the write cache issue is described at
> http://brad.livejournal.com/2116715.html
>

I've been following this thread with great interest in your results...
Please continue to share...

For the write cache issue - is it possible that the reduced power
utilization of an SSD allows a capacitor to complete all scheduled
writes, even with a large cache? Is this particular drive you are
discussing known to be insufficient, or is it really the technology, or
the maturity of the technology?

Cheers,
mark

--
Mark Mielke<>


From:
Greg Smith
Date:

Merlin Moncure wrote:
> But what's up with the 400 iops measured from bonnie++?
I don't know really.  SSD writes are really sensitive to block size and
the ability to chunk writes into larger chunks, so it may be that Peter
has just found the worst-case behavior and everybody else is seeing
something better than that.
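
One way to see that sensitivity is to time synchronous writes at a couple of block sizes; a rough sketch (file names and durations here are arbitrary, and a real evaluation should use a dedicated tool such as fio against the actual device):

```python
import os
import time

def sync_write_rate(path, block_size, seconds=0.5):
    """Time O_DSYNC writes of block_size bytes; return writes per second."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | getattr(os, "O_DSYNC", os.O_SYNC)
    fd = os.open(path, flags, 0o644)
    buf = b"\0" * block_size
    done, deadline = 0, time.monotonic() + seconds
    try:
        while time.monotonic() < deadline:
            os.write(fd, buf)  # each write must reach stable storage
            done += 1
    finally:
        os.close(fd)
    return done / seconds

# Single page-sized writes vs. something closer to a flash cell:
for bs in (8 * 1024, 128 * 1024):
    print(f"{bs // 1024:4d}K blocks: {sync_write_rate('bench.tmp', bs):8.0f} writes/s")
```

A drive that only looks fast when writes can be batched into large chunks will show a much bigger gap between those two numbers than its headline IOPS figure suggests.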

When the reports I get back from people I believe are competent--Vadim,
Peter--show worst-case results that are lucky to beat RAID10, I feel I
have to dismiss the higher values reported by people who haven't been so
careful.  And that's just about everybody else, which leaves me quite
suspicious of the true value of the drives.  The whole thing really sets
off my vendor hype reflex, and short of someone loaning me a drive to
test I'm not sure how to get past that.  The Intel drives are still just
a bit too expensive to buy one on a whim, such that I'll just toss it if
the drive doesn't live up to expectations.

--
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
  www.2ndQuadrant.com


From:
david@lang.hm
Date:

On Wed, 18 Nov 2009, Greg Smith wrote:

> Merlin Moncure wrote:
>> But what's up with the 400 iops measured from bonnie++?
> I don't know really.  SSD writes are really sensitive to block size and the
> ability to chunk writes into larger chunks, so it may be that Peter has just
> found the worst-case behavior and everybody else is seeing something better
> than that.
>
> When the reports I get back from people I believe are competant--Vadim,
> Peter--show worst-case results that are lucky to beat RAID10, I feel I have
> to dismiss the higher values reported by people who haven't been so careful.
> And that's just about everybody else, which leaves me quite suspicious of the
> true value of the drives.  The whole thing really sets off my vendor hype
> reflex, and short of someone loaning me a drive to test I'm not sure how to
> get past that.  The Intel drives are still just a bit too expensive to buy
> one on a whim, such that I'll just toss it if the drive doesn't live up to
> expectations.

keep in mind that bonnie++ isn't always going to reflect your real
performance.

I have run tests on some workloads that were definitely I/O limited, where
bonnie++ results that differed by a factor of 10x made no measurable
difference in application performance.  So I can easily believe there are
cases where bonnie++ numbers would not change but application performance
could be drastically different.

as always it can depend heavily on your workload. you really do need to
figure out how to get your hands on one for your own testing.

David Lang

From:
Kenny Gorman
Date:

I found a bit of time to play with this.

I started up a test with 20 concurrent processes all inserting into
the same table and committing after each insert.  The db was achieving
about 5000 inserts per second, and I kept it running for about 10
minutes.  The host was doing about 5MB/s of Physical I/O to the Fusion
IO drive. I set checkpoint segments very small (10).  I observed the
following message in the log: checkpoints are occurring too frequently
(16 seconds apart).  Then I pulled the cord.  On reboot I noticed that
Fusion IO replayed its log, then the filesystem (vxfs) did the same.
Then I started up the DB and observed it perform auto-recovery:

Nov 18 14:33:53 frutestdb002 postgres[5667]: [6-1] 2009-11-18 14:33:53
PSTLOG:  database system was not properly shut down; automatic
recovery in progress
Nov 18 14:33:53 frutestdb002 postgres[5667]: [7-1] 2009-11-18 14:33:53
PSTLOG:  redo starts at 2A/55F9D478
Nov 18 14:33:54 frutestdb002 postgres[5667]: [8-1] 2009-11-18 14:33:54
PSTLOG:  record with zero length at 2A/56692F38
Nov 18 14:33:54 frutestdb002 postgres[5667]: [9-1] 2009-11-18 14:33:54
PSTLOG:  redo done at 2A/56692F08
Nov 18 14:33:54 frutestdb002 postgres[5667]: [10-1] 2009-11-18
14:33:54 PSTLOG:  database system is ready

Thanks
Kenny

On Nov 13, 2009, at 1:35 PM, Kenny Gorman wrote:

> The FusionIO products are a little different.  They are card based
> vs trying to emulate a traditional disk.  In terms of volatility,
> they have an on-board capacitor that allows power to be supplied
> until all writes drain.  They do not have a cache in front of them
> like a disk-type SSD might.   I don't sell these things, I am just a
> fan.  I verified all this with the Fusion IO techs before I
> replied.  Perhaps older versions didn't have this functionality?  I
> am not sure.  I have already done some cold power off tests w/o
> problems, but I could up the workload a bit and retest.  I will do a
> couple of 'pull the cable' tests on monday or tuesday and report
> back how it goes.
>
> Re the performance #'s...  Here is my post:
>
> http://www.kennygorman.com/wordpress/?p=398
>
> -kg
>
>
> >In order for a drive to work reliably for database use such as for
> >PostgreSQL, it cannot have a volatile write cache.  You either need a
> >write cache with a battery backup (and a UPS doesn't count), or to
> turn
> >the cache off.  The SSD performance figures you've been looking at
> are
> >with the drive's write cache turned on, which means they're
> completely
> >fictitious and exaggerated upwards for your purposes.  In the real
> >world, that will result in database corruption after a crash one day.
> >No one on the drive benchmarking side of the industry seems to have
> >picked up on this, so you can't use any of those figures.  I'm not
> even
> >sure right now whether drives like Intel's will even meet their
> lifetime
> >expectations if they aren't allowed to use their internal volatile
> write
> >cache.
> >
> >Here's two links you should read and then reconsider your whole
> design:
> >
> >http://www.mysqlperformanceblog.com/2009/03/02/ssd-xfs-lvm-fsync-write-cache-barrier-and-lost-transactions/
> >http://petereisentraut.blogspot.com/2009/07/solid-state-drive-benchmarks-and-write.html
> >
> >I can't even imagine how bad the situation would be if you decide to
> >wander down the "use a bunch of really cheap SSD drives" path; these
> >things are barely usable for databases with Intel's hardware.  The
> needs
> >of people who want to throw SSD in a laptop and those of the
> enterprise
> >database market are really different, and if you believe doom
> >forecasting like the comments at
> >http://blogs.sun.com/BestPerf/entry/oracle_peoplesoft_payroll_sun_sparc
> >that gap is widening, not shrinking.
>
>


From:
Scott Carey
Date:



On 11/13/09 10:21 AM, "Karl Denninger" <> wrote:

>
> One caution for those thinking of doing this - the incremental
> improvement of this setup on PostGresql in WRITE SIGNIFICANT environment
> isn't NEARLY as impressive.  Indeed the performance in THAT case for
> many workloads may only be 20 or 30% faster than even "reasonably
> pedestrian" rotating media in a high-performance (lots of spindles and
> thus stripes) configuration and it's more expensive (by a lot.)  If you
> step up to the fast SAS drives on the rotating side there's little
> argument for the SSD at all (again, assuming you don't intend to "cheat"
> and risk data loss.)

For your database DATA disks, leaving the write cache on is 100% acceptable,
even with power loss, and without a RAID controller.  And even in high write
environments.

That is what the XLOG is for, isn't it?  That is where this behavior is
critical.  But that has completely different performance requirements and
need not be on the same volume, array, or drive.

>
> Know your application and benchmark it.
>
> -- Karl
>


From:
Scott Carey
Date:

On 11/15/09 12:46 AM, "Craig Ringer" <> wrote:
> Possible fixes for this are:
>
> - Don't let the drive lie about cache flush operations, ie disable write
> buffering.
>
> - Give Pg some way to find out, from the drive, when particular write
> operations have actually hit disk. AFAIK there's no such mechanism at
> present, and I don't think the drives are even capable of reporting this
> data. If they were, Pg would have to be capable of applying entries from
> the WAL "sparsely" to account for the way the drive's write cache
> commits changes out-of-order, and Pg would have to maintain a map of
> committed / uncommitted WAL records. Pg would need another map of
> tablespace blocks to WAL records to know, when a drive write cache
> commit notice came in, what record in what WAL archive was affected.
> It'd also require Pg to keep WAL archives for unbounded and possibly
> long periods of time, making disk space management for WAL much harder.
> So - "not easy" is a bit of an understatement here.

3:  Have PG wait a half second (configurable) after the checkpoint fsync()
completes before deleting/ overwriting any WAL segments.  This would be a
trivial "feature" to add to a postgres release, I think.  Actually, it
already exists!

Turn on log archiving, and have the script that it runs after a checkpoint
sleep().
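
A minimal sketch of that trick (the archive_command hook is standard PostgreSQL; the script name, archive path, and delay below are hypothetical):

```python
#!/usr/bin/env python
# Hypothetical WAL archive script; postgresql.conf would reference it as:
#   archive_command = 'python delay_archive.py "%p" "%f"'
import shutil
import sys
import time

ARCHIVE_DIR = "/var/lib/pgsql/wal_archive"  # assumption: adjust for your layout

def archive(src_path, file_name, archive_dir=ARCHIVE_DIR, delay=0.5):
    """Copy the finished WAL segment, then nap to give the drive's write
    cache extra time to drain before the server moves on."""
    shutil.copy(src_path, f"{archive_dir}/{file_name}")
    time.sleep(delay)  # the configurable pause described above
    return 0  # a non-zero exit would make PostgreSQL retry the segment

if __name__ == "__main__" and len(sys.argv) == 3:
    sys.exit(archive(sys.argv[1], sys.argv[2]))
```

Whether this actually closes the window is questioned later in the thread, since sleeping after the copy gives no hard guarantee about what the cache has flushed.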

BTW, the information I have seen indicates that the write cache is 256K on
the Intel drives, the 32MB/64MB of other RAM is working memory for the drive
block mapping / wear leveling algorithms (tracking 160GB of 4k blocks takes
space).

4: Yet another solution:  The drives DO adhere to write barriers properly.
A filesystem that used these in the process of fsync() would be fine too.
So XFS without LVM or MD (or the newer versions of those that don't ignore
barriers) would work too.

So, I think it may not be necessary to turn off write caching for the
non-xlog disks.

>
> You still need to turn off write caching.
>
> --
> Craig Ringer
>
>
> --
> Sent via pgsql-performance mailing list ()
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance
>


From:
Tom Lane
Date:

Scott Carey <> writes:
> For your database DATA disks, leaving the write cache on is 100% acceptable,
> even with power loss, and without a RAID controller.  And even in high write
> environments.

Really?  How hard have you tested that configuration?

> That is what the XLOG is for, isn't it?

Once we have fsync'd a data change, we discard the relevant XLOG
entries.  If the disk hasn't actually put the data on stable storage
before it claims the fsync is done, you're screwed.

XLOG only exists to centralize the writes that have to happen before
a transaction can be reported committed (in particular, to avoid a
lot of random-access writes at commit).  It doesn't make any
fundamental change in the rules of the game: a disk that lies about
write complete will still burn you.

In a zero-seek-cost environment I suspect that XLOG wouldn't actually
be all that useful.  I gather from what's been said earlier that SSDs
don't fully eliminate random-access penalties, though.

            regards, tom lane

From:
Scott Carey
Date:

On 11/17/09 10:51 AM, "Greg Smith" <> wrote:

> Merlin Moncure wrote:
>> I am right now talking to someone on postgresql irc who is measuring
>> 15k iops from x25-e and no data loss following power plug test.
> The funny thing about Murphy is that he doesn't visit when things are
> quiet.  It's quite possible the window for data loss on the drive is
> very small.  Maybe you only see it one out of 10 pulls with a very
> aggressive database-oriented write test.  Whatever the odd conditions
> are, you can be sure you'll see them when there's a bad outage in actual
> production though.

Yes, but nothing is foolproof.  Murphy visited me recently, and the
RAID card with BBU cache that the WAL logs were on crapped out.  Data was
fine.

Had to fix up the system without any WAL logs.  Luckily, out of 10TB, only
200GB or so could have been in the process of being written to (yay!
partitioning by date!), so we could restore just that part rather than
initiating a full restore.
Then there were fun times in single user mode fixing corrupted system tables
(about half the system indexes were dead, and the statistics table was
corrupt, but that could be truncated safely).

It's all fine now with all data validated.

Moral of the story:  Nothing is 100% safe, so sometimes a small bit of KNOWN
risk is perfectly fine.  There is always UNKNOWN risk.  If one risks losing
256K of cached data on an SSD if you're really unlucky with timing, how
dangerous is that versus the chance that the raid card or other hardware
barfs and takes out your whole WAL?

Nothing is safe enough to avoid a full DR plan of action.  The individual
tradeoffs are very application and data dependent.


>
> A good test program that is a bit better at introducing and detecting
> the write cache issue is described at
> http://brad.livejournal.com/2116715.html
>
> --
> Greg Smith    2ndQuadrant   Baltimore, MD
> PostgreSQL Training, Services and Support
>   www.2ndQuadrant.com
>
>
> --
> Sent via pgsql-performance mailing list ()
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance
>


From:
Scott Carey
Date:

On 11/17/09 10:58 PM, "" <> wrote:
>
> keep in mind that bonnie++ isn't always going to reflect your real
> performance.
>
> I have run tests on some workloads that were definantly I/O limited where
> bonnie++ results that differed by a factor of 10x made no measurable
> difference in the application performance, so I can easily believe in
> cases where bonnie++ numbers would not change but application performance
> could be drasticly different.
>

Well, that is sort of true for all benchmarks, but I do find that bonnie++
is the worst of the bunch.  I consider it relatively useless compared to
fio.  It's just not a great benchmark for server-type load, and I find it
lacking in the ability to simulate real applications.


> as always it can depend heavily on your workload. you really do need to
> figure out how to get your hands on one for your own testing.
>
> David Lang
>
> --
> Sent via pgsql-performance mailing list ()
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance
>


From:
Craig Ringer
Date:

On 19/11/2009 12:22 PM, Scott Carey wrote:

> 3:  Have PG wait a half second (configurable) after the checkpoint fsync()
> completes before deleting/ overwriting any WAL segments.  This would be a
> trivial "feature" to add to a postgres release, I think.

How does that help? It doesn't provide any guarantee that the data has
hit main storage - it could lurk in the SSD cache for hours.

> 4: Yet another solution:  The drives DO adhere to write barriers properly.
> A filesystem that used these in the process of fsync() would be fine too.
> So XFS without LVM or MD (or the newer versions of those that don't ignore
> barriers) would work too.

*if* the WAL is also on the SSD.

If the WAL is on a separate drive, the write barriers do you no good,
because they won't ensure that the data hits the main drive storage
before the WAL recycling hits the WAL disk storage. The two drives
operate independently and the write barriers don't interact.

You'd need some kind of inter-drive write barrier.

--
Craig Ringer

From:
Greg Smith
Date:

Scott Carey wrote:
> For your database DATA disks, leaving the write cache on is 100% acceptable,
> even with power loss, and without a RAID controller.  And even in high write
> environments.
>
> That is what the XLOG is for, isn't it?  That is where this behavior is
> critical.  But that has completely different performance requirements and
> need not bee on the same volume, array, or drive.
>
At checkpoint time, writes to the main data files are done that are
followed by fsync calls to make sure those blocks have been written to
disk.  Those writes have exactly the same consistency requirements as
the more frequent pg_xlog writes.  If the drive ACKs the write, but it's
not on physical disk yet, it's possible for the checkpoint to finish and
the underlying pg_xlog segments needed to recover from a crash at that
point to be deleted.  The end of the checkpoint can wipe out many WAL
segments, presuming they're not needed anymore because the data blocks
they were intended to fix during recovery are now guaranteed to be on disk.
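
That failure sequence can be shown with a toy model (purely illustrative; real checkpoints and WAL replay are vastly more involved):

```python
# Model: data pages live on a "disk" behind a volatile cache. An honest
# fsync flushes the cache; a lying disk ACKs the fsync but keeps the pages
# in cache. The checkpoint recycles WAL once fsync returns -- then power
# is lost and only what reached the platters survives.

def crash_after_checkpoint(disk_lies):
    wal = [("page0", "v2")]            # the change was recorded in WAL first
    disk, cache = {"page0": "v1"}, {}

    cache["page0"] = "v2"              # checkpoint writes the dirty page
    if not disk_lies:
        disk.update(cache)             # honest fsync: data reaches the platters
    cache.clear()                      # either way, fsync "returns OK"...
    wal.clear()                        # ...so the checkpoint recycles the WAL

    # power pull: the volatile cache is gone; recovery replays remaining WAL
    for page, version in wal:
        disk[page] = version
    return disk["page0"]

assert crash_after_checkpoint(disk_lies=False) == "v2"  # recoverable
assert crash_after_checkpoint(disk_lies=True) == "v1"   # silent data loss
```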

--
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
  www.2ndQuadrant.com


From:
Karl Denninger
Date:

Greg Smith wrote:
> Scott Carey wrote:
>> For your database DATA disks, leaving the write cache on is 100%
>> acceptable,
>> even with power loss, and without a RAID controller.  And even in
>> high write
>> environments.
>>
>> That is what the XLOG is for, isn't it?  That is where this behavior is
>> critical.  But that has completely different performance requirements
>> and
>> need not bee on the same volume, array, or drive.
>>
> At checkpoint time, writes to the main data files are done that are
> followed by fsync calls to make sure those blocks have been written to
> disk.  Those writes have exactly the same consistency requirements as
> the more frequent pg_xlog writes.  If the drive ACKs the write, but
> it's not on physical disk yet, it's possible for the checkpoint to
> finish and the underlying pg_xlog segments needed to recover from a
> crash at that point to be deleted.  The end of the checkpoint can wipe
> out many WAL segments, presuming they're not needed anymore because
> the data blocks they were intended to fix during recovery are now
> guaranteed to be on disk.
Guys, read that again.

IF THE DISK OR DRIVER ACK'S A FSYNC CALL THE WAL ENTRY IS LIKELY GONE,
AND YOU ARE SCREWED IF THE DATA IS NOT REALLY ON THE DISK.

-- Karl

From:
Greg Smith
Date:

Scott Carey wrote:
> Moral of the story:  Nothing is 100% safe, so sometimes a small bit of KNOWN
> risk is perfectly fine.  There is always UNKNOWN risk.  If one risks losing
> 256K of cached data on an SSD if you're really unlucky with timing, how
> dangerous is that versus the chance that the raid card or other hardware
> barfs and takes out your whole WAL?
>
I think the point of the paranoia in this thread is that if you're
introducing a component with a known risk in it, you're really asking
for trouble because (as you point out) it's hard enough to keep a system
running just through the unexpected ones that shouldn't have happened at
all.  No need to make that even harder by introducing something that is
*known* to fail under some conditions.

--
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
  www.2ndQuadrant.com


From:
Merlin Moncure
Date:

On Wed, Nov 18, 2009 at 11:39 PM, Scott Carey <> wrote:
> Well, that is sort of true for all benchmarks, but I do find that bonnie++
> is the worst of the bunch.  I consider it relatively useless compared to
> fio.  Its just not a great benchmark for server type load and I find it
> lacking in the ability to simulate real applications.

I agree.   My biggest gripe with bonnie actually is that 99% of the
time is spent measuring sequential performance, which is not that
important in the database world.  A dedicated WAL volume does essentially
sequential I/O, but it's fairly difficult to outrun one even if it's on a
vanilla SATA drive.

pgbench is actually a pretty awesome i/o tester assuming you have big
enough scaling factor, because:
a) it's much closer to the environment you will actually run in
b) you get to see what effect I/O-related options have on the load
c) you have broad array of options regarding what gets done (select
only, -f, etc)
d) once you build the test database, you can do multiple runs without
rebuilding it

merlin

From:
Scott Marlowe
Date:

On Thu, Nov 19, 2009 at 10:01 AM, Merlin Moncure <> wrote:
> On Wed, Nov 18, 2009 at 11:39 PM, Scott Carey <> wrote:
>> Well, that is sort of true for all benchmarks, but I do find that bonnie++
>> is the worst of the bunch.  I consider it relatively useless compared to
>> fio.  Its just not a great benchmark for server type load and I find it
>> lacking in the ability to simulate real applications.
>
> I agree.   My biggest gripe with bonnie actually is that 99% of the
> time is spent measuring in sequential tests which is not that
> important in the database world.  Dedicated wal volume uses ostensibly
> sequential io, but it's fairly difficult to outrun a dedicated wal
> volume even if it's on a vanilla sata drive.
>
> pgbench is actually a pretty awesome i/o tester assuming you have big
> enough scaling factor, because:
> a) it's much closer to the environment you will actually run in
> b) you get to see what i/o affecting options have on the load
> c) you have broad array of options regarding what gets done (select
> only, -f, etc)
> d) once you build the test database, you can do multiple runs without
> rebuilding it

Seeing as how pgbench only goes to a scaling factor of 4000, are there any
plans on enlarging that number?

From:
Anton Rommerskirchen
Date:

On Thursday, 19 November 2009 at 13:29:56, Craig Ringer wrote:
> On 19/11/2009 12:22 PM, Scott Carey wrote:
> > 3:  Have PG wait a half second (configurable) after the checkpoint
> > fsync() completes before deleting/ overwriting any WAL segments.  This
> > would be a trivial "feature" to add to a postgres release, I think.
>
> How does that help? It doesn't provide any guarantee that the data has
> hit main storage - it could lurk in SDD cache for hours.
>
> > 4: Yet another solution:  The drives DO adhere to write barriers
> > properly. A filesystem that used these in the process of fsync() would be
> > fine too. So XFS without LVM or MD (or the newer versions of those that
> > don't ignore barriers) would work too.
>
> *if* the WAL is also on the SSD.
>
> If the WAL is on a separate drive, the write barriers do you no good,
> because they won't ensure that the data hits the main drive storage
> before the WAL recycling hits the WAL disk storage. The two drives
> operate independently and the write barriers don't interact.
>
> You'd need some kind of inter-drive write barrier.
>
> --
> Craig Ringer


Hello !

as I understand this:
SSD performance is great, but caching is the problem.

questions:

1. what about conventional disks with 32/64 MB caches? how do they handle the
plug test if their caches are on?

2. what about using a separate power supply for the disks? is it possible to
write back the cache after switching the SATA drive to another machine/controller?

3. what about making a statement about the lacking enterprise feature (i.e.
an emergency-battery-equipped SSD) and submitting it to the producers?

I found that one of them (OCZ) seems to act on suggestions from customers
(see the write speed discussions on the Vertex, for example),

and another (Intel) seems to handle serious problems with its disks by
rewriting and sometimes redesigning its products - if you tell them and the
market dictates that they react (see the performance degradation before the
1.11 firmware).

perhaps it's time to act and not only to complain about the facts.

(btw: I found funny bonnie++ results for my Intel 160 GB Postville and my
Samsung PB22 after using the Samsung for approx. 3 months now ... my
conclusion: NOT all SSDs are equal ...)

best regards

anton

--

ATRSoft GmbH
Bivetsweg 12
D 41542 Dormagen
Deutschland
Tel .: +49(0)2182 8339951
Mobil: +49(0)172 3490817

Geschäftsführer Anton Rommerskirchen

Köln HRB 44927
STNR 122/5701 - 2030
USTID DE213791450

From:
Brad Nicholson
Date:

On Thu, 2009-11-19 at 19:01 +0100, Anton Rommerskirchen wrote:
> On Thursday, 19 November 2009 at 13:29:56, Craig Ringer wrote:
> > On 19/11/2009 12:22 PM, Scott Carey wrote:
> > > 3:  Have PG wait a half second (configurable) after the checkpoint
> > > fsync() completes before deleting/ overwriting any WAL segments.  This
> > > would be a trivial "feature" to add to a postgres release, I think.
> >
> > How does that help? It doesn't provide any guarantee that the data has
> > hit main storage - it could lurk in SDD cache for hours.
> >
> > > 4: Yet another solution:  The drives DO adhere to write barriers
> > > properly. A filesystem that used these in the process of fsync() would be
> > > fine too. So XFS without LVM or MD (or the newer versions of those that
> > > don't ignore barriers) would work too.
> >
> > *if* the WAL is also on the SSD.
> >
> > If the WAL is on a separate drive, the write barriers do you no good,
> > because they won't ensure that the data hits the main drive storage
> > before the WAL recycling hits the WAL disk storage. The two drives
> > operate independently and the write barriers don't interact.
> >
> > You'd need some kind of inter-drive write barrier.
> >
> > --
> > Craig Ringer
>
>
> Hello !
>
> as I understand this:
> SSD performance is great, but caching is the problem.
>
> questions:
>
> 1. what about conventional disks with 32/64 MB caches? how do they handle the
> plug test if their caches are on?

If they aren't battery backed, they can lose data.  This is not specific
to SSD.

> 2. what about using a separate power supply for the disks? is it possible to
> write back the cache after switching the SATA drive to another machine/controller?

Not sure.  I only use devices with battery-backed caches or no cache.  I
would be concerned, however, about the drive running out of power before
it finishes flushing its cache.

> 3. what about making a statement about the lacking enterprise feature (i.e.
> an emergency-battery-equipped SSD) and submitting it to the producers?

The producers aren't making Enterprise products, they are using caches
to accelerate the speeds of consumer products to make their drives more
appealing to consumers.  They aren't going to slow them down to make
them more reliable, especially when the core consumer doesn't know about
this issue, and is even less likely to understand it if explained.

They may stamp the word Enterprise on them, but it's nothing more than
marketing.

> I found that one of them (OCZ) seems to act on suggestions from customers
> (see the write speed discussions on the Vertex, for example),
>
> and another (Intel) seems to handle serious problems with its disks by
> rewriting and sometimes redesigning its products - if you tell them and the
> market dictates that they react (see the performance degradation before the
> 1.11 firmware).
>
> perhaps it's time to act and not only to complain about the facts.

Or, you could just buy higher quality equipment that was designed with
this in mind.

There is nothing unique to SSD here IMHO.  I wouldn't run my production
grade databases on consumer grade HDD, I wouldn't run them on consumer
grade SSD either.


--
Brad Nicholson  416-673-4106
Database Administrator, Afilias Canada Corp.



From:
Greg Smith
Date:

Scott Carey wrote:
> Have PG wait a half second (configurable) after the checkpoint fsync()
> completes before deleting/ overwriting any WAL segments.  This would be a
> trivial "feature" to add to a postgres release, I think.  Actually, it
> already exists!  Turn on log archiving, and have the script that it runs after a checkpoint sleep().
>
That won't help.  Once the checkpoint is done, the problem isn't just
that the WAL segments are recycled.  The server isn't going to use them
even if they were there.  The reason why you can erase/recycle them is
that you're doing so *after* writing out a checkpoint record that says
you don't have to ever look at them again.  What you'd actually have to
do is hack the server code to insert that delay after every fsync--there
are none that you can cheat on and not introduce a corruption
possibility.  The whole WAL/recovery mechanism in PostgreSQL doesn't
make a lot of assumptions about what the underlying disk has to actually
do beyond the fsync requirement; the flip side to that robustness is
that it's the one you can't ever violate safely.
> BTW, the information I have seen indicates that the write cache is 256K on
> the Intel drives, the 32MB/64MB of other RAM is working memory for the drive
> block mapping / wear leveling algorithms (tracking 160GB of 4k blocks takes
> space).
>
Right.  It's not used like the write-cache on a regular hard drive,
where they're buffering 8MB-32MB worth of writes just to keep seek
overhead down.  It's there primarily to allow combining writes into
large chunks, to better match the block size of the underlying SSD flash
cells (128K).  Having enough space for two full cells allows spooling
out the flash write to a whole block while continuing to buffer the next
one.

This is why turning the cache off can tank performance so badly--you're
going to be writing a whole 128K block no matter what if it's forced to
disk without caching, even if it's just to write an 8K page to it.
That's only going to reach 1/16 of the usual write speed on single page
writes.  And that's why you should also be concerned at whether
disabling the write cache impacts the drive longevity, lots of small
writes going out in small chunks is going to wear flash out much faster
than if the drive is allowed to wait until it's got a full sized block
to write every time.
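
The arithmetic behind that worst case is simple, taking the 128K cell size mentioned above as given:

```python
flash_block = 128 * 1024  # bytes written per flash program operation
pg_page = 8 * 1024        # one PostgreSQL page

# With the cache off, each 8K page forces a full 128K cell write, so the
# useful fraction of each write -- roughly the fraction of streaming speed
# reachable on single-page writes -- is the payload ratio:
efficiency = pg_page / flash_block        # 1/16
wear_multiplier = flash_block // pg_page  # 16x more flash cycled per byte
print(f"single-page efficiency: {efficiency:.4f}, wear amplification: {wear_multiplier}x")
```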

The fact that the cache is so small is also why it's harder to catch the
drive doing the wrong thing here.  The plug test is pretty sensitive to
a problem when you've got megabytes worth of cached writes that are
spooling to disk at spinning hard drive speeds.  The window for loss on
a SSD with no seek overhead and only a moderate number of KB worth of
cached data is much, much smaller.  Doesn't mean it's gone though.  It's
a shame that the design wasn't improved just a little bit; a cheap
capacitor and blocking new writes once the incoming power dropped is all
it would take to make these much more reliable for database use.  But
that would raise the price, and not really help anybody but the small
subset of the market that cares about durable writes.
> 4: Yet another solution:  The drives DO adhere to write barriers properly.
> A filesystem that used these in the process of fsync() would be fine too.
> So XFS without LVM or MD (or the newer versions of those that don't ignore
> barriers) would work too.
>
If I really trusted anything beyond the very basics of the filesystem to
really work well on Linux, this whole issue would be moot for most of
the production deployments I do.  Ideally, fsync would just push out the
minimum of what's needed, it would call the appropriate write cache
flush mechanism the way the barrier implementation does when that all
works, life would be good.  Alternately, you might even switch to using
O_SYNC writes instead, which on a good filesystem implementation are
both accelerated and safe compared to write/fsync (I've seen that work
as expected on Veritas VxFS for example).

Meanwhile, in the actual world we live, patches that make writes more
durable by default are dropped by the Linux community because they tank
performance for too many types of loads, I'm frightened to turn on
O_SYNC at all on ext3 because of reports of corruption on the lists
here, fsync does way more work than it needs to, and the way the
filesystem and block drivers have been separated makes it difficult to
do any sort of device write cache control from userland.  This is why I
try to use the simplest, best tested approach out there whenever possible.

--
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
  www.2ndQuadrant.com


From:
Greg Smith
Date:

Scott Marlowe wrote:
> On Thu, Nov 19, 2009 at 10:01 AM, Merlin Moncure <> wrote:
>
>> pgbench is actually a pretty awesome i/o tester assuming you have big
>> enough scaling factor
> Seeing as how pgbench only goes to a scaling factor of 4000, are there any
> plans on enlarging that number?
>
I'm doing pgbench tests now on a system large enough for this limit to
matter, so I'm probably going to have to fix that for 8.5 just to
complete my own work.

You can use pgbench to either get interesting peak read results, or peak
write ones, but it's not real useful for things in between.  The
standard test basically turns into a huge stack of writes to a single
table, and the select-only one is interesting to gauge either cached or
uncached read speed (depending on the scale).  It's not very useful for
getting a feel for how something with a mixed read/write workload does
though, which is unfortunate because I think that scenario is much more
common than what it does test.

--
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
  www.2ndQuadrant.com


From:
Merlin Moncure
Date:

On Thu, Nov 19, 2009 at 4:10 PM, Greg Smith <> wrote:
> You can use pgbench to either get interesting peak read results, or peak
> write ones, but it's not real useful for things in between.  The standard
> test basically turns into a huge stack of writes to a single table, and the
> select-only one is interesting to gauge either cached or uncached read speed
> (depending on the scale).  It's not very useful for getting a feel for how
> something with a mixed read/write workload does though, which is unfortunate
> because I think that scenario is much more common than what it does test.

all true, but it's pretty easy to rig custom (-f) commands for
virtually any test you want.
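
For instance, here's a minimal sketch of such a custom script: a simple
read/update mix against the standard pgbench tables.  The 1..100000 :aid
range assumes scaling factor 1, and the database name "pgbench" is an
assumption; adjust both for your setup.

```shell
# Write a tiny custom workload file; the table and column names come
# from the standard pgbench schema.
cat > mixed.sql <<'EOF'
\setrandom aid 1 100000
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
UPDATE pgbench_accounts SET abalance = abalance + 1 WHERE aid = :aid;
EOF
# Run it against an already-initialized pgbench database (commented
# here since it needs a running server):
# pgbench -f mixed.sql -c 8 -T 60 pgbench
```

Passing several -f scripts makes pgbench pick among them at random,
which is one way to approximate a mixed read/write workload.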

merlin

From:
Scott Marlowe
Date:

On Thu, Nov 19, 2009 at 2:39 PM, Merlin Moncure <> wrote:
> On Thu, Nov 19, 2009 at 4:10 PM, Greg Smith <> wrote:
>> You can use pgbench to either get interesting peak read results, or peak
>> write ones, but it's not real useful for things in between.  The standard
>> test basically turns into a huge stack of writes to a single table, and the
>> select-only one is interesting to gauge either cached or uncached read speed
>> (depending on the scale).  It's not very useful for getting a feel for how
>> something with a mixed read/write workload does though, which is unfortunate
>> because I think that scenario is much more common than what it does test.
>
> all true, but it's pretty easy to rig custom (-f) commands for
> virtually any test you want.

My primary use of pgbench is to exercise a machine as part of
acceptance testing.  After using it to do power-plug pulls, I run it
for a week or two to exercise the drive array and controller mainly.
If a machine runs smoothly for a week at a load factor of 20 or 30,
and the volume of updates that pgbench generates doesn't overwhelm it,
I'm pretty happy.
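
A burn-in along those lines can be scripted as below; the scale factor,
client count, and database name are assumptions, and the pgbench lines
are commented out since they need a running server.

```shell
# Two weeks of wall-clock runtime for a pgbench burn-in.
SECS=$((14 * 24 * 3600))
echo "burn-in duration: $SECS seconds"
# Initialize roughly 15 GB of test data (about 15 MB per scale unit):
# pgbench -i -s 1000 burnin
# Hammer it with 32 clients for the whole window:
# pgbench -c 32 -T "$SECS" burnin
```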

From:
Axel Rau
Date:

Am 13.11.2009 um 14:57 schrieb Laszlo Nagy:

> I was thinking about ARECA 1320 with 2GB memory + BBU.
> Unfortunately, I cannot find information about using ARECA cards
> with SSD drives.
They told me: currently not supported, but they have positive customer
reports. No date yet for implementation of the TRIM command in firmware.
...
> My other option is to buy two SLC SSD drives and use RAID1. It would
> cost about the same, but has less redundancy and less capacity.
> Which is the faster? 8-10 MLC disks in RAID 6 with a good caching
> controller, or two SLC disks in RAID1?
I just went the MLC path with X25-Ms, mainly to save energy.
The freshly assembled box has one SSD for WAL and one RAID 0 of four
SSDs as table space.
Everything runs smoothly on an Areca 1222 with BBU, which turned all
write caches off.
OS is FreeBSD 8.0. I aligned all partitions on 1 MB boundaries.
Next week I will install 8.4.1 and run pgbench for pull-the-plug
testing.
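
For reference, 1 MB alignment on FreeBSD can be done with gpart by
starting partitions at sector 2048; the device name ada0 and the GPT
scheme below are assumptions, and the destructive commands are left
commented.

```shell
# 2048 sectors of 512 bytes each puts the partition start at exactly 1 MB:
echo $((2048 * 512))
# gpart create -s gpt ada0
# gpart add -t freebsd-ufs -b 2048 ada0
```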

I would like to get some advice from the list for testing the SSDs!

Axel
---
  PGP-Key:29E99DD6  +49 151 2300 9283  computing @
chaos claudius


From:
Matthew Wakeling
Date:

On Thu, 19 Nov 2009, Greg Smith wrote:
> This is why turning the cache off can tank performance so badly--you're going
> to be writing a whole 128K block no matter what if it's forced to disk without
> caching, even if it's just to write an 8K page to it.

Theoretically, this does not need to be the case. Now, I don't know what
the Intel drives actually do, but remember that for flash, it is the
*erase* cycle that has to be done in large blocks. Writing itself can be
done in small blocks, to previously erased sites.

The technology for combining small writes into sequential writes has been
around for 17 years or so (see
http://portal.acm.org/citation.cfm?id=146943&dl= ), so there really isn't any
excuse for modern flash drives not to give really fast small writes.

Matthew

--
 for a in past present future; do
   for b in clients employers associates relatives neighbours pets; do
   echo "The opinions here in no way reflect the opinions of my $a $b."
 done; done

From:
Jeff Janes
Date:

On Wed, Nov 18, 2009 at 8:24 PM, Tom Lane <> wrote:
> Scott Carey <> writes:
>> For your database DATA disks, leaving the write cache on is 100% acceptable,
>> even with power loss, and without a RAID controller.  And even in high write
>> environments.
>
> Really?  How hard have you tested that configuration?
>
>> That is what the XLOG is for, isn't it?
>
> Once we have fsync'd a data change, we discard the relevant XLOG
> entries.  If the disk hasn't actually put the data on stable storage
> before it claims the fsync is done, you're screwed.
>
> XLOG only exists to centralize the writes that have to happen before
> a transaction can be reported committed (in particular, to avoid a
> lot of random-access writes at commit).  It doesn't make any
> fundamental change in the rules of the game: a disk that lies about
> write complete will still burn you.
>
> In a zero-seek-cost environment I suspect that XLOG wouldn't actually
> be all that useful.

You would still need it to guard against partial page writes, unless
we have some guarantee that those can't happen.

And once your transaction has scattered its transaction id into
various xmin and xmax over many tables, you need an atomic, durable
repository to decide if that id has or has not committed.  Maybe clog
fsynced on commit would serve this purpose?

Jeff

From:
Richard Neill
Date:

Axel Rau wrote:
>
> Am 13.11.2009 um 14:57 schrieb Laszlo Nagy:
>
>> I was thinking about ARECA 1320 with 2GB memory + BBU. Unfortunately,
>> I cannot find information about using ARECA cards with SSD drives.
> They told me: currently not supported, but they have positive customer
> reports. No date yet for implementation of the TRIM command in firmware.
> ...
>> My other option is to buy two SLC SSD drives and use RAID1. It would
>> cost about the same, but has less redundancy and less capacity. Which
>> is the faster? 8-10 MLC disks in RAID 6 with a good caching
>> controller, or two SLC disks in RAID1?

Despite my other problems, I've found that the Intel X25-Es work
remarkably well. The key issue for short, fast transactions seems to be
how fast an fdatasync() call can run, forcing the commit to disk and
allowing the transaction to return to userspace.
With all the caches off, the Intel X25-E beat a standard disk by a
factor of about 10.
Attached is a short C program which may be of use.


For what it's worth, we have actually got a pretty decent (and
redundant) setup using a RAIS array of RAID1.


[primary server]

SSD }
    }  RAID1  -------------------}  DRBD --- /var/lib/postgresql
SSD }                            }
                                 }
[secondary server]               }
                                 }
SSD }                            }
    }  RAID1  -------gigE--------}
SSD }



The servers connect back-to-back with a dedicated Gigabit ethernet
cable, and DRBD is running in protocol B.

We can pull the power out of 1 server, and be using the next within 30
seconds, and with no dataloss.


Richard



#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

#define NUM_ITER 1024

int main ( int argc, char **argv ) {
    const char data[] = "Liberate";
    size_t data_len = strlen ( data );
    const char *filename;
    int fd;
    unsigned int i;

    if ( argc != 2 ) {
        fprintf ( stderr, "Syntax: %s output_file\n", argv[0] );
        exit ( 1 );
    }
    filename = argv[1];
    fd = open ( filename, ( O_WRONLY | O_CREAT | O_EXCL ), 0666 );
    if ( fd < 0 ) {
        fprintf ( stderr, "Could not create \"%s\": %s\n",
              filename, strerror ( errno ) );
        exit ( 1 );
    }

    for ( i = 0 ; i < NUM_ITER ; i++ ) {
        if ( write ( fd, data, data_len ) != data_len ) {
            fprintf ( stderr, "Could not write: %s\n",
                  strerror ( errno ) );
            exit ( 1 );
        }
        if ( fdatasync ( fd ) != 0 ) {
            fprintf ( stderr, "Could not fdatasync: %s\n",
                  strerror ( errno ) );
            exit ( 1 );
        }
    }
    return 0;
}


From:
Greg Smith
Date:

Richard Neill wrote:
> The key issue for short, fast transactions seems to be
> how fast an fdatasync() call can run, forcing the commit to disk, and
> allowing the transaction to return to userspace.
> Attached is a short C program which may be of use.
Right.  I call this the "commit rate" of the storage, and on traditional
spinning disks it's slightly below the rotation speed of the media (i.e.
7200RPM = 120 commits/second).    If you've got a battery-backed cache
in front of standard disks, you can easily clear 10K commits/second.

I normally test that out with sysbench, because I use that for some
other tests anyway:

sysbench --test=fileio --file-fsync-freq=1 --file-num=1
--file-total-size=16384 --file-test-mode=rndwr run | grep "Requests/sec"
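
A rough cross-check without sysbench is GNU dd with oflag=dsync, which
syncs after every write; commits/second is then the write count divided
by the elapsed time.  Note that oflag=dsync is a GNU coreutils
extension, so this won't work with BSD dd.

```shell
# Time 1000 synchronous 8 kB writes; on a 7200RPM disk with its write
# cache off this should take close to 1000/120 ~= 8 seconds.
dd if=/dev/zero of=commit_test bs=8k count=1000 oflag=dsync 2>&1 | tail -n 1
rm -f commit_test
```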

--
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
  www.2ndQuadrant.com


From:
Merlin Moncure
Date:

On Fri, Nov 20, 2009 at 7:27 PM, Greg Smith <> wrote:
> Richard Neill wrote:
>>
>> The key issue for short, fast transactions seems to be
>> how fast an fdatasync() call can run, forcing the commit to disk, and
>> allowing the transaction to return to userspace.
>> Attached is a short C program which may be of use.
>
> Right.  I call this the "commit rate" of the storage, and on traditional
> spinning disks it's slightly below the rotation speed of the media (i.e.
> 7200RPM = 120 commits/second).    If you've got a battery-backed cache in
> front of standard disks, you can easily clear 10K commits/second.


...until you overflow the cache.  A battery-backed cache does not break
the laws of physics... it just provides a higher burst rate (plus
whatever advantage can be gained by peeking into the write queue and
re-arranging/grouping writes).  I learned the hard way that how your
RAID controller behaves in overflow situations can cause catastrophic
performance degradation...

merlin

From:
Bruce Momjian
Date:

Greg Smith wrote:
> Merlin Moncure wrote:
> > I am right now talking to someone on postgresql irc who is measuring
> > 15k iops from x25-e and no data loss following power plug test.
> The funny thing about Murphy is that he doesn't visit when things are
> quiet.  It's quite possible the window for data loss on the drive is
> very small.  Maybe you only see it one out of 10 pulls with a very
> aggressive database-oriented write test.  Whatever the odd conditions
> are, you can be sure you'll see them when there's a bad outage in actual
> production though.
>
> A good test program that is a bit better at introducing and detecting
> the write cache issue is described at
> http://brad.livejournal.com/2116715.html

Wow, I had not seen that tool before.  I have added a link to it from
our documentation, and also added a mention of our src/tools/fsync test
tool to our docs.

--
  Bruce Momjian  <>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
Index: doc/src/sgml/config.sgml
===================================================================
RCS file: /cvsroot/pgsql/doc/src/sgml/config.sgml,v
retrieving revision 1.233
diff -c -c -r1.233 config.sgml
*** doc/src/sgml/config.sgml    13 Nov 2009 22:43:39 -0000    1.233
--- doc/src/sgml/config.sgml    28 Nov 2009 16:12:46 -0000
***************
*** 1432,1437 ****
--- 1432,1439 ----
          The default is the first method in the above list that is supported
          by the platform.
          The <literal>open_</>* options also use <literal>O_DIRECT</> if available.
+         The utility <filename>src/tools/fsync</> in the PostgreSQL source tree
+         can do performance testing of various fsync methods.
          This parameter can only be set in the <filename>postgresql.conf</>
          file or on the server command line.
         </para>
Index: doc/src/sgml/wal.sgml
===================================================================
RCS file: /cvsroot/pgsql/doc/src/sgml/wal.sgml,v
retrieving revision 1.59
diff -c -c -r1.59 wal.sgml
*** doc/src/sgml/wal.sgml    9 Apr 2009 16:20:50 -0000    1.59
--- doc/src/sgml/wal.sgml    28 Nov 2009 16:12:57 -0000
***************
*** 86,91 ****
--- 86,93 ----
     ensure data integrity.  Avoid disk controllers that have non-battery-backed
     write caches.  At the drive level, disable write-back caching if the
     drive cannot guarantee the data will be written before shutdown.
+    You can test for reliable I/O subsystem behavior using <ulink
+    url="http://brad.livejournal.com/2116715.html">diskchecker.pl</ulink>.
    </para>

    <para>

From:
Ron Mayer
Date:

Bruce Momjian wrote:
> Greg Smith wrote:
>> A good test program that is a bit better at introducing and detecting
>> the write cache issue is described at
>> http://brad.livejournal.com/2116715.html
>
> Wow, I had not seen that tool before.  I have added a link to it from
> our documentation, and also added a mention of our src/tools/fsync test
> tool to our docs.

One challenge with many of these test programs is that some
filesystems (ext3 is one) will flush drive caches on fsync()
*sometimes*, but not always.  If your test program happens to do
a sequence of commands that makes an fsync() actually flush a
disk's caches, it might mislead you if your actual application
has a different series of system calls.

For example, ext3's fsync() will issue write barrier commands
if the inode was modified, but not if the inode wasn't.

See test program here:
http://www.mail-archive.com//msg272253.html
and read two paragraphs further to see how touching
the inode makes ext3 fsync behave differently.




From:
Bruce Momjian
Date:

Ron Mayer wrote:
> Bruce Momjian wrote:
> > Greg Smith wrote:
> >> A good test program that is a bit better at introducing and detecting
> >> the write cache issue is described at
> >> http://brad.livejournal.com/2116715.html
> >
> > Wow, I had not seen that tool before.  I have added a link to it from
> > our documentation, and also added a mention of our src/tools/fsync test
> > tool to our docs.
>
> One challenge with many of these test programs is that some
> filesystem (ext3 is one) will flush drive caches on fsync()
> *sometimes, but not always.   If your test program happens to do
> a sequence of commands that makes an fsync() actually flush a
> disk's caches, it might mislead you if your actual application
> has a different series of system calls.
>
> For example, ext3 fsync() will issue write barrier commands
> if the inode was modified; but not if the inode wasn't.
>
> See test program here:
> http://www.mail-archive.com//msg272253.html
> and read two paragraphs further to see how touching
> the inode makes ext3 fsync behave differently.

I thought our only problem was testing the I/O subsystem --- I never
suspected the file system might lie too.  That email indicates that a
large percentage of our install base is running on unreliable file
systems --- why have I not heard about this before?  Do the write
barriers allow data loss but prevent data inconsistency?  It sounds like
they are effectively running with synchronous_commit = off.

--
  Bruce Momjian  <>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

From:
Greg Smith
Date:

Bruce Momjian wrote:
> I thought our only problem was testing the I/O subsystem --- I never
> suspected the file system might lie too.  That email indicates that a
> large percentage of our install base is running on unreliable file
> systems --- why have I not heard about this before?  Do the write
> barriers allow data loss but prevent data inconsistency?  It sound like
> they are effectively running with synchronous_commit = off.
>
You might occasionally catch me ranting here that Linux write barriers
are not a useful solution at all for PostgreSQL, and that you must turn
the disk write cache off rather than expect the barrier implementation
to do the right thing.  This sort of bugginess is why.  The reason it
doesn't bite more people is that most Linux systems don't turn on write
barrier support by default, and there are a number of situations that can
disable barriers even if you did try to enable them.  It's still pretty
unusual to have a working system with barriers turned on nowadays; I
really doubt it's "a large percentage of our install base".
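
For completeness, here is how the drive write cache is usually inspected
and disabled on Linux.  The device name /dev/sda is an assumption, and
the commands need root, so they are shown commented.

```shell
# ATA/SATA drives: hdparm reads and sets the volatile write-cache flag.
# hdparm -W /dev/sda       # show the current write-cache setting
# hdparm -W 0 /dev/sda     # disable the write cache
# SCSI/SAS drives: sdparm can clear the WCE (write cache enable) bit.
# sdparm --clear=WCE /dev/sda
```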

I've started keeping most of my notes about where ext3 is vulnerable to
issues in Wikipedia, specifically
http://en.wikipedia.org/wiki/Ext3#No_checksumming_in_journal ; I just
updated that section to point out the specific issue Ron pointed out.
Maybe we should point people toward that in the docs, I try to keep that
article correct.

--
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
  www.2ndQuadrant.com


From:
Bruce Momjian
Date:

Greg Smith wrote:
> Bruce Momjian wrote:
> > I thought our only problem was testing the I/O subsystem --- I never
> > suspected the file system might lie too.  That email indicates that a
> > large percentage of our install base is running on unreliable file
> > systems --- why have I not heard about this before?  Do the write
> > barriers allow data loss but prevent data inconsistency?  It sound like
> > they are effectively running with synchronous_commit = off.
> >
> You might occasionally catch me ranting here that Linux write barriers
> are not a useful solution at all for PostgreSQL, and that you must turn
> the disk write cache off rather than expect the barrier implementation
> to do the right thing.  This sort of buginess is why.  The reason why it
> doesn't bite more people is that most Linux systems don't turn on write
> barrier support by default, and there's a number of situations that can
> disable barriers even if you did try to enable them.  It's still pretty
> unusual to have a working system with barriers turned on nowadays; I
> really doubt it's "a large percentage of our install base".

Ah, so it is only when write barriers are enabled, and they are not
enabled by default --- OK, that makes sense.

> I've started keeping most of my notes about where ext3 is vulnerable to
> issues in Wikipedia, specifically
> http://en.wikipedia.org/wiki/Ext3#No_checksumming_in_journal ; I just
> updated that section to point out the specific issue Ron pointed out.
> Maybe we should point people toward that in the docs, I try to keep that
> article correct.

Yes, good idea.

--
  Bruce Momjian  <>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

From:
Ron Mayer
Date:

Bruce Momjian wrote:
>> For example, ext3 fsync() will issue write barrier commands
>> if the inode was modified; but not if the inode wasn't.
>>
>> See test program here:
>> http://www.mail-archive.com//msg272253.html
>> and read two paragraphs further to see how touching
>> the inode makes ext3 fsync behave differently.
>
> I thought our only problem was testing the I/O subsystem --- I never
> suspected the file system might lie too.  That email indicates that a
> large percentage of our install base is running on unreliable file
> systems --- why have I not heard about this before?

It came up on these lists a few times in the past.  Here's one example:
http://archives.postgresql.org/pgsql-performance/2008-08/msg00159.php

As far as I can tell, most of the threads ended with people still
suspecting lying hard drives.  But to the best of my ability I can't
find any drives that actually lie when sent the commands to flush
their caches.  What I can find are various combinations of ext3 and
Linux MD that decide not to send the IDE FLUSH_CACHE_EXT (or the
similar SCSI SYNCHRONIZE CACHE command) in various situations.

I wonder if there are enough ext3 users out there that postgres should
touch the inodes before doing an fsync.

> Do the write barriers allow data loss but prevent data inconsistency?

If I understand right, data inconsistency could occur too.  One
aspect of the write barriers is flushing a hard drive's caches.

> It sound like they are effectively running with synchronous_commit = off.

And with the (mythical?) hard drive with lying caches.



From:
Ron Mayer
Date:

Bruce Momjian wrote:
> Greg Smith wrote:
>> Bruce Momjian wrote:
>>> I thought our only problem was testing the I/O subsystem --- I never
>>> suspected the file system might lie too.  That email indicates that a
>>> large percentage of our install base is running on unreliable file
>>> systems --- why have I not heard about this before?
>>>
>> The reason why it
>> doesn't bite more people is that most Linux systems don't turn on write
>> barrier support by default, and there's a number of situations that can
>> disable barriers even if you did try to enable them.  It's still pretty
>> unusual to have a working system with barriers turned on nowadays; I
>> really doubt it's "a large percentage of our install base".
>
> Ah, so it is only when write barriers are enabled, and they are not
> enabled by default --- OK, that makes sense.

The test program I linked up-thread shows that fsync does nothing
unless the inode's touched, on an out-of-the-box Ubuntu 9.10 using
ext3 on a straight-from-Dell system.

Surely that's a common config, no?

If I uncomment the fchmod lines below I can see that even with ext3
and write caches enabled on my drives it does indeed wait.
Note that EXT4 doesn't show the problem on the same system.

Here's a slightly modified test program that's a bit easier to run.
If you run the program and it exits right away, your system isn't
waiting for platters to spin.

////////////////////////////////////////////////////////////////////
/*
** based on http://article.gmane.org/gmane.linux.file-systems/21373
** http://thread.gmane.org/gmane.linux.kernel/646040
** If this program returns instantly, the fsync() lied.
** If it takes a second or so, fsync() probably works.
** On ext3 and drives that cache writes, you probably need
** to uncomment the fchmod's to make fsync work right.
*/
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc,char *argv[]) {
  if (argc<2) {
    printf("usage: fs <filename>\n");
    exit(1);
  }
  int fd = open (argv[1], O_RDWR | O_CREAT | O_TRUNC, 0666);
  int i;
  for (i = 0; i < 100; i++) {
    char byte = 0;   /* initialized so we write a known value */
    pwrite (fd, &byte, 1, 0);
    // fchmod (fd, 0644); fchmod (fd, 0664);
    fsync (fd);
  }
  close (fd);
  return 0;
}
////////////////////////////////////////////////////////////////////
ron@ron-desktop:/tmp$ /usr/bin/time ./a.out foo
0.00user 0.00system 0:00.01elapsed 21%CPU (0avgtext+0avgdata 0maxresident)k



From:
Bruce Momjian
Date:

Ron Mayer wrote:
> Bruce Momjian wrote:
> > Greg Smith wrote:
> >> Bruce Momjian wrote:
> >>> I thought our only problem was testing the I/O subsystem --- I never
> >>> suspected the file system might lie too.  That email indicates that a
> >>> large percentage of our install base is running on unreliable file
> >>> systems --- why have I not heard about this before?
> >>>
> >> The reason why it
> >> doesn't bite more people is that most Linux systems don't turn on write
> >> barrier support by default, and there's a number of situations that can
> >> disable barriers even if you did try to enable them.  It's still pretty
> >> unusual to have a working system with barriers turned on nowadays; I
> >> really doubt it's "a large percentage of our install base".
> >
> > Ah, so it is only when write barriers are enabled, and they are not
> > enabled by default --- OK, that makes sense.
>
> The test program I linked up-thread shows that fsync does nothing
> unless the inode's touched on an out-of-the-box Ubuntu 9.10 using
> ext3 on a straight from Dell system.
>
> Surely that's a common config, no?

Yea, this certainly suggests that the problem is widespread.

--
  Bruce Momjian  <>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

From:
Scott Carey
Date:

On 11/19/09 1:04 PM, "Greg Smith" <> wrote:

> That won't help.  Once the checkpoint is done, the problem isn't just
> that the WAL segments are recycled.  The server isn't going to use them
> even if they were there.  The reason why you can erase/recycle them is
> that you're doing so *after* writing out a checkpoint record that says
> you don't have to ever look at them again.  What you'd actually have to
> do is hack the server code to insert that delay after every fsync--there
> are none that you can cheat on and not introduce a corruption
> possibility.  The whole WAL/recovery mechanism in PostgreSQL doesn't
> make a lot of assumptions about what the underlying disk has to actually
> do beyond the fsync requirement; the flip side to that robustness is
> that it's the one you can't ever violate safely.

Yeah, I guess it's not so easy.  Having the system "hold" one extra
checkpoint worth of segments and then, during recovery, always replay that
previous one plus the current might work, but I don't know if that could
cause corruption.  I assume replaying a log twice won't, so replaying the N-1
checkpoint, then the current one, might work.  If so that would be a cool
feature -- so long as the N-2 checkpoint is no longer in the OS or I/O
hardware caches when checkpoint N completes, you're safe!  It's probably more
complicated though, especially with respect to things like MVCC on DDL
changes.

> Right.  It's not used like the write-cache on a regular hard drive,
> where they're buffering 8MB-32MB worth of writes just to keep seek
> overhead down.  It's there primarily to allow combining writes into
> large chunks, to better match the block size of the underlying SSD flash
> cells (128K).  Having enough space for two full cells allows spooling
> out the flash write to a whole block while continuing to buffer the next
> one.
>
> This is why turning the cache off can tank performance so badly--you're
> going to be writing a whole 128K block no matter what if it's forced to
> disk without caching, even if it's just to write an 8K page to it.

As others mentioned, flash must erase a whole block at once, but it can
write sequentially to a block in much smaller chunks.  I believe that MLC
and SLC differ a bit here; SLC can write smaller subsections of the erase
block.

A little old but still very useful:
http://research.microsoft.com/apps/pubs/?id=63596

> That's only going to reach 1/16 of the usual write speed on single page
> writes.  And that's why you should also be concerned at whether
> disabling the write cache impacts the drive longevity, lots of small
> writes going out in small chunks is going to wear flash out much faster
> than if the drive is allowed to wait until it's got a full sized block
> to write every time.

This is still a concern, since even if the SLC cells are technically capable
of writing sequentially in smaller chunks, with the write cache off they may
not do so.

>
> The fact that the cache is so small is also why it's harder to catch the
> drive doing the wrong thing here.  The plug test is pretty sensitive to
> a problem when you've got megabytes worth of cached writes that are
> spooling to disk at spinning hard drive speeds.  The window for loss on
> a SSD with no seek overhead and only a moderate number of KB worth of
> cached data is much, much smaller.  Doesn't mean it's gone though.  It's
> a shame that the design wasn't improved just a little bit; a cheap
> capacitor and blocking new writes once the incoming power dropped is all
> it would take to make these much more reliable for database use.  But
> that would raise the price, and not really help anybody but the small
> subset of the market that cares about durable writes.

Yup.  There are manufacturers who claim no data loss on power failure,
hopefully these become more common.
http://www.wdc.com/en/products/ssd/technology.asp?id=1

I still contend it's a lot safer than a hard drive.  I have not seen one
fail yet (out of about 150 heavy-use drive-years on X25-Ms).  Any system
that does not have a battery-backed write cache will be faster and safer
with an SSD (write cache on) than with hard drives (write caches on).

BBU caching is not fail-safe either: batteries wear out, cards die or
malfunction.
If you need maximum data integrity, you will probably go with a
battery-backed-cache RAID setup, with or without SSDs.  If you don't go that
route, SSDs seem like the best option.  The 'middle ground' of software RAID
with hard drives with their write caches off doesn't seem useful to me at
all.  I can't think of one use case that isn't better served by a slightly
cheaper array of disks with a hardware BBU card (if the data is important or
the data size is large) OR a set of SSDs (if performance is more important
than data safety).

>> 4: Yet another solution:  The drives DO adhere to write barriers properly.
>> A filesystem that used these in the process of fsync() would be fine too.
>> So XFS without LVM or MD (or the newer versions of those that don't ignore
>> barriers) would work too.
>>
> If I really trusted anything beyond the very basics of the filesystem to
> really work well on Linux, this whole issue would be moot for most of
> the production deployments I do.  Ideally, fsync would just push out the
> minimum of what's needed, it would call the appropriate write cache
> flush mechanism the way the barrier implementation does when that all
> works, life would be good.  Alternately, you might even switch to using
> O_SYNC writes instead, which on a good filesystem implementation are
> both accelerated and safe compared to write/fsync (I've seen that work
> as expected on Vertias VxFS for example).
>

We could all move to OpenSolaris, where that stuff does work right...  ;)
I think a lot of what makes ZFS slower for some tasks is that it
correctly implements and uses write barriers...

> Meanwhile, in the actual world we live, patches that make writes more
> durable by default are dropped by the Linux community because they tank
> performance for too many types of loads, I'm frightened to turn on
> O_SYNC at all on ext3 because of reports of corruption on the lists
> here, fsync does way more work than it needs to, and the way the
> filesystem and block drivers have been separated makes it difficult to
> do any sort of device write cache control from userland.  This is why I
> try to use the simplest, best tested approach out there whenever possible.
>

Oh I hear you :)  At least ext4 looks like an improvement for the
RHEL6/CentOS6 timeframe.  Checksums are handy.

Many of my systems, though, don't need the highest data reliability.  And a
RAID 0 of X25-M's will be much, much safer than the same thing made of
regular hard drives, and faster.  I'm putting a few of those on one system
soon (yes, the M model; I won't put WAL on it).  Two such drives kick the
crap out of anything else for the price when performance is most important
and the data is just a copy of something stored in a much safer place than
any single server.  Previously on such systems a caching RAID card would be
needed for performance, but without a BBU the data loss risk is very high
(much higher than an SSD with caching on -- 256KB versus 512MB of cache!).
And an SSD costs less than the RAID card.  So long as the total data size
isn't too big they work well.  And even then, some tablespaces can be put on
a large HD, leaving the more critical ones on the SSD.
I estimate the likelihood of complete data loss from a 2 SSD raid-0 as the
same as a 4-disk RAID 5 of hard drives.  There is a big difference between a
couple corrupted files and a lost drive...  I have recovered postgres
systems with corruption by reindexing and restoring single tables from
backups.  When one drive in a stripe is lost or a pair in a raid 10 go down,
all is lost.
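That kind of estimate can be made explicit with a toy annual-loss model. The AFR figures and one-day rebuild window below are invented for illustration (none come from this thread), and the comparison swings wildly with those assumptions:

```python
# Hypothetical annual failure rates (AFRs) -- illustrative assumptions,
# not measured values from any vendor or this thread.
AFR_SSD = 0.01    # assumed AFR for one SSD
AFR_HDD = 0.03    # assumed AFR for one hard drive
REBUILD_DAYS = 1.0  # assumed time to rebuild onto a replacement drive

def raid0_loss(n, afr):
    # RAID 0 loses everything if any one of the n drives fails.
    return 1 - (1 - afr) ** n

def raid5_loss(n, afr, rebuild_days=REBUILD_DAYS):
    # Rough model: the array is lost only if a second drive fails
    # while the first failed drive is being rebuilt.
    p_first = 1 - (1 - afr) ** n
    p_second = 1 - (1 - afr * rebuild_days / 365.0) ** (n - 1)
    return p_first * p_second

print(raid0_loss(2, AFR_SSD))   # 2-SSD stripe
print(raid5_loss(4, AFR_HDD))   # 4-HDD RAID 5
```

With these particular assumptions the RAID 5 comes out far safer; whether the two really come out comparable, as estimated above, depends entirely on the drive AFRs and rebuild window you plug in.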

I wonder -- has anyone seen an Intel SSD randomly die like a hard drive?
I'm still trying to get an "M" to wear out by writing about 120GB a day to
it for a year.  But rough calculations show that I'm likely years from
trouble...  By then I'll have upgraded to the gen 3 or 4 drives.
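The "years from trouble" arithmetic can be spelled out. Everything below except the 120GB/day workload is an assumption for illustration -- the capacity, cycle rating, and write amplification figures are typical MLC ballpark numbers, not stated anywhere in this thread:

```python
# Back-of-the-envelope flash wear estimate. Assumed parameters:
capacity_gb = 80               # assumed drive capacity
erase_cycles = 10_000          # assumed MLC erase-cycle rating
write_amplification = 1.5      # assumed; real values vary widely
daily_writes_gb = 120          # the workload mentioned above

# Total host data that can be written before the cells wear out,
# reduced by write amplification inside the drive.
total_writable_gb = capacity_gb * erase_cycles / write_amplification
years_to_wear_out = total_writable_gb / daily_writes_gb / 365
print(round(years_to_wear_out, 1))
```

Even with these rough numbers the result lands comfortably past a decade, which matches the "years from trouble" intuition above.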

> --
> Greg Smith    2ndQuadrant   Baltimore, MD
> PostgreSQL Training, Services and Support
>   www.2ndQuadrant.com
>
>


From:
Matthew Wakeling
Date:

On Fri, 13 Nov 2009, Greg Smith wrote:
> In order for a drive to work reliably for database use such as for
> PostgreSQL, it cannot have a volatile write cache.  You either need a write
> cache with a battery backup (and a UPS doesn't count), or to turn the cache
> off.  The SSD performance figures you've been looking at are with the drive's
> write cache turned on, which means they're completely fictitious and
> exaggerated upwards for your purposes.  In the real world, that will result
> in database corruption after a crash one day.

Seagate are claiming to be on the ball with this one.

http://www.theregister.co.uk/2009/12/08/seagate_pulsar_ssd/

Matthew

--
 The third years are wandering about all worried at the moment because they
 have to hand in their final projects. Please be sympathetic to them, say
 things like "ha-ha-ha", but in a sympathetic tone of voice
                                        -- Computer Science Lecturer

From:
Bruce Momjian
Date:

Matthew Wakeling wrote:
> On Fri, 13 Nov 2009, Greg Smith wrote:
> > In order for a drive to work reliably for database use such as for
> > PostgreSQL, it cannot have a volatile write cache.  You either need a write
> > cache with a battery backup (and a UPS doesn't count), or to turn the cache
> > off.  The SSD performance figures you've been looking at are with the drive's
> > write cache turned on, which means they're completely fictitious and
> > exaggerated upwards for your purposes.  In the real world, that will result
> > in database corruption after a crash one day.
>
> Seagate are claiming to be on the ball with this one.
>
> http://www.theregister.co.uk/2009/12/08/seagate_pulsar_ssd/

I have updated our documentation to mention that even SSD drives often
have volatile write-back caches.  Patch attached and applied.

--
  Bruce Momjian  <>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com
  PG East:  http://www.enterprisedb.com/community/nav-pg-east-2010.do
  + If your life is a hard drive, Christ can be your backup. +
Index: doc/src/sgml/wal.sgml
===================================================================
RCS file: /cvsroot/pgsql/doc/src/sgml/wal.sgml,v
retrieving revision 1.61
diff -c -c -r1.61 wal.sgml
*** doc/src/sgml/wal.sgml    3 Feb 2010 17:25:06 -0000    1.61
--- doc/src/sgml/wal.sgml    20 Feb 2010 18:26:40 -0000
***************
*** 59,65 ****
     same concerns about data loss exist for write-back drive caches as
     exist for disk controller caches.  Consumer-grade IDE and SATA drives are
     particularly likely to have write-back caches that will not survive a
!    power failure.  To check write caching on <productname>Linux</> use
     <command>hdparm -I</>;  it is enabled if there is a <literal>*</> next
     to <literal>Write cache</>; <command>hdparm -W</> to turn off
     write caching.  On <productname>FreeBSD</> use
--- 59,66 ----
     same concerns about data loss exist for write-back drive caches as
     exist for disk controller caches.  Consumer-grade IDE and SATA drives are
     particularly likely to have write-back caches that will not survive a
!    power failure.  Many solid-state drives also have volatile write-back
!    caches.  To check write caching on <productname>Linux</> use
     <command>hdparm -I</>;  it is enabled if there is a <literal>*</> next
     to <literal>Write cache</>; <command>hdparm -W</> to turn off
     write caching.  On <productname>FreeBSD</> use
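As a sketch of the check the patch describes: `hdparm -I` lists drive features with a `*` next to the enabled ones, so detecting the write-cache state amounts to scanning for that flag. The sample text below imitates hdparm's output format and is not from a real drive; `hdparm -W 0 /dev/sdX` would then turn the cache off:

```python
def write_cache_enabled(hdparm_i_output: str) -> bool:
    # Look for the "Write cache" feature line; hdparm prefixes
    # enabled features with a '*'.
    for line in hdparm_i_output.splitlines():
        stripped = line.strip()
        if stripped.endswith("Write cache"):
            return stripped.startswith("*")
    return False  # feature line not present at all

# Fabricated sample mimicking `hdparm -I` feature output:
sample = """\
Commands/features:
        Enabled Supported:
           *    SMART feature set
           *    Write cache
           *    Look-ahead
"""
print(write_cache_enabled(sample))
```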

From:
Dan Langille
Date:

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Bruce Momjian wrote:
> Matthew Wakeling wrote:
>> On Fri, 13 Nov 2009, Greg Smith wrote:
>>> In order for a drive to work reliably for database use such as for
>>> PostgreSQL, it cannot have a volatile write cache.  You either need a write
>>> cache with a battery backup (and a UPS doesn't count), or to turn the cache
>>> off.  The SSD performance figures you've been looking at are with the drive's
>>> write cache turned on, which means they're completely fictitious and
>>> exaggerated upwards for your purposes.  In the real world, that will result
>>> in database corruption after a crash one day.
>> Seagate are claiming to be on the ball with this one.
>>
>> http://www.theregister.co.uk/2009/12/08/seagate_pulsar_ssd/
>
> I have updated our documentation to mention that even SSD drives often
> have volatile write-back caches.  Patch attached and applied.

Hmmm.  That got me thinking: consider ZFS and HDD with volatile cache.
Do the characteristics of ZFS avoid this issue entirely?

- --
Dan Langille

BSDCan - The Technical BSD Conference : http://www.bsdcan.org/
PGCon  - The PostgreSQL Conference:     http://www.pgcon.org/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.13 (FreeBSD)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkuAayQACgkQCgsXFM/7nTyMggCgnZUbVzldxjp/nPo8EL1Nq6uG
6+IAoNGIB9x8/mwUQidjM9nnAADRbr9j
=3RJi
-----END PGP SIGNATURE-----

From:
Bruce Momjian
Date:

Dan Langille wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Bruce Momjian wrote:
> > Matthew Wakeling wrote:
> >> On Fri, 13 Nov 2009, Greg Smith wrote:
> >>> In order for a drive to work reliably for database use such as for
> >>> PostgreSQL, it cannot have a volatile write cache.  You either need a write
> >>> cache with a battery backup (and a UPS doesn't count), or to turn the cache
> >>> off.  The SSD performance figures you've been looking at are with the drive's
> >>> write cache turned on, which means they're completely fictitious and
> >>> exaggerated upwards for your purposes.  In the real world, that will result
> >>> in database corruption after a crash one day.
> >> Seagate are claiming to be on the ball with this one.
> >>
> >> http://www.theregister.co.uk/2009/12/08/seagate_pulsar_ssd/
> >
> > I have updated our documentation to mention that even SSD drives often
> > have volatile write-back caches.  Patch attached and applied.
>
> Hmmm.  That got me thinking: consider ZFS and HDD with volatile cache.
> Do the characteristics of ZFS avoid this issue entirely?

No, I don't think so.  ZFS only avoids partial page writes.  ZFS still
assumes something sent to the drive is permanent or it would have no way
to operate.

--
  Bruce Momjian  <>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com
  PG East:  http://www.enterprisedb.com/community/nav-pg-east-2010.do
  + If your life is a hard drive, Christ can be your backup. +

From:
Scott Carey
Date:

On Feb 20, 2010, at 3:19 PM, Bruce Momjian wrote:

> Dan Langille wrote:
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA1
>>
>> Bruce Momjian wrote:
>>> Matthew Wakeling wrote:
>>>> On Fri, 13 Nov 2009, Greg Smith wrote:
>>>>> In order for a drive to work reliably for database use such as for
>>>>> PostgreSQL, it cannot have a volatile write cache.  You either need a write
>>>>> cache with a battery backup (and a UPS doesn't count), or to turn the cache
>>>>> off.  The SSD performance figures you've been looking at are with the drive's
>>>>> write cache turned on, which means they're completely fictitious and
>>>>> exaggerated upwards for your purposes.  In the real world, that will result
>>>>> in database corruption after a crash one day.
>>>> Seagate are claiming to be on the ball with this one.
>>>>
>>>> http://www.theregister.co.uk/2009/12/08/seagate_pulsar_ssd/
>>>
>>> I have updated our documentation to mention that even SSD drives often
>>> have volatile write-back caches.  Patch attached and applied.
>>
>> Hmmm.  That got me thinking: consider ZFS and HDD with volatile cache.
>> Do the characteristics of ZFS avoid this issue entirely?
>
> No, I don't think so.  ZFS only avoids partial page writes.  ZFS still
> assumes something sent to the drive is permanent or it would have no way
> to operate.
>

ZFS is write-back cache aware, and safe provided the drive's cache flushing
and write barrier related commands work.  It will flush data in 'transaction
groups' and flush the drive write caches at the end of those transactions.
Since it's copy on write, it can ensure that all the changes in the
transaction group appear on disk, or all are lost.  This all works so long
as the cache flush commands do.


> --
>  Bruce Momjian  <>        http://momjian.us
>  EnterpriseDB                             http://enterprisedb.com
>  PG East:  http://www.enterprisedb.com/community/nav-pg-east-2010.do
>  + If your life is a hard drive, Christ can be your backup. +
>
> --
> Sent via pgsql-performance mailing list ()
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance


From:
Bruce Momjian
Date:

Scott Carey wrote:
> On Feb 20, 2010, at 3:19 PM, Bruce Momjian wrote:
>
> > Dan Langille wrote:
> >> -----BEGIN PGP SIGNED MESSAGE-----
> >> Hash: SHA1
> >>
> >> Bruce Momjian wrote:
> >>> Matthew Wakeling wrote:
> >>>> On Fri, 13 Nov 2009, Greg Smith wrote:
> >>>>> In order for a drive to work reliably for database use such as for
> >>>>> PostgreSQL, it cannot have a volatile write cache.  You either need a write
> >>>>> cache with a battery backup (and a UPS doesn't count), or to turn the cache
> >>>>> off.  The SSD performance figures you've been looking at are with the drive's
> >>>>> write cache turned on, which means they're completely fictitious and
> >>>>> exaggerated upwards for your purposes.  In the real world, that will result
> >>>>> in database corruption after a crash one day.
> >>>> Seagate are claiming to be on the ball with this one.
> >>>>
> >>>> http://www.theregister.co.uk/2009/12/08/seagate_pulsar_ssd/
> >>>
> >>> I have updated our documentation to mention that even SSD drives often
> >>> have volatile write-back caches.  Patch attached and applied.
> >>
> >> Hmmm.  That got me thinking: consider ZFS and HDD with volatile cache.
> >> Do the characteristics of ZFS avoid this issue entirely?
> >
> > No, I don't think so.  ZFS only avoids partial page writes.  ZFS still
> > assumes something sent to the drive is permanent or it would have no way
> > to operate.
> >
>
> ZFS is write-back cache aware, and safe provided the drive's
> cache flushing and write barrier related commands work.  It will
> flush data in 'transaction groups' and flush the drive write
> caches at the end of those transactions.  Since its copy on
> write, it can ensure that all the changes in the transaction
> group appear on disk, or all are lost.  This all works so long
> as the cache flush commands do.

Agreed, though I thought the problem was that SSDs lie about their
cache flush like SATA drives do, or is there something I am missing?

--
  Bruce Momjian  <>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com
  PG East:  http://www.enterprisedb.com/community/nav-pg-east-2010.do
  + If your life is a hard drive, Christ can be your backup. +

From:
Ron Mayer
Date:

Bruce Momjian wrote:
> Agreed, though I thought the problem was that SSDs lie about their
> cache flush like SATA drives do, or is there something I am missing?

There's exactly one case I can find[1] where this century's IDE
drives lied more than any other drive with a cache:

  Under 120GB Maxtor drives from late 2003 to early 2004.

and it's apparently been worked around for years.

Those drives claimed to support the "FLUSH_CACHE_EXT" feature (IDE
command 0xEA), but did not support sending 48-bit commands which
was needed to send the cache flushing command.

And for that case a workaround for Linux was quickly identified by
checking for *both* the support for 48-bit commands and support for the
flush cache extension[2].


Beyond those 2004 drive + 2003 kernel systems, I think most of the rest
of such reports have been various misfeatures in some of Linux's
filesystems (like ext3, which only wants to send drives cache-flushing
commands when inodes change[3]) and Linux software raid misfeatures....

...and ISTM those would affect SSDs the same way they'd affect SATA drives.


[1] http://lkml.org/lkml/2004/5/12/132
[2] http://lkml.org/lkml/2004/5/12/200
[3] http://www.mail-archive.com//msg272253.html



From:
Greg Smith
Date:

Ron Mayer wrote:
> Bruce Momjian wrote:
>> Agreed, though I thought the problem was that SSDs lie about their
>> cache flush like SATA drives do, or is there something I am missing?
> There's exactly one case I can find[1] where this century's IDE
> drives lied more than any other drive with a cache:

Ron is correct that the problem of mainstream SATA drives accepting the cache flush command but not actually doing anything with it is long gone at this point.  If you have a regular SATA drive, it almost certainly supports proper cache flushing.  And if your whole software/storage stack understands all that, you should not end up with corrupted data just because there's a volatile write cache in there.

But the point of this whole testing exercise coming back into vogue again is that SSDs have returned this negligent behavior to the mainstream again.  See http://opensolaris.org/jive/thread.jspa?threadID=121424 for a discussion of this in a ZFS context just last month.  There are many documented cases of Intel SSDs that will fake a cache flush, such that the only way to get good reliable writes is to totally disable their write caches--at which point performance is so bad you might as well have gotten a RAID10 setup instead (and longevity is toast too).

This whole area remains a disaster area and extreme distrust of all the SSD storage vendors is advisable at this point.  Basically, if I don't see the capacitor responsible for flushing outstanding writes, and get a clear description from the manufacturer how the cached writes are going to be handled in the event of a power failure, at this point I have to assume the answer is "badly and your data will be eaten".  And the prices for SSDs that meet that requirement are still quite steep.  I keep hoping somebody will address this market at something lower than the standard "enterprise" prices.  The upcoming SandForce designs seem to have thought this through correctly:  http://www.anandtech.com/storage/showdoc.aspx?i=3702&p=6  But the product's not out to the general public yet (just like the Seagate units that claim to have capacitor backups--I heard a rumor those are also Sandforce designs actually, so they may be the only ones doing this right and aiming at a lower price).

-- 
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
   www.2ndQuadrant.us
From:
Arjen van der Meijden
Date:

On 22-2-2010 6:39 Greg Smith wrote:
> But the point of this whole testing exercise coming back into vogue
> again is that SSDs have returned this negligent behavior to the
> mainstream again. See
> http://opensolaris.org/jive/thread.jspa?threadID=121424 for a discussion
> of this in a ZFS context just last month. There are many documented
> cases of Intel SSDs that will fake a cache flush, such that the only way
> to get good reliable writes is to totally disable their writes
> caches--at which point performance is so bad you might as well have
> gotten a RAID10 setup instead (and longevity is toast too).

That's weird. Intel's SSD's didn't have a write cache afaik:
"I asked Intel about this and it turns out that the DRAM on the Intel
drive isn't used for user data because of the risk of data loss, instead
it is used as memory by the Intel SATA/flash controller for deciding
exactly where to write data (I'm assuming for the wear
leveling/reliability algorithms)."
http://www.anandtech.com/cpuchipsets/intel/showdoc.aspx?i=3403&p=10

But that is the old version, perhaps the second generation does have a
bit of write caching.

I can understand that an SSD might do unexpected things when it loses
power all of a sudden. It will probably try to group writes to fill a
single block (those blocks vary in size but are normally way larger than
those of a normal spinning disk -- values like 256 or 512KB), and it
might lose that "waiting until a full block can be written" data, or
perhaps it just couldn't complete a full block write due to the power
failure.
Although that behavior isn't really what you want, it would be incorrect
to blame write caching for the behavior if the device doesn't even have
a write cache ;)

Best regards,

Arjen


From:
Bruce Momjian
Date:

Greg Smith wrote:
> Ron Mayer wrote:
> > Bruce Momjian wrote:
> >
> >> Agreed, though I thought the problem was that SSDs lie about their
> >> cache flush like SATA drives do, or is there something I am missing?
> >>
> >
> > There's exactly one case I can find[1] where this century's IDE
> > drives lied more than any other drive with a cache:
>
> Ron is correct that the problem of mainstream SATA drives accepting the
> cache flush command but not actually doing anything with it is long gone
> at this point.  If you have a regular SATA drive, it almost certainly
> supports proper cache flushing.  And if your whole software/storage
> stack understands all that, you should not end up with corrupted data
> just because there's a volatile write cache in there.

OK, but I have a few questions.  Is a write to the drive and a cache
flush command the same?  Which file systems implement both?  I thought a
write to the drive was always assumed to flush it to the platters,
assuming the drive's cache is set to write-through.

--
  Bruce Momjian  <>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com
  PG East:  http://www.enterprisedb.com/community/nav-pg-east-2010.do
  + If your life is a hard drive, Christ can be your backup. +

From:
Bruce Momjian
Date:

Ron Mayer wrote:
> Bruce Momjian wrote:
> > Agreed, though I thought the problem was that SSDs lie about their
> > cache flush like SATA drives do, or is there something I am missing?
>
> There's exactly one case I can find[1] where this century's IDE
> drives lied more than any other drive with a cache:
>
>   Under 120GB Maxtor drives from late 2003 to early 2004.
>
> and it's apparently been worked around for years.
>
> Those drives claimed to support the "FLUSH_CACHE_EXT" feature (IDE
> command 0xEA), but did not support sending 48-bit commands which
> was needed to send the cache flushing command.
>
> And for that case a workaround for Linux was quickly identified by
> checking for *both* the support for 48-bit commands and support for the
> flush cache extension[2].
>
>
> Beyond those 2004 drive + 2003 kernel systems, I think most of the rest
> of such reports have been various misfeatures in some of Linux's
> filesystems (like ext3, which only wants to send drives cache-flushing
> commands when inodes change[3]) and Linux software raid misfeatures....
>
> ...and ISTM those would affect SSDs the same way they'd affect SATA drives.

I think the point is not that drives lie about their write-back and
write-through behavior, but rather that many SATA/IDE drives default to
write-back, not write-through, and many administrators and file
systems are not aware of this behavior.

--
  Bruce Momjian  <>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com
  PG East:  http://www.enterprisedb.com/community/nav-pg-east-2010.do
  + If your life is a hard drive, Christ can be your backup. +

From:
Ron Mayer
Date:

Bruce Momjian wrote:
> Greg Smith wrote:
>> .... If you have a regular SATA drive, it almost certainly
>> supports proper cache flushing....
>
> OK, but I have a few questions.  Is a write to the drive and a cache
> flush command the same?

I believe they're different as of ATAPI-6 from 2001.

> Which file systems implement both?

Seems ZFS and recent ext4 have thought these interactions out
thoroughly.   Find a slow ext4 that people complain about, and
that's the one doing it right :-).

Ext3 has some particularly odd annoyances where it flushes and waits
for certain writes (ones involving inode changes) but doesn't bother
to flush others (just data changes).   As far as I can tell, with
ext3 you need userspace utilities to make sure flushes occur when
you need them.    At one point I was tempted to try to put such
userspace hacks into postgres.

I know less about other file systems.  Apparently the NTFS guys
are aware of such stuff - but don't know what kinds of fsync equivalent
you'd need to make it happen.

Also worth noting - Linux's software raid stuff (MD and LVM)
need to handle this right as well - and last I checked (sometime
last year) the default setups didn't.

>  I thought a
> write to the drive was always assumed to flush it to the platters,
> assuming the drive's cache is set to write-through.

Apparently somewhere around here:
http://www.t10.org/t13/project/d1410r3a-ATA-ATAPI-6.pdf
they were separated in the IDE world.

From:
Greg Smith
Date:

Arjen van der Meijden wrote:
> That's weird. Intel's SSD's didn't have a write cache afaik:
> "I asked Intel about this and it turns out that the DRAM on the Intel
> drive isn't used for user data because of the risk of data loss,
> instead it is used as memory by the Intel SATA/flash controller for
> deciding exactly where to write data (I'm assuming for the wear
> leveling/reliability algorithms)."
> http://www.anandtech.com/cpuchipsets/intel/showdoc.aspx?i=3403&p=10

Read further down:

"Despite the presence of the external DRAM, both the Intel controller
and the JMicron rely on internal buffers to cache accesses to the
SSD...Intel's controller has a 256KB SRAM on-die."

That's the problematic part:  the Intel controllers have a volatile
256KB write cache stored deep inside the SSD controller, and issuing a
standard SATA write cache flush command doesn't seem to clear it.  Makes
the drives troublesome for database use.

> I can understand a SSD might do unexpected things when it loses power
> all of a sudden. It will probably try to group writes to fill a single
> block (and those blocks vary in size but are normally way larger than
> those of a normal spinning disk, they are values like 256 or 512KB)
> and it might lose that "waiting until a full block can be
> written"-data or perhaps it just couldn't complete a full block-write
> due to the power failure.
> Although that behavior isn't really what you want, it would be
> incorrect to blame write caching for the behavior if the device
> doesn't even have a write cache ;)

If you write data and that write call returns before the data hits disk,
it's a write cache, period.  And if that write cache loses its contents
if power is lost, it's a volatile write cache that can cause database
corruption.  The fact that the one on the Intel devices is very small,
basically just dealing with the block chunking behavior you describe,
doesn't change either of those facts.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
   www.2ndQuadrant.us


From:
Mark Mielke
Date:

On 02/22/2010 08:04 PM, Greg Smith wrote:
> Arjen van der Meijden wrote:
>> That's weird. Intel's SSD's didn't have a write cache afaik:
>> "I asked Intel about this and it turns out that the DRAM on the Intel
>> drive isn't used for user data because of the risk of data loss,
>> instead it is used as memory by the Intel SATA/flash controller for
>> deciding exactly where to write data (I'm assuming for the wear
>> leveling/reliability algorithms)."
>> http://www.anandtech.com/cpuchipsets/intel/showdoc.aspx?i=3403&p=10
>
> Read further down:
>
> "Despite the presence of the external DRAM, both the Intel controller
> and the JMicron rely on internal buffers to cache accesses to the
> SSD...Intel's controller has a 256KB SRAM on-die."
>
> That's the problematic part:  the Intel controllers have a volatile
> 256KB write cache stored deep inside the SSD controller, and issuing a
> standard SATA write cache flush command doesn't seem to clear it.
> Makes the drives troublesome for database use.

I had read the above when posted, and then looked up SRAM. SRAM seems to
suggest it will hold the data even after power loss, but only for a
period of time. As long as power can be restored within a few minutes, it
seemed like this would be OK?

>> I can understand a SSD might do unexpected things when it loses power
>> all of a sudden. It will probably try to group writes to fill a
>> single block (and those blocks vary in size but are normally way
>> larger than those of a normal spinning disk, they are values like 256
>> or 512KB) and it might lose that "waiting until a full block can be
>> written"-data or perhaps it just couldn't complete a full block-write
>> due to the power failure.
>> Although that behavior isn't really what you want, it would be
>> incorrect to blame write caching for the behavior if the device
>> doesn't even have a write cache ;)
>
> If you write data and that write call returns before the data hits
> disk, it's a write cache, period.  And if that write cache loses its
> contents if power is lost, it's a volatile write cache that can cause
> database corruption.  The fact that the one on the Intel devices is
> very small, basically just dealing with the block chunking behavior
> you describe, doesn't change either of those facts.
>

The SRAM seems to suggest that it does not necessarily lose its contents
if power is lost - it just doesn't say how long you have to plug it back
in. Isn't this similar to a battery-backed cache or capacitor-backed cache?

I'd love to have a better guarantee - but is SRAM really such a bad model?

Cheers,
mark


From:
Greg Smith
Date:

Ron Mayer wrote:
> I know less about other file systems.  Apparently the NTFS guys
> are aware of such stuff - but don't know what kinds of fsync equivalent
> you'd need to make it happen.
>

It's actually pretty straightforward--better than ext3.  Windows with
NTFS has long been perfectly aware of how to do write-through on drives
that support it when you execute _commit:
http://msdn.microsoft.com/en-us/library/17618685(VS.80).aspx

If you switch the postgresql.conf setting to fsync_writethrough on
Windows, it will execute _commit where it would execute fsync on other
platforms, and that pushes through the drive's caches as it should
(unlike fsync in many cases).  More about this at
http://archives.postgresql.org/pgsql-hackers/2005-08/msg00227.php and
http://www.westnet.com/~gsmith/content/postgresql/TuningPGWAL.htm (which
also covers OS X).
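In postgresql.conf terms, the setting described above is just:

```
# On Windows, make PostgreSQL call _commit() (write-through) where it
# would otherwise call a plain fsync() when flushing WAL:
wal_sync_method = fsync_writethrough
```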

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
   www.2ndQuadrant.us


From:
Greg Smith
Date:

Mark Mielke wrote:
> I had read the above when posted, and then looked up SRAM. SRAM seems
> to suggest it will hold the data even after power loss, but only for a
> period of time. As long as power can restore within a few minutes, it
> seemed like this would be ok?

The normal type of RAM everyone uses is DRAM, which requires constant
"refresh" cycles to keep it working and is pretty power-hungry as a
result.  Power gone, data gone an instant later.

There is also Non-volatile SRAM that includes an integrated battery (
http://www.maxim-ic.com/quick_view2.cfm/qv_pk/2648 is a typical
example), and that sort of design can be used to build the sort of
battery-backed caches that RAID controllers provide.  If Intel's drives
were built using a NV-SRAM implementation, I'd be a happy owner of one
instead of a constant critic of their drives.

But regular old SRAM is still completely volatile and loses its contents
very quickly after power fails.  I doubt you'd even get minutes of time
before it's gone.  The ease at which data loss failures with these Intel
drives continue to be duplicated in the field says their design isn't
anywhere near good enough to be considered non-volatile.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
   www.2ndQuadrant.us


From:
Scott Marlowe
Date:

On Mon, Feb 22, 2010 at 6:39 PM, Greg Smith <> wrote:
> Mark Mielke wrote:
>>
>> I had read the above when posted, and then looked up SRAM. SRAM seems to
>> suggest it will hold the data even after power loss, but only for a period
>> of time. As long as power can restore within a few minutes, it seemed like
>> this would be ok?
>
> The normal type of RAM everyone uses is DRAM, which requires constant
> "refresh" cycles to keep it working and is pretty power hungry as a result.
>  Power gone, data gone an instant later.

Actually, oddly enough, per bit stored dram is much lower power usage
than sram, because it only has something like 2 transistors per bit,
while sram needs something like 4 or 5 (it's been a couple decades
since I took the classes on each).  Even with the constant refresh,
dram has a lower power draw than sram.

From:
Scott Marlowe
Date:

On Mon, Feb 22, 2010 at 7:21 PM, Scott Marlowe <> wrote:
> On Mon, Feb 22, 2010 at 6:39 PM, Greg Smith <> wrote:
>> Mark Mielke wrote:
>>>
>>> I had read the above when posted, and then looked up SRAM. SRAM seems to
>>> suggest it will hold the data even after power loss, but only for a period
>>> of time. As long as power can restore within a few minutes, it seemed like
>>> this would be ok?
>>
>> The normal type of RAM everyone uses is DRAM, which requires constant
>> "refresh" cycles to keep it working and is pretty power hungry as a result.
>>  Power gone, data gone an instant later.
>
> Actually, oddly enough, per bit stored dram is much lower power usage
> than sram, because it only has something like 2 transistors per bit,
> while sram needs something like 4 or 5 (it's been a couple decades
> since I took the classes on each).  Even with the constant refresh,
> dram has a lower power draw than sram.

Note that's power draw per bit.  DRAM is usually much more densely
packed (it can be, with fewer transistors per cell), so the individual
chips for each may have similar power draws while the DRAM will be 10
times as densely packed as the SRAM.
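For concreteness, the textbook cell designs behind that density difference are a 6T SRAM cell versus a 1T1C DRAM cell (the from-memory figures in the messages above differ slightly; these are the commonly cited values, not numbers from this thread):

```python
# Commonly cited textbook cell designs:
# SRAM: 6 transistors per bit (classic 6T cell)
# DRAM: 1 transistor + 1 capacitor per bit (1T1C cell)
sram_transistors_per_bit = 6
dram_transistors_per_bit = 1

# Per-bit transistor count ratio -- one rough proxy for why DRAM can
# be packed so much more densely than SRAM.
ratio = sram_transistors_per_bit / dram_transistors_per_bit
print(ratio)
```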

From:
david@lang.hm
Date:

On Mon, 22 Feb 2010, Ron Mayer wrote:

>
> Also worth noting - Linux's software raid stuff (MD and LVM)
> need to handle this right as well - and last I checked (sometime
> last year) the default setups didn't.
>

I think I saw some stuff on this issue on the kernel mailing list in
the last few months. You may want to double-check this when 2.6.33 gets
released (probably this week).

David Lang

From:
"Pierre C"
Date:

> Note that's power draw per bit.  dram is usually much more densely
> packed (it can be with fewer transistors per cell) so the individual
> chips for each may have similar power draws while the dram will be 10
> times as densely packed as the sram.

Differences between SRAM and DRAM :

- price per byte (DRAM much cheaper)

- silicon area per byte (DRAM much smaller)

- random access latency
    SRAM = fast, uniform, and predictable, usually 0/1 cycles
    DRAM = "a few" up to "a lot" of cycles depending on chip type,
    which page/row/column you want to access, whether it's R or W,
    whether the page is already open, etc

In fact, DRAM is the new harddisk. SRAM is used mostly when low-latency is
needed (caches, etc).

- ease of use :
    SRAM very easy to use : address, data, read, write, clock.
    SDRAM needs a smart controller.
    SRAM easier to instantiate on a silicon chip

- power draw
    When used at high speeds, SRAM isn't power-saving at all; it's used for
speed.
    However when not used, the power draw is really negligible.

While it is true that you can recover *some* data out of a SRAM/DRAM chip
that hasn't been powered for a few seconds, you can't really trust that
data. It's only a forensics tool.

Most DRAM now (especially laptop DRAM) includes special power-saving modes
which only keep the data retention logic (refresh, etc) powered, but not
the rest of the chip (internal caches, IO buffers, etc). Laptops, PDAs,
etc all use this feature in suspend-to-RAM mode. In this mode, the power
draw is higher than SRAM, but still pretty minimal, so a laptop can stay
in suspend-to-RAM mode for days.

Anyway, SRAM vs. DRAM isn't really relevant to the debate about SSD data
integrity. You can back either up with a small battery or ultracap.

What is important too is that the entire SSD chipset must have been
designed with this in mind : it must detect power loss, and correctly
react to it, and especially not reset itself or do funny stuff to the
memory when the power comes back. Which means at least some parts of the
chipset must stay powered to keep their state.

Now I wonder about something. SSDs use wear-leveling which means the
information about which block was written where must be kept somewhere.
Which means this information must be updated. I wonder how crash-safe and
how atomic these updates are, in the face of a power loss.  This is just
like a filesystem. You've been talking only about data, but the block
layout information (metadata) is subject to the same concerns. If the
drive says it's written, not only the data must have been written, but
also the information needed to locate that data...

Therefore I think the yank-the-power-cord test should be done with random
writes happening on an aged and mostly-full SSD... and afterwards, I'd be
interested to know if not only the last txn really committed, but if some
random parts of other stuff weren't "wear-leveled" into oblivion at the
power loss...

From:
Nikolas Everett
Date:



On Tue, Feb 23, 2010 at 6:49 AM, Pierre C <> wrote:
Note that's power draw per bit.  dram is usually much more densely
packed (it can be with fewer transistors per cell) so the individual
chips for each may have similar power draws while the dram will be 10
times as densely packed as the sram.

Differences between SRAM and DRAM :

[lots of informative stuff]

I've been slowly reading the paper at http://people.redhat.com/drepper/cpumemory.pdf which has a big section on SRAM vs. DRAM with nice pretty pictures. While not strictly relevant, it's been illuminating, and I wanted to share.
 

From:
Scott Carey
Date:

On Feb 23, 2010, at 3:49 AM, Pierre C wrote:
> Now I wonder about something. SSDs use wear-leveling which means the
> information about which block was written where must be kept somewhere.
> Which means this information must be updated. I wonder how crash-safe and
> how atomic these updates are, in the face of a power loss.  This is just
> like a filesystem. You've been talking only about data, but the block
> layout information (metadata) is subject to the same concerns. If the
> drive says it's written, not only the data must have been written, but
> also the information needed to locate that data...
>
> Therefore I think the yank-the-power-cord test should be done with random
> writes happening on an aged and mostly-full SSD... and afterwards, I'd be
> interested to know if not only the last txn really committed, but if some
> random parts of other stuff weren't "wear-leveled" into oblivion at the
> power loss...
>

A couple years ago I postulated that SSDs could do random writes fast if
they remapped blocks.  Microsoft's SSD whitepaper at the time hinted at
this too.
Persisting the remap data is not hard.  It goes in the same location as
the data, or a separate area that can be written to linearly.

Each block may contain its LBA and a transaction ID or other atomic
count.  Or another block can have that info.  When the SSD powers up, it
can build its table of LBA > block by looking at that data and inverting
it, keeping the highest transaction ID for duplicate LBA claims.

Although SSDs have to ERASE data in a large block at a time (256K to 2M
typically), they can write linearly to an erased block in much smaller
chunks.
Thus, to commit a write, either:
Data, LBA tag, and txID in same block (may require oddly sized blocks).
or
Data written to one block (not committed yet), then LBA tag and txID
written elsewhere (which commits the write).  Since it's all
copy-on-write, partial writes can't happen.
If a block is being moved or compressed when power fails, data should
never be lost, since the old data still exists; the new version just
didn't commit.  But new data that is being written may not be committed
yet in the case of a power failure unless other measures are taken.
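
A minimal sketch of that power-up scan (illustrative Python, not any
vendor's firmware; the function and tag format are invented for the
example):

```python
# Rebuild the LBA -> physical-block table by scanning every block's
# (LBA, txID) tag and keeping, per LBA, the block with the highest
# transaction ID -- older duplicate LBA claims lose to the newer write.

def rebuild_lba_table(blocks):
    """blocks: iterable of (physical_index, lba, tx_id) tags read from flash.
    Returns {lba: physical_index} mapping each LBA to its newest copy."""
    table = {}      # lba -> physical_index
    newest = {}     # lba -> highest tx_id seen so far
    for phys, lba, tx_id in blocks:
        if lba not in newest or tx_id > newest[lba]:
            newest[lba] = tx_id
            table[lba] = phys
    return table

# Two physical blocks claim LBA 5; the one with txID 3 wins.
tags = [(0, 5, 1), (1, 5, 3), (2, 9, 2)]
table = rebuild_lba_table(tags)   # {5: 1, 9: 2}
```

A write torn off by power loss simply never gets its tag counted, so the
table falls back to the previous committed copy of that LBA.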



From:
david@lang.hm
Date:

On Tue, 23 Feb 2010,  wrote:

> On Mon, 22 Feb 2010, Ron Mayer wrote:
>
>>
>> Also worth noting - Linux's software raid stuff (MD and LVM)
>> need to handle this right as well - and last I checked (sometime
>> last year) the default setups didn't.
>>
>
> I think I saw some stuff in the last few months on this issue on the kernel
> mailing list. you may want to doublecheck this when 2.6.33 gets released
> (probably this week)

to clarify further (after getting more sleep ;-)

I believe that the linux software raid always did the right thing if you
did an fsync/fdatasync. However, barriers that filesystems attempted to
use to avoid the need for a hard fsync used to be silently ignored. I
believe these are now honored (in at least some configurations)

However, one thing that you do not get protection against with software
raid is the potential for the writes to hit some drives but not others. If
this happens the software raid cannot know what the correct contents of
the raid stripe are, and so you could lose everything in that stripe
(including contents of other files that are not being modified but
happened to be in the wrong place on the array)
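
A toy illustration of that failure mode on a three-disk RAID-5 stripe
(plain Python, nothing like md's actual code):

```python
# Stripe = two data chunks plus XOR parity. If power fails after the new
# data chunk hits its disk but before the new parity does, the stripe is
# inconsistent -- and rebuilding an *unrelated* chunk from the stale
# parity silently returns garbage.

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

d0, d1 = b"AAAA", b"BBBB"
parity = xor(d0, d1)          # consistent stripe

d0_new = b"CCCC"              # new d0 reaches its disk...
# ...power fails: parity was never updated and still covers the old d0.

# Later, d1's disk dies; reconstructing d1 from the survivors:
rebuilt_d1 = xor(d0_new, parity)
# rebuilt_d1 != b"BBBB" -- data that was never being written is now corrupt
```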

If you have critical data, you _really_ want to use a raid controller with
battery backup so that if you lose power you have a chance of eventually
completing the write.

David Lang

From:
Aidan Van Dyk
Date:

*  <> [100223 15:05]:

> However, one thing that you do not get protection against with software
> raid is the potential for the writes to hit some drives but not others.
> If this happens the software raid cannot know what the correct contents
>> of the raid stripe are, and so you could lose everything in that stripe
> (including contents of other files that are not being modified that
> happened to be in the wrong place on the array)

That's for stripe-based raid.  Mirror sets like raid-1 should give you
either the old data, or the new data, both acceptable responses since
the fsync/barrier hasn't "completed".

Or have I missed another subtle interaction?

a.

--
Aidan Van Dyk                                             Create like a god,
                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

From:
david@lang.hm
Date:

On Tue, 23 Feb 2010, Aidan Van Dyk wrote:

> *  <> [100223 15:05]:
>
>> However, one thing that you do not get protection against with software
>> raid is the potential for the writes to hit some drives but not others.
>> If this happens the software raid cannot know what the correct contents
>> of the raid stripe are, and so you could lose everything in that stripe
>> (including contents of other files that are not being modified that
>> happened to be in the wrong place on the array)
>
> That's for stripe-based raid.  Mirror sets like raid-1 should give you
> either the old data, or the new data, both acceptable responses since
> the fsync/barrier hasn't "completed".
>
> Or have I missed another subtle interaction?

one problem is that when the system comes back up and attempts to check
the raid array, it is not going to know which drive has valid data. I
don't know exactly what it does in that situation, but this type of error
in other conditions causes the system to take the array offline.

David Lang

From:
Mark Mielke
Date:

On 02/23/2010 04:22 PM,  wrote:
> On Tue, 23 Feb 2010, Aidan Van Dyk wrote:
>
>> *  <> [100223 15:05]:
>>
>>> However, one thing that you do not get protection against with software
>>> raid is the potential for the writes to hit some drives but not others.
>>> If this happens the software raid cannot know what the correct contents
>>> of the raid stripe are, and so you could lose everything in that
>>> stripe
>>> (including contents of other files that are not being modified that
>>> happened to be in the wrong place on the array)
>>
>> That's for stripe-based raid.  Mirror sets like raid-1 should give you
>> either the old data, or the new data, both acceptable responses since
>> the fsync/barrier hasn't "completed".
>>
>> Or have I missed another subtle interaction?
>
> one problem is that when the system comes back up and attempts to
> check the raid array, it is not going to know which drive has valid
> data. I don't know exactly what it does in that situation, but this
> type of error in other conditions causes the system to take the array
> offline.

I think the real concern here is that depending on how the data is read
later - and depending on which disks it reads from - it could read
*either* old or new, at any time in the future. I.e. it reads "new" from
disk 1 the first time, and then an hour later it reads "old" from disk 2.

I think this concern might be invalid for a properly running system,
though. When a RAID array is not cleanly shut down, the RAID array
should run in "degraded" mode until it can be sure that the data is
consistent. In this case, it should pick one drive, and call it the
"live" one, and then rebuild the other from the "live" one. Until it is
re-built, it should only satisfy reads from the "live" one, or parts of
the "rebuilding" one that are known to be clean.

I use mdadm software RAID, and all of my reading (including some of its
source code) and experience (shutting down the box uncleanly) tells me
it is working properly. In fact, the "rebuild" process can get quite
ANNOYING, as the whole system becomes much slower during rebuild, and
rebuilds of large partitions can take hours to complete.

For mdadm, there is a not-so-well-known "write-intent bitmap"
capability. Once enabled, mdadm will embed a small bitmap (128 bits?)
into the partition, and each bit will indicate a section of the
partition. Before writing to a section, it will mark that section as
dirty using this bitmap. It will leave this bit set for some time after
the partition is "clean" (lazy clear). The effect of this, is that at
any point in time, only certain sections of the drive are dirty, and on
recovery, it is a lot cheaper to only rebuild the dirty sections. It
works really well.
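
The write-intent bitmap logic is simple enough to sketch (illustrative
Python mirroring the description above, not mdadm's implementation):

```python
# Set a region's dirty bit before writing into it; clear it lazily once
# the region has been quiet for a while. After an unclean shutdown, only
# dirty regions need to be resynced instead of the whole array.

class WriteIntentBitmap:
    def __init__(self, regions):
        self.dirty = [False] * regions

    def before_write(self, region):
        self.dirty[region] = True   # must be persisted before the data write

    def lazy_clear(self, region):
        self.dirty[region] = False  # cleared some time after writes settle

    def regions_to_resync(self):
        return [i for i, d in enumerate(self.dirty) if d]

bm = WriteIntentBitmap(regions=16)
bm.before_write(3)
bm.before_write(7)
bm.lazy_clear(3)
# crash here: recovery resyncs only region 7, not all 16
```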

So, I don't think this has to be a problem. There are solutions, and any
solution that claims to be complete should offer these sorts of
capabilities.

Cheers,
mark


From:
Dave Crooke
Date:

It's always possible to rebuild into a consistent configuration by assigning a precedence order; for parity RAID, the data drives take precedence over parity drives, and for RAID-1 sets it assigns an arbitrary master.

You *should* never lose a whole stripe ... for example, RAID-5 updates do "read old data / parity, write new data, write new parity" ... there is no need to touch any other data disks, so they will be preserved through the rebuild. Similarly, if only one block is being updated there is no need to update the entire stripe.
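
That read-modify-write update can be shown in a few lines (illustrative
Python): new parity = old parity XOR old data XOR new data, so only the
changed data disk and the parity disk are ever touched.

```python
def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

# Three data chunks plus parity for one stripe.
d = [b"\x01\x01", b"\x02\x02", b"\x03\x03"]
parity = xor(xor(d[0], d[1]), d[2])

# Update only d[1]: read old data and old parity, fold the change into
# the parity, write new data and new parity. d[0] and d[2] are untouched.
new_d1 = b"\x07\x07"
new_parity = xor(xor(parity, d[1]), new_d1)
d[1] = new_d1

assert new_parity == xor(xor(d[0], d[1]), d[2])   # stripe still consistent
```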

David - what caused /dev/md to decide to take an array offline?

Cheers
Dave

On Tue, Feb 23, 2010 at 3:22 PM, <> wrote:
On Tue, 23 Feb 2010, Aidan Van Dyk wrote:

* <> [100223 15:05]:

However, one thing that you do not get protection against with software
raid is the potential for the writes to hit some drives but not others.
If this happens the software raid cannot know what the correct contents
of the raid stripe are, and so you could lose everything in that stripe
(including contents of other files that are not being modified that
happened to be in the wrong place on the array)

That's for stripe-based raid.  Mirror sets like raid-1 should give you
either the old data, or the new data, both acceptable responses since
the fsync/barrier hasn't "completed".

Or have I missed another subtle interaction?

one problem is that when the system comes back up and attempts to check the raid array, it is not going to know which drive has valid data. I don't know exactly what it does in that situation, but this type of error in other conditions causes the system to take the array offline.


David Lang

--
Sent via pgsql-performance mailing list ()
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

From:
Bruce Momjian
Date:

I have added documentation about the ATAPI drive flush command, and the
typical SSD behavior.

---------------------------------------------------------------------------

Greg Smith wrote:
> Ron Mayer wrote:
> > Bruce Momjian wrote:
> >
> >> Agreed, though I thought the problem was that SSDs lie about their
> >> cache flush like SATA drives do, or is there something I am missing?
> >>
> >
> > There's exactly one case I can find[1] where this century's IDE
> > drives lied more than any other drive with a cache:
>
> Ron is correct that the problem of mainstream SATA drives accepting the
> cache flush command but not actually doing anything with it is long gone
> at this point.  If you have a regular SATA drive, it almost certainly
> supports proper cache flushing.  And if your whole software/storage
> stack understands all that, you should not end up with corrupted data
> just because there's a volatile write cache in there.
>
> But the point of this whole testing exercise coming back into vogue
> again is that SSDs have returned this negligent behavior to the
> mainstream again.  See
> http://opensolaris.org/jive/thread.jspa?threadID=121424 for a discussion
> of this in a ZFS context just last month.  There are many documented
> cases of Intel SSDs that will fake a cache flush, such that the only way
> to get good reliable writes is to totally disable their writes
> caches--at which point performance is so bad you might as well have
> gotten a RAID10 setup instead (and longevity is toast too).
>
> This whole area remains a disaster area and extreme distrust of all the
> SSD storage vendors is advisable at this point.  Basically, if I don't
> see the capacitor responsible for flushing outstanding writes, and get a
> clear description from the manufacturer how the cached writes are going
> to be handled in the event of a power failure, at this point I have to
> assume the answer is "badly and your data will be eaten".  And the
> prices for SSDs that meet that requirement are still quite steep.  I
> keep hoping somebody will address this market at something lower than
> the standard "enterprise" prices.  The upcoming SandForce designs seem
> to have thought this through correctly:
> http://www.anandtech.com/storage/showdoc.aspx?i=3702&p=6  But the
> product's not out to the general public yet (just like the Seagate units
> that claim to have capacitor backups--I heard a rumor those are also
> Sandforce designs actually, so they may be the only ones doing this
> right and aiming at a lower price).
>
> --
> Greg Smith  2ndQuadrant US  Baltimore, MD
> PostgreSQL Training, Services and Support
>    www.2ndQuadrant.us
>

--
  Bruce Momjian  <>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com
  PG East:  http://www.enterprisedb.com/community/nav-pg-east-2010.do
  + If your life is a hard drive, Christ can be your backup. +
Index: doc/src/sgml/wal.sgml
===================================================================
RCS file: /cvsroot/pgsql/doc/src/sgml/wal.sgml,v
retrieving revision 1.62
diff -c -c -r1.62 wal.sgml
*** doc/src/sgml/wal.sgml    20 Feb 2010 18:28:37 -0000    1.62
--- doc/src/sgml/wal.sgml    27 Feb 2010 01:37:03 -0000
***************
*** 59,66 ****
     same concerns about data loss exist for write-back drive caches as
     exist for disk controller caches.  Consumer-grade IDE and SATA drives are
     particularly likely to have write-back caches that will not survive a
!    power failure.  Many solid-state drives also have volatile write-back
!    caches.  To check write caching on <productname>Linux</> use
     <command>hdparm -I</>;  it is enabled if there is a <literal>*</> next
     to <literal>Write cache</>; <command>hdparm -W</> to turn off
     write caching.  On <productname>FreeBSD</> use
--- 59,69 ----
     same concerns about data loss exist for write-back drive caches as
     exist for disk controller caches.  Consumer-grade IDE and SATA drives are
     particularly likely to have write-back caches that will not survive a
!    power failure, though <acronym>ATAPI-6</> introduced a drive cache
!    flush command that some file systems use, e.g. <acronym>ZFS</>.
!    Many solid-state drives also have volatile write-back
!    caches, and many do not honor cache flush commands by default.
!    To check write caching on <productname>Linux</> use
     <command>hdparm -I</>;  it is enabled if there is a <literal>*</> next
     to <literal>Write cache</>; <command>hdparm -W</> to turn off
     write caching.  On <productname>FreeBSD</> use

From:
Greg Smith
Date:

Bruce Momjian wrote:
> I have added documentation about the ATAPI drive flush command, and the
> typical SSD behavior.
>

If one of us goes back into that section one day to edit again it might
be worth mentioning that FLUSH CACHE EXT is the actual ATAPI-6 command
that a drive needs to support properly.  I wouldn't bother with another
doc edit commit just for that specific part though, pretty obscure.

I find it kind of funny how many discussions run in parallel about even
really detailed technical implementation details around the world.  For
example, doesn't
http://www.mail-archive.com//msg30585.html
look exactly like the exchange between myself and Arjen the other day,
referencing the same AnandTech page?

Could be worse; one of us could be the poor sap at
http://opensolaris.org/jive/thread.jspa;jsessionid=41B679C30D136C059E1BB7C06CA7DCE0?messageID=397730
who installed Windows XP, VirtualBox for Windows, an OpenSolaris VM
inside of it, and then was shocked that cache flushes didn't make their
way all the way through that chain and had his 10TB ZFS pool corrupted
as a result.  Hurray for virtualization!

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
   www.2ndQuadrant.us


From:
Bruce Momjian
Date:

Greg Smith wrote:
> Bruce Momjian wrote:
> > I have added documentation about the ATAPI drive flush command, and the
> > typical SSD behavior.
> >
>
> If one of us goes back into that section one day to edit again it might
> be worth mentioning that FLUSH CACHE EXT is the actual ATAPI-6 command
> that a drive needs to support properly.  I wouldn't bother with another
> doc edit commit just for that specific part though, pretty obscure.

That setting name was not easy to find so I added it to the
documentation.

--
  Bruce Momjian  <>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  PG East:  http://www.enterprisedb.com/community/nav-pg-east-2010.do

From:
Ron Mayer
Date:

Bruce Momjian wrote:
> Greg Smith wrote:
>> Bruce Momjian wrote:
>>> I have added documentation about the ATAPI drive flush command, and the
>>
>> If one of us goes back into that section one day to edit again it might
>> be worth mentioning that FLUSH CACHE EXT is the actual ATAPI-6 command
>> that a drive needs to support properly.  I wouldn't bother with another
>> doc edit commit just for that specific part though, pretty obscure.
>
> That setting name was not easy to find so I added it to the
> documentation.

If we're spelling out specific IDE commands, it might be worth
noting that the corresponding SCSI command is "SYNCHRONIZE CACHE"[1].


Linux apparently sends FLUSH_CACHE commands to IDE drives in the
exact same places it sends SYNCHRONIZE CACHE commands to SCSI
drives[2].

It seems that the same file systems, SW raid layers,
virtualization platforms, and kernels that have a problem
sending FLUSH CACHE commands to SATA drives have the
exact same problems sending SYNCHRONIZE CACHE commands to SCSI
drives, with the exact same effect of not getting writes all the
way through disk caches.

No?


[1] http://linux.die.net/man/8/sg_sync
[2] http://hardware.slashdot.org/comments.pl?sid=149349&cid=12519114

From:
Greg Smith
Date:

Ron Mayer wrote:
> Linux apparently sends FLUSH_CACHE commands to IDE drives in the
> exact same places it sends SYNCHRONIZE CACHE commands to SCSI
> drives[2].
>   [2] http://hardware.slashdot.org/comments.pl?sid=149349&cid=12519114
>

Well, that's old enough to not even be completely right anymore about
SATA disks and kernels.  It's FLUSH_CACHE_EXT that's been added to ATA-6
to do the right thing on modern drives and that gets used nowadays, and
that doesn't necessarily do so on most of the SSDs out there; all of
which Bruce's recent doc additions now talk about correctly.

There's this one specific area we know about that the most popular
systems tend to get really wrong all the time; that's got the
appropriate warning now with the right magic keywords that people can
look into it more if motivated.  While it would be nice to get super
thorough and document everything, I think there's already more docs in
there than this project would prefer to have to maintain in this area.

Are we going to get into IDE, SATA, SCSI, SAS, FC, and iSCSI?  If the
idea is to be complete that's where this would go.  I don't know that
the documentation needs to address every possible way every possible
filesystem can be flushed.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
   www.2ndQuadrant.us


From:
Bruce Momjian
Date:

Ron Mayer wrote:
> Bruce Momjian wrote:
> > Greg Smith wrote:
> >> Bruce Momjian wrote:
> >>> I have added documentation about the ATAPI drive flush command, and the
> >>
> >> If one of us goes back into that section one day to edit again it might
> >> be worth mentioning that FLUSH CACHE EXT is the actual ATAPI-6 command
> >> that a drive needs to support properly.  I wouldn't bother with another
> >> doc edit commit just for that specific part though, pretty obscure.
> >
> > That setting name was not easy to find so I added it to the
> > documentation.
>
> If we're spelling out specific IDE commands, it might be worth
> noting that the corresponding SCSI command is "SYNCHRONIZE CACHE"[1].
>
>
> Linux apparently sends FLUSH_CACHE commands to IDE drives in the
> exact same places it sends SYNCHRONIZE CACHE commands to SCSI
> drives[2].
>
> It seems that the same file systems, SW raid layers,
> virtualization platforms, and kernels that have a problem
> sending FLUSH CACHE commands to SATA drives have the exact
> same problems sending SYNCHRONIZE CACHE commands to SCSI drives.
> With the exact same effect of not getting writes all the way
> through disk caches.

I always assumed SCSI disks had a write-through cache and therefore
didn't need a drive cache flush command.

--
  Bruce Momjian  <>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  PG East:  http://www.enterprisedb.com/community/nav-pg-east-2010.do

From:
Bruce Momjian
Date:

Greg Smith wrote:
> Ron Mayer wrote:
> > Linux apparently sends FLUSH_CACHE commands to IDE drives in the
> > exact same places it sends SYNCHRONIZE CACHE commands to SCSI
> > drives[2].
> >   [2] http://hardware.slashdot.org/comments.pl?sid=149349&cid=12519114
> >
>
> Well, that's old enough to not even be completely right anymore about
> SATA disks and kernels.  It's FLUSH_CACHE_EXT that's been added to ATA-6
> to do the right thing on modern drives and that gets used nowadays, and
> that doesn't necessarily do so on most of the SSDs out there; all of
> which Bruce's recent doc additions now talk about correctly.
>
> There's this one specific area we know about that the most popular
> systems tend to get really wrong all the time; that's got the
> appropriate warning now with the right magic keywords that people can
> look into it more if motivated.  While it would be nice to get super
> thorough and document everything, I think there's already more docs in
> there than this project would prefer to have to maintain in this area.
>
> Are we going to get into IDE, SATA, SCSI, SAS, FC, and iSCSI?  If the
> idea is to be complete that's where this would go.  I don't know that
> the documentation needs to address every possible way every possible
> filesystem can be flushed.

The bottom line is that the reason we have so much detailed
documentation about this is that mostly only database folks care about
such issues, so we end up having to research and document this
ourselves --- I don't see any alternatives.

--
  Bruce Momjian  <>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  PG East:  http://www.enterprisedb.com/community/nav-pg-east-2010.do

From:
Greg Smith
Date:

Bruce Momjian wrote:
> I always assumed SCSI disks had a write-through cache and therefore
> didn't need a drive cache flush command.
>

There's more detail on all this mess at
http://wiki.postgresql.org/wiki/SCSI_vs._IDE/SATA_Disks and it includes
this perception, which I've recently come to believe isn't actually
correct anymore.  Like the IDE crowd, it looks like one day somebody
said "hey, we lose every write-heavy benchmark badly because we only
have a write-through cache", and that principle fell by the
wayside.  What has been true, and I'm starting to think this is what
we've all been observing rather than a write-through cache, is that the
proper cache flushing commands have been there in working form for so
much longer that it's more likely your SCSI driver and drive do the
right thing if the filesystem asks them to.  SCSI SYNCHRONIZE CACHE has
a much longer and prouder history than IDE's FLUSH_CACHE and SATA's
FLUSH_CACHE_EXT.

It's also worth noting that many current SAS drives, the current SCSI
incarnation, are basically SATA drives with a bridge chipset stuck onto
them, or with just the interface board swapped out.  This is one reason why
top-end SAS capacities lag behind consumer SATA drives.  They use the
consumers as beta testers to get the really fundamental firmware issues
sorted out, and once things are stable they start stamping out the
version with the SAS interface instead.  (Note that there's a parallel
manufacturing approach that makes much smaller SAS drives, the 2.5"
server models or those at higher RPMs, that doesn't go through this
path.  Those are also the really expensive models, due to economy of
scale issues).  The idea that these would have fundamentally different
write cache behavior doesn't really follow from that development model.

At this point, there are only two common differences between "consumer"
and "enterprise" hard drives of the same size and RPM when there are
directly matching ones:

1) You might get SAS instead of SATA as the interface, which provides
the more mature command set I was talking about above--and therefore may
give you a sane write-back cache with proper flushing, which is all the
database really expects.

2) The timeouts when there's a read/write problem are tuned down in the
enterprise version, to be more compatible with RAID setups where you
want to push the drive off-line when this happens rather than presuming
you can fix it.  Consumers would prefer that the drive spent a lot of
time doing heroics to try and save their sole copy of the apparently
missing data.

You might get a slightly higher grade of parts if you're lucky too; I
wouldn't count on it though.  That seems to be saved for the high RPM or
smaller size drives only.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
   www.2ndQuadrant.us


From:
"Pierre C"
Date:

> I always assumed SCSI disks had a write-through cache and therefore
> didn't need a drive cache flush command.

Maximum performance can only be reached with a writeback cache so the
drive can reorder and cluster writes, according to the realtime position
of the heads and platter rotation.

The problem is not the write cache itself, it is that, for your data to be
safe, the "flush cache" or "barrier" command must get all the way through
the application / filesystem to the hardware, going through a nondescript
number of software/firmware/hardware layers, all of which may :

- not specify if they honor or ignore flush/barrier commands, and which
ones
- not specify if they will reorder writes ignoring barriers/flushes or not
- have been written by people who are not aware of such issues
- have been written by companies who are perfectly aware of such issues
but chose to ignore them to look good in benchmarks
- have some incompatibilities that result in broken behaviour
- have bugs

As far as I'm concerned, a configuration that doesn't properly respect the
commands needed for data integrity is broken.

The sad truth is that given a software/hardware IO stack, there's no way
to be sure, and testing isn't easy, if at all possible to do. Some cache
flushes might be ignored under some circumstances.

For this to change, you don't need a hardware change, but a mentality
change.

Flash filesystem developers use flash simulators which measure wear
leveling, etc.

We'd need a virtual box with a simulated virtual harddrive which is able
to check this.

What a mess.
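
The kind of virtual test drive described above can at least be sketched
(a toy Python model, nowhere near a real verification rig):

```python
# A drive with a volatile write-back cache. An honest drive commits
# cached writes on flush(); a lying drive acks the flush but keeps the
# data in cache, so a power loss eats writes the host believed durable.

class SimDrive:
    def __init__(self, honors_flush):
        self.platter = {}            # durable media
        self.cache = {}              # volatile write-back cache
        self.honors_flush = honors_flush

    def write(self, lba, data):
        self.cache[lba] = data       # fast ack, data only in cache

    def flush(self):                 # FLUSH CACHE / SYNCHRONIZE CACHE
        if self.honors_flush:
            self.platter.update(self.cache)
            self.cache.clear()
        return True                  # both drives report success

    def power_loss(self):
        self.cache.clear()           # volatile contents vanish

honest, liar = SimDrive(True), SimDrive(False)
for drive in (honest, liar):
    drive.write(0, b"commit record")
    drive.flush()                    # fsync-equivalent: "success" either way
    drive.power_loss()
# honest.platter kept the record; liar.platter is empty despite the ack
```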


From:
Ron Mayer
Date:

Greg Smith wrote:
> Bruce Momjian wrote:
>> I always assumed SCSI disks had a write-through cache and therefore
>> didn't need a drive cache flush command.

Some do.  Some SCSI disks have write-back caches.

Some have both(!) - a write-back cache but the user can explicitly
send write-through requests.

Microsoft explains it well (IMHO) here:
http://msdn.microsoft.com/en-us/library/aa508863.aspx
  "For example, suppose that the target is a SCSI device with
   a write-back cache. If the device supports write-through
   requests, the initiator can bypass the write cache by
   setting the force unit access (FUA) bit in the command
   descriptor block (CDB) of the write command."
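
For concreteness, here is where the FUA bit sits in a WRITE(10) command
block (a Python sketch of the CDB layout from the SCSI block commands
spec; the helper name is mine):

```python
import struct

# Build a 10-byte SCSI WRITE(10) CDB. Byte 0 is the opcode (0x2A); bit 3
# of byte 1 is FUA, which forces this write through the drive's cache.
def write10_cdb(lba, num_blocks, fua=False):
    flags = 0x08 if fua else 0x00
    # opcode, flags, 32-bit LBA, group number, transfer length, control
    return struct.pack(">BBIBHB", 0x2A, flags, lba, 0, num_blocks, 0)

cdb = write10_cdb(lba=1234, num_blocks=8, fua=True)
assert len(cdb) == 10 and cdb[0] == 0x2A and cdb[1] & 0x08
```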

> this perception, which I've recently come to believe isn't actually
> correct anymore.  ... I'm staring to think this is what
> we've all been observing rather than a write-through cache

I think what we've been observing is that guys with SCSI drives
are more likely to either
 (a) have battery-backed RAID controllers that ensure writes succeed,
or
 (b) have other decent RAID controllers that understand details
     like that FUA bit to send write-through requests even if
     a SCSI device has a write-back cache.

In contrast, most guys with PATA drives are probably running
software RAID (if any) with a RAID stack (older LVM and MD)
known to lose the cache flushing commands.