Thread: Reliability with RAID 10 SSD and Streaming Replication


From: Cuong Hoang
Hi all,

Our application is write-heavy, and IO utilisation has been a problem for us for a while. We've decided to use a RAID 10 array of 4x500GB Samsung 840 Pros for the master server. I'm aware of the write-cache issue on SSDs in case of power loss. However, our hosting provider doesn't offer any SSD drives with a supercapacitor. To minimise risk, we will also set up another server with a SAS RAID 10 as a streaming replication standby. For our application, a few seconds of data loss is acceptable.

My question is: would corrupted data files on the primary server affect the streaming standby? In other words, is this setup acceptable in terms of minimising the deficiencies of SSDs?

Cheers,
Cuong

Re: Reliability with RAID 10 SSD and Streaming Replication

From: Merlin Moncure
On Thu, May 16, 2013 at 9:46 AM, Cuong Hoang <climbingrose@gmail.com> wrote:
> Hi all,
>
> Our application is heavy write and IO utilisation has been the problem for
> us for a while. We've decided to use RAID 10 of 4x500GB Samsung 840 Pro for
> the master server. I'm aware of write cache issue on SSDs in case of power
> loss. However, our hosting provider doesn't offer any other choices of SSD
> drives with supercapacitor. To minimise risk, we will also set up another
> RAID 10 SAS in streaming replication mode. For our application, a few
> seconds of data loss is acceptable.
>
> My question is, would corrupted data files on the primary server affect the
> streaming standby? In other word, is this setup acceptable in terms of
> minimising deficiency of SSDs?

Data corruption caused by a sudden power event on the master will not
cross over.  Basically, with this configuration you must switch over to
the standby in that case.  Corruption caused by other issues, say a
faulty drive, will transfer over, however.  Consider the block checksum
feature of 9.3 as a strategy to reduce the risk of that class of issue.
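For reference, checksums have to be chosen when the cluster is created and can't be switched on later for an existing cluster; a sketch (the data directory path is illustrative):

```shell
# PostgreSQL 9.3+: enable per-page checksums when creating the cluster
initdb --data-checksums -D /var/lib/postgresql/9.3/main
```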

merlin


Re: Reliability with RAID 10 SSD and Streaming Replication

From: Jeff Janes
On Thu, May 16, 2013 at 7:46 AM, Cuong Hoang <climbingrose@gmail.com> wrote:
> Hi all,
>
> Our application is heavy write and IO utilisation has been the problem for us for a while. We've decided to use RAID 10 of 4x500GB Samsung 840 Pro for the master server. I'm aware of write cache issue on SSDs in case of power loss. However, our hosting provider doesn't offer any other choices of SSD drives with supercapacitor. To minimise risk, we will also set up another RAID 10 SAS in streaming replication mode. For our application, a few seconds of data loss is acceptable.
>
> My question is, would corrupted data files on the primary server affect the streaming standby? In other word, is this setup acceptable in terms of minimising deficiency of SSDs?


That seems rather scary to me for two reasons.  

If the data center has a sudden power failure, why would it not take out both machines either simultaneously or in short succession?  Can you verify that the hosting provider does not have them on the same UPS (or even worse, as two virtual machines on the same physical host)?

The other issue is that you'd have to make sure the master does not restart after a crash.  If your init.d scripts just blindly start postgresql, then after a sudden OS restart it will automatically enter recovery and then open as usual, even though it might be silently corrupt.  At that point it will be generating WAL based on corrupt data (and incorrect query results), and propagating that to the standby.   So you have to be paranoid that if the master ever crashes, it is shot in the head and then reconstructed from the standby.
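One way to guard against that on a sysvinit-era box is to take PostgreSQL out of the boot sequence entirely, so a human has to decide whether the master is trustworthy after a crash (these are the stock Debian/RHEL commands; the service name may differ on your system):

```shell
# Debian/Ubuntu: stop the init script from starting PostgreSQL at boot
update-rc.d postgresql disable

# RHEL/CentOS equivalent
chkconfig postgresql off
```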

Cheers,

Jeff

Re: Reliability with RAID 10 SSD and Streaming Replication

From: Merlin Moncure
On Thu, May 16, 2013 at 1:34 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Thu, May 16, 2013 at 7:46 AM, Cuong Hoang <climbingrose@gmail.com> wrote:
>>
>> Hi all,
>>
>> Our application is heavy write and IO utilisation has been the problem for
>> us for a while. We've decided to use RAID 10 of 4x500GB Samsung 840 Pro for
>> the master server. I'm aware of write cache issue on SSDs in case of power
>> loss. However, our hosting provider doesn't offer any other choices of SSD
>> drives with supercapacitor. To minimise risk, we will also set up another
>> RAID 10 SAS in streaming replication mode. For our application, a few
>> seconds of data loss is acceptable.
>>
>> My question is, would corrupted data files on the primary server affect
>> the streaming standby? In other word, is this setup acceptable in terms of
>> minimising deficiency of SSDs?
>
>
>
> That seems rather scary to me for two reasons.
>
> If the data center has a sudden power failure, why would it not take out
> both machines either simultaneously or in short succession?  Can you verify
> that the hosting provider does not have them on the same UPS (or even worse,
> as two virtual machines on the same physical host)?

I took his standby's "RAID 10 SAS" to mean a disk-drive-based
standby.  Agreed that the server should not be configured to
autostart through init.d.

merlin


Re: Reliability with RAID 10 SSD and Streaming Replication

From: Jeff Janes
On Thu, May 16, 2013 at 11:46 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
On Thu, May 16, 2013 at 1:34 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Thu, May 16, 2013 at 7:46 AM, Cuong Hoang <climbingrose@gmail.com> wrote:
>>
>> Hi all,
>>
>> Our application is heavy write and IO utilisation has been the problem for
>> us for a while. We've decided to use RAID 10 of 4x500GB Samsung 840 Pro for
>> the master server. I'm aware of write cache issue on SSDs in case of power
>> loss. However, our hosting provider doesn't offer any other choices of SSD
>> drives with supercapacitor. To minimise risk, we will also set up another
>> RAID 10 SAS in streaming replication mode. For our application, a few
>> seconds of data loss is acceptable.
>>
>> My question is, would corrupted data files on the primary server affect
>> the streaming standby? In other word, is this setup acceptable in terms of
>> minimising deficiency of SSDs?
>
>
>
> That seems rather scary to me for two reasons.
>
> If the data center has a sudden power failure, why would it not take out
> both machines either simultaneously or in short succession?  Can you verify
> that the hosting provider does not have them on the same UPS (or even worse,
> as two virtual machines on the same physical host)?

> I took it to mean that his standby's "raid 10 SAS" meant disk drive
> based standby.

I had not considered that.   If the master can't keep up with IO using disk drives, wouldn't a replica using them probably fall infinitely far behind trying to keep up with the workload?

Maybe the best choice would just be to stick with the current set-up (one server, spinning rust) and turn off synchronous_commit, since he is already willing to accept the loss of a few seconds of transactions.

Cheers,

Jeff

Re: Reliability with RAID 10 SSD and Streaming Replication

From: Cuong Hoang
Thank you for your advice, guys. We'll definitely disable the init.d script for PostgreSQL on the master. The standby host will be disk-based, so it will be less vulnerable to power loss.

I forgot to mention that we'll also set up WAL-E to ship base backups and WALs to Amazon S3 continuously as another safety measure. Again, the loss of a few WAL segments would not be a big issue for us.
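For reference, WAL-E hooks into archive_command; a minimal sketch following its documented pattern (the envdir path holding the S3 credentials is illustrative):

```
# postgresql.conf
wal_level = archive
archive_mode = on
archive_command = 'envdir /etc/wal-e.d/env wal-e wal-push %p'
```

Base backups are then pushed periodically, e.g. `envdir /etc/wal-e.d/env wal-e backup-push $PGDATA` from cron.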

Do you think that this setup will be acceptable for our purposes?

Thanks,
Cuong



Re: Reliability with RAID 10 SSD and Streaming Replication

From: Tomas Vondra
Hi,

On 16.5.2013 16:46, Cuong Hoang wrote:
> Hi all,
>
> Our application is heavy write and IO utilisation has been the problem
> for us for a while. We've decided to use RAID 10 of 4x500GB Samsung 840

What does "heavy write" mean in your case? Does that mean a lot of small
transactions or few large ones?

What have you done to tune the server?

> Pro for the master server. I'm aware of write cache issue on SSDs in
> case of power loss. However, our hosting provider doesn't offer any
> other choices of SSD drives with supercapacitor. To minimise risk, we
> will also set up another RAID 10 SAS in streaming replication mode. For
> our application, a few seconds of data loss is acceptable.

Streaming replication allows zero data loss if used in synchronous mode.

> My question is, would corrupted data files on the primary server affect
> the streaming standby? In other word, is this setup acceptable in terms
> of minimising deficiency of SSDs?

It should be.

Have you considered using a UPS? That would make the SSDs about as
reliable as SATA/SAS drives - the UPS may fail, but so may a BBU unit on
the SAS controller.

Tomas


Re: Reliability with RAID 10 SSD and Streaming Replication

From: Cuong Hoang
Hi Tomas,

We have a lot of small updates and some inserts. The database size is 35GB including indexes and TOAST, and we think it will keep growing to about 200GB. We usually get a burst of about 500k writes in about 5-10 minutes, which basically cripples IO on the current server. I've tried increasing checkpoint_segments, checkpoint_timeout etc. as recommended in the "PostgreSQL 9.0 High Performance" book. However, it seems like our server just can't handle the current load.

Here are the server specs:

Dual E5620, 32GB RAM, 4x1TB SAS 15k in RAID10

Here are some core PostgreSQL configs:

shared_buffers = 2GB                    # min 128kB
work_mem = 64MB                         # min 64kB
maintenance_work_mem = 1GB              # min 1MB
wal_buffers = 16MB
checkpoint_segments = 128
checkpoint_timeout = 30min
checkpoint_completion_target = 0.7
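A rough sanity check of what those settings imply, using the peak pg_xlog sizing estimate from the 9.x documentation ((2 + checkpoint_completion_target) * checkpoint_segments + 1 segments):

```python
segment_mb = 16  # WAL segment size in 9.x, in MB
checkpoint_segments = 128
checkpoint_completion_target = 0.7

# WAL written before a forced (segment-count-driven) checkpoint fires
wal_between_checkpoints_mb = checkpoint_segments * segment_mb
print(wal_between_checkpoints_mb)  # 2048, i.e. a checkpoint per ~2GB of WAL

# approximate peak pg_xlog size per the docs formula
peak_segments = (2 + checkpoint_completion_target) * checkpoint_segments + 1
print(round(peak_segments * segment_mb / 1024, 1))  # ~5.4 (GB)
```

So with this configuration a checkpoint fires roughly every 2GB of WAL or every 30 minutes, whichever comes first.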


Thanks,
Cuong



Re: Reliability with RAID 10 SSD and Streaming Replication

From: Mark Kirkwood
On 17/05/13 12:06, Tomas Vondra wrote:
> Hi,
>
> On 16.5.2013 16:46, Cuong Hoang wrote:

>> Pro for the master server. I'm aware of write cache issue on SSDs in
>> case of power loss. However, our hosting provider doesn't offer any
>> other choices of SSD drives with supercapacitor. To minimise risk, we
>> will also set up another RAID 10 SAS in streaming replication mode. For
>> our application, a few seconds of data loss is acceptable.
>
> Streaming replication allows zero data loss if used in synchronous mode.
>

I'm not sure synchronous replication is really an option here as it will
slow the master down to spinning disk io speeds, unless the standby is
configured with SSDs as well - which probably defeats the purpose of
this setup.

On the other hand, if the system is so loaded that a pure SAS (spinning
drive) solution can't keep up, then the standby lag may get to be way
more than a few seconds...which means look out for huge data loss.

I'd be inclined to apply more leverage to hosting provider to source
SSDs suitable for your needs, or change hosting providers.

Regards

Mark


Re: Reliability with RAID 10 SSD and Streaming Replication

From: David Rees
On Thu, May 16, 2013 at 7:46 AM, Cuong Hoang <climbingrose@gmail.com> wrote:
> For our application, a few seconds of data loss is acceptable.

If a few seconds of data loss is acceptable, I would seriously look at
the synchronous_commit setting and think about turning that off rather
than risk silent corruption with non-enterprise SSDs.

http://www.postgresql.org/docs/9.2/interactive/runtime-config-wal.html#GUC-SYNCHRONOUS-COMMIT

"Unlike fsync, setting this parameter to off does not create any risk
of database inconsistency: an operating system or database crash might
result in some recent allegedly-committed transactions being lost, but
the database state will be just the same as if those transactions had
been aborted cleanly. So, turning synchronous_commit off can be a
useful alternative when performance is more important than exact
certainty about the durability of a transaction."

With a default wal_writer_delay setting of 200ms, you will only be at
risk of losing at most 600ms of transactions in the event of an
unexpected crash or power loss, but write performance should go up a
huge amount, especially if they are a lot of small writes as you
describe.
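Note that synchronous_commit can also be relaxed per session or per transaction instead of cluster-wide, so only the high-volume writes give up the durability guarantee; a sketch (the table is hypothetical):

```sql
-- leave the cluster-wide default on; opt out only where losing the
-- last few hundred ms of transactions is acceptable
BEGIN;
SET LOCAL synchronous_commit = off;
UPDATE page_hits SET hits = hits + 1 WHERE page_id = 42;
COMMIT;
```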

-Dave


Re: Reliability with RAID 10 SSD and Streaming Replication

From: Merlin Moncure
On Fri, May 17, 2013 at 1:34 AM, David Rees <drees76@gmail.com> wrote:
> On Thu, May 16, 2013 at 7:46 AM, Cuong Hoang <climbingrose@gmail.com> wrote:
>> For our application, a few seconds of data loss is acceptable.
>
> If a few seconds of data loss is acceptable, I would seriously look at
> the synchronous_commit setting and think about turning that off rather
> than risk silent corruption with non-enterprise SSDs.

That is not going to help.  Since the drives lie about fsync, upon a
power event you must assume the database is corrupt.  I think his
proposed configuration is the best bet (although I would strongly
consider putting SSD on the standby as well).   Personally, I think
non SSD drives are obsolete for database purposes and will not
recommend them for any configuration.  Ideally though, OP would be
using S3700 and we wouldn't be having this conversation.

merlin


Re: Reliability with RAID 10 SSD and Streaming Replication

From: Merlin Moncure
On Fri, May 17, 2013 at 8:17 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Fri, May 17, 2013 at 1:34 AM, David Rees <drees76@gmail.com> wrote:
>> On Thu, May 16, 2013 at 7:46 AM, Cuong Hoang <climbingrose@gmail.com> wrote:
>>> For our application, a few seconds of data loss is acceptable.
>>
>> If a few seconds of data loss is acceptable, I would seriously look at
>> the synchronous_commit setting and think about turning that off rather
>> than risk silent corruption with non-enterprise SSDs.
>
> That is not going to help.


whoops -- misread your post heh (you were suggesting to use classic
hard drives).  yeah, that might work but it only buys you so much
particularly if there is a lot of random activity in the heap.

merlin


Re: Reliability with RAID 10 SSD and Streaming Replication

From: Tomas Vondra
On 17.5.2013 03:34, Mark Kirkwood wrote:
> On 17/05/13 12:06, Tomas Vondra wrote:
>> Hi,
>>
>> On 16.5.2013 16:46, Cuong Hoang wrote:
>
>>> Pro for the master server. I'm aware of write cache issue on SSDs in
>>> case of power loss. However, our hosting provider doesn't offer any
>>> other choices of SSD drives with supercapacitor. To minimise risk, we
>>> will also set up another RAID 10 SAS in streaming replication mode. For
>>> our application, a few seconds of data loss is acceptable.
>>
>> Streaming replication allows zero data loss if used in synchronous mode.
>>
>
> I'm not sure synchronous replication is really an option here as it will
> slow the master down to spinning disk io speeds, unless the standby is
> configured with SSDs as well - which probably defeats the purpose of
> this setup.

The master only waits for the standby to confirm receipt of the data,
not for it to be written to disk. The standby will have to write it
eventually (and that might cause issues), but I'm not really sure it's
that simple.

> On the other hand, if the system is so loaded that a pure SAS (spinning
> drive) solution can't keen up, then the standby lag may get to be way
> more than a few seconds...which means look out for huge data loss.

Don't forget the slave does not perform all the I/O (searching for the
row etc.). It's difficult to say how much this will save, though.

Tomas


Re: Reliability with RAID 10 SSD and Streaming Replication

From: Tomas Vondra
Do you really need a running standby for fast failover? What about doing
plain WAL archiving? I'd definitely consider that, because even if you
set up a SAS-based replica, you can't use it for production, as it does
not handle the load.

I think you could set up WAL archiving and, in case of a crash, just use
the base backup and replay the WAL from the archive.

This means the SAS-based system is purely for WAL archiving, i.e.
performs only sequential writes which should not be a big deal.

The recovery will be performed on the SSD system, which should handle it
fine. If you need faster recovery, you may perform it incrementally on
the SAS system (it will take some time, but it won't influence the
master). You might do that daily or something like that.

The only problem with this is that it is file based, which could mean
some lag (up to 16MB or archive_timeout). But this should not be a
problem if you place the WAL on SAS drives behind a controller. If you
use RAID, you should be perfectly fine.

So this is what I'd suggest:

  1) use SSD for data files, SAS RAID1 for WAL on the master
  2) setup WAL archiving (base backup + archive on SAS system)
  3) update the base backup daily (incremental recovery)
  4) in case of crash, keep WAL from the archive and pg_xlog on the
     SAS RAID (on master)
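A minimal sketch of steps 2 and 4, with illustrative paths:

```shell
# 2) take a base backup onto the SAS archive host (pg_basebackup, 9.1+)
pg_basebackup -h master -D /archive/base -Ft -z

# 4) on recovery, restore the base backup into the data directory and
#    replay the archived WAL via recovery.conf:
#      restore_command = 'cp /archive/wal/%f %p'
```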


Tomas



Re: Reliability with RAID 10 SSD and Streaming Replication

From: Cuong Hoang
Thanks for the suggestion, Tomas. We're about to set up WAL backups to Amazon S3, which I think should cover all of our bases. At least for the moment, the SAS-based standby seems to keep up with the master because keeping up is its sole purpose; we're not sending queries to the hot standby. We also considered switching from the hot standby to the plain WAL-archiving approach you suggested, but I guess for now we should stick with streaming replication because the slave is still keeping up with the master.

Btw, after switching to SSDs, performance has improved vastly: IO utilisation dropped from 100% to 6% in peak periods. That's an order of magnitude faster!

Cheers,
Cuong



Re: Reliability with RAID 10 SSD and Streaming Replication

From: Greg Smith
On 5/16/13 8:06 PM, Tomas Vondra wrote:
> Have you considered using a UPS? That would make the SSDs about as
> reliable as SATA/SAS drives - the UPS may fail, but so may a BBU unit on
> the SAS controller.

That's not true at all.  Any decent RAID controller will have an option
to stop write-back caching when the battery is bad.  Things will slow
badly when that happens, but there is zero data risk from a short-term
BBU failure.  The only serious risk with a good BBU setup is that
you'll have a power failure lasting so long that the battery runs down
before the cache can be flushed to disk.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


Re: Reliability with RAID 10 SSD and Streaming Replication

From: Greg Smith
On 5/16/13 7:52 PM, Cuong Hoang wrote:
> The standby host will be disk-based so it
> will be less vulnerable to power loss.

If it can keep up with replay from the faster master, that sounds like a
decent backup.  Make sure you setup all write caches very carefully on
that system, because it's going to be your best hope to come back up
quickly after a real crash.

Any vendor that pushes Samsung 840 drives for database use should be
ashamed of themselves.  Those drives are turning into the new
incarnation of what we saw with the Intel X25-E/X25-M:  they're very
popular, but any system built with them will corrupt itself on the first
failure.  I expect a new spike in people needing data recovery help
after losing their Samsung 840 based servers to start soon.

> I forgot to mention that we'll set up Wal-e
> <https://github.com/wal-e/wal-e> to ship base backups and WALs to Amazon
> S3 continuous as another safety measure. Again, the lost of a few WALs
> would not be a big issue for us.

That's a useful plan.  Just make sure you ship new base backups fairly
often.  If you have to fall back to that copy of the data, you'll need
to replay everything that's happened since the last base backup.  That
can easily result in a week of downtime if you're only shipping backups
once per month, for example.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


Re: Reliability with RAID 10 SSD and Streaming Replication

From: Tomas Vondra
On 20.5.2013 05:00, Greg Smith wrote:
> On 5/16/13 8:06 PM, Tomas Vondra wrote:
>> Have you considered using a UPS? That would make the SSDs about as
>> reliable as SATA/SAS drives - the UPS may fail, but so may a BBU unit on
>> the SAS controller.
>
> That's not true at all.  Any decent RAID controller will have an option
> to stop write-back caching when the battery is bad.  Things will slow
> badly when that happens, but there is zero data risk from a short-term
> BBU failure.  The only serious risk with a good BBU setup are that
> you'll have a power failure lasting so long that the battery runs down
> before the cache can be flushed to disk.

That's true, no doubt about that. What I was trying to say is that a
controller with BBU (or a SSD with proper write cache protection) is
about as safe as a UPS when it comes to power outages. Assuming both
are properly configured / watched / checked.

Sure, there are scenarios where UPS is not going to help (e.g. a PSU
failure) so a controller with BBU is better from this point of view.
I've seen crashes with both options (BBU / UPS), both because of
misconfiguration and hw issues. BTW I don't know what controller we are
talking about here - it might be as crappy as the SSD drives.

What I was thinking about in this case is using two SSD-based systems
with UPSes. That'd allow fast failover (which may not be possible with
the SAS based replica, as it does not handle the load).

But yes, I do agree that the provider should be ashamed for not
providing reliable SSDs in the first place. Getting reliable SSDs should
be the first option - all these suggestions are really just workarounds
for this rather simple issue.

Tomas


Re: Reliability with RAID 10 SSD and Streaming Replication

From: Merlin Moncure
On Mon, May 20, 2013 at 3:57 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
> But yes, I do agree that the provider should be ashamed for not
> providing reliable SSDs in the first place. Getting reliable SSDs should
> be the first option - all these suggestions are really just workarounds
> of this rather simple issue.

Absolutely.  Reliable SSD should be the first and only option.  They
are significantly more expensive (more than 2x) but are worth it.

When it comes to databases, particularly in the open source postgres
world, hard drives are completely obsolete.  SSDs are a couple of
orders of magnitude faster, and this (while still slow in computer
terms) is fast enough to put storage into the modern era for anyone
who is smart enough to connect a SATA cable.  While everyone likes to
obsess over super-scalable architectures, technology has finally
advanced to the point where your typical SMB system can be handled by
a single device.

merlin


Re: Reliability with RAID 10 SSD and Streaming Replication

From
Greg Smith
Date:
On 5/20/13 6:32 PM, Merlin Moncure wrote:

> When it comes to databases, particularly in the open source postgres
> world, hard drives are completely obsolete.  SSDs are a couple of
> orders of magnitude faster, and this (while still slow in computer
> terms) is fast enough to put storage into the modern era for anyone
> who is smart enough to connect a SATA cable.

You're skirting the edge of vendor Kool-Aid here.  I'm working on a very
detailed benchmark vs. real world piece centered on Intel's 710 models,
one of the few reliable drives on the market.  (Yes, I have a DC S3700
too, just not as much data yet)  While in theory these drives will hit
two orders of magnitude speed improvement, and I have benchmarks where
that's the case, in practice I've seen them deliver less than 5X better
too.  You get one guess which I'd consider more likely to happen on a
difficult database server workload.

The only really huge gain to be had using SSD is commit rate at a low
client count.  There you can easily do 5,000/second instead of a
spinning disk that is closer to 100, for less than what the
battery-backed RAID card alone costs to speed up mechanical drives.  My
test server's 100GB DC S3700 was $250.  That's still not two orders of
magnitude faster though.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


Re: Reliability with RAID 10 SSD and Streaming Replication

From
Merlin Moncure
Date:
On Tue, May 21, 2013 at 7:19 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> On 5/20/13 6:32 PM, Merlin Moncure wrote:
>
>> When it comes to databases, particularly in the open source postgres
>> world, hard drives are completely obsolete.  SSDs are a couple of
>> orders of magnitude faster, and this (while still slow in computer
>> terms) is fast enough to put storage into the modern era for anyone
>> who is smart enough to connect a SATA cable.
>
>
> You're skirting the edge of vendor Kool-Aid here.  I'm working on a very
> detailed benchmark vs. real world piece centered on Intel's 710 models, one
> of the few reliable drives on the market.  (Yes, I have a DC S3700 too, just
> not as much data yet)  While in theory these drives will hit two orders of
> magnitude speed improvement, and I have benchmarks where that's the case, in
> practice I've seen them deliver less than 5X better too.  You get one guess
> which I'd consider more likely to happen on a difficult database server
> workload.
>
> The only really huge gain to be had using SSD is commit rate at a low client
> count.  There you can easily do 5,000/second instead of a spinning disk that
> is closer to 100, for less than what the battery-backed RAID card alone
> costs to speed up mechanical drives.  My test server's 100GB DC S3700 was
> $250.  That's still not two orders of magnitude faster though.

That's most certainly *not* the only gain to be had: random read rates
on large databases (a very important metric for data analysis) can
easily hit 20k TPS.  So I'll stand by the figure.  Another point: that
5,000/second commit rate is sustained, whereas a RAID card will
spectacularly degrade once its cache overflows; it's not fair to
compare burst with sustained performance.  To hit a 5,000/second
sustained commit rate along with good random read performance, you'd
need a very expensive storage system.  Right now I'm working (not by
choice) with a tier-1 storage system (let's just say it rhymes with
'weefax') and I would trade it for direct-attached SSD in a heartbeat.

Also, note that 3rd party benchmarking is showing the 3700 completely
smoking the 710 in database workloads (for example, see
http://www.anandtech.com/show/6433/intel-ssd-dc-s3700-200gb-review/6).

Anyways, SSD installation in the post-capacitor era has been 100.0%
correlated in my experience (admittedly, across a dozen or so systems)
with the removal of storage as the primary performance bottleneck, and
I'll stand by that.  I'm not claiming to work with extremely high
transaction rate systems, but then again neither are most of the people
reading this list.  Disk drives are obsolete for database
installations.

merlin


Re: Reliability with RAID 10 SSD and Streaming Replication

From
Greg Smith
Date:
On 5/22/13 9:30 AM, Merlin Moncure wrote:
> That's most certainly *not* the only gain to be had: random read rates
> of large databases (a very important metric for data analysis) can
> easily hit 20k tps.  So I'll stand by the figure.

They can easily hit that number.  Or they can do this:

Device:     r/s    w/s  rMB/s  wMB/s avgrq-sz avgqu-sz  await svctm  %util
sdd     2702.80  19.40  19.67   0.16    14.91   273.68  71.74  0.37 100.00
sdd     2707.60  13.00  19.53   0.10    14.78   276.61  90.34  0.37 100.00

That's an Intel 710 being crushed by a random read database server
workload, unable to deliver even 3000 IOPS / 20MB/s.  I have hours of
data like this from several servers.  Yes, the DC S3700 drives are at
least twice as fast on average, but I haven't had one for long enough to
see what its worst case really looks like yet.

Here's a mechanical drive hitting its limits on the same server as the
above:

Device:     r/s     w/s  rMB/s  wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sdb      100.80  220.60   1.06   1.79    18.16   228.78  724.11   3.11 100.00
sdb      119.20  220.40   1.09   1.77    17.22   228.36  677.46   2.94 100.00

Giving around 3MB/s.  I am quite happy saying the SSD is delivering
about a single order of magnitude improvement, in both throughput and
latency.  But that's it, and a single order of magnitude improvement is
sometimes not good enough to solve all storage issues.

If all you care about is speed, the main situation where I've found
there to still be value in "tier 1 storage" are extremely write-heavy
workloads.  The best write numbers I've seen out of Postgres are still
going into a monster EMC unit, simply because the unit I was working
with had 16GB of durable cache.  Yes, that only supports burst speeds,
but 16GB absorbs a whole lot of writes before it fills.  Write
re-ordering and combining can accelerate traditional disk quite a bit
when it's across a really large horizon like that.

> Anyways, SSD installation in the post-capacitor era has been 100.0%
> correlated in my experience (admittedly, around a dozen or so systems)
> with removal of storage as the primary performance bottleneck, and
> I'll stand by that.

I wish it were that easy for everyone, but that's simply not true.  Are
there lots of systems where SSD makes storage look almost free it's so
fast?  Sure.  But presuming all systems will look like that is
optimistic, and it sets unreasonable expectations.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


Re: Reliability with RAID 10 SSD and Streaming Replication

From
Merlin Moncure
Date:
On Wed, May 22, 2013 at 9:18 AM, Greg Smith <greg@2ndquadrant.com> wrote:
> On 5/22/13 9:30 AM, Merlin Moncure wrote:
>>
>> That's most certainly *not* the only gain to be had: random read rates
>> of large databases (a very important metric for data analysis) can
>> easily hit 20k tps.  So I'll stand by the figure.
>
>
> They can easily hit that number.  Or they can do this:
>
> Device:     r/s    w/s  rMB/s  wMB/s avgrq-sz avgqu-sz  await svctm  %util
> sdd     2702.80  19.40  19.67   0.16    14.91   273.68  71.74  0.37 100.00
> sdd     2707.60  13.00  19.53   0.10    14.78   276.61  90.34  0.37 100.00

Yup -- I've seen this too... the high transaction rates quickly fall
over when there is concurrent writing (though for bulk 100% read OLAP
queries I see the higher figure more often than not).  Even so, it's
a huge difference over 100.  Unfortunately, I don't have an S3700 to
test with, but based on everything I've seen it looks like a
mostly solved problem (for example, see here:
http://www.storagereview.com/intel_ssd_dc_s3700_series_enterprise_ssd_review).
Tests that drive the 710 to <3k IOPS were not able to take the S3700
under 10k at any queue depth.  Take a good look at the 8k
preconditioning latency chart -- everything you need to know is
right there; it's a completely different controller and offers much
better worst-case performance.

merlin


Re: Reliability with RAID 10 SSD and Streaming Replication

From
Greg Smith
Date:
On 5/22/13 11:05 AM, Merlin Moncure wrote:
> unfortunately, I don't have a s3700 to
> test with, but based on everything i've seen it looks like it's a
> mostly solved problem. (for example, see here:
> http://www.storagereview.com/intel_ssd_dc_s3700_series_enterprise_ssd_review).
>    Tests that drive the 710 to <3k iops were not able to take the 3700
> down under 10k at any queue depth.

I have two weeks of real-world data from DC S3700 units in production
and a pile of synthetic test results.  The S3700 drives are at least 2X
as fast as the 710 models, and there are synthetic tests where it's
closer to 10X.

On a 5,000 IOPS workload that crushed a pair of 710 units, the new
drives are only hitting 50% utilization now.  Does that make worst-case
10K?  Maybe.  I can't just extrapolate from the 50% figures and predict
the throughput I'll see at 100% though, so I'm still waiting for more
data before I feel comfortable saying exactly what the worst case looks
like.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


Re: Reliability with RAID 10 SSD and Streaming Replication

From
Shaun Thomas
Date:
On 05/22/2013 08:30 AM, Merlin Moncure wrote:

> I'm not claiming to work with extremely high transaction rate systems
> but then again neither are most of the people reading this list.
> Disk drives are obsolete for database installations.

Well, you may not be able to make that claim, but I can. While we don't
use Intel SSDs, our first-gen FusionIO cards can deliver about 20k
PostgreSQL TPS on our real-world data right off the device before
caching effects start boosting the numbers. These days, newer devices
make our current batch look like rusty old hulks in comparison, so
the gap is only widening. Hard drives stand no chance at all.

An 8-drive 15k RPM RAID-10 gave us about 1800 TPS back when we switched
to FusionIO about two years ago. So, while Intel drives themselves may
not be able to hit sustained 100x speeds over spindles, it's pretty
clear that that's a firmware or implementation limitation.

The main "issue" is that sustained sequential scan speeds are
generally less than an order of magnitude faster than drives. So as soon
as you hit something that isn't limited by random IOPS, spindles get a
chance to catch up. But those situations are few and far between in a
heavy transactional setting. Having used NVRAM/SSDs, I could never go
back so long as the budget allows us to procure them.

A data warehouse? Maybe spindles still have a place there. Heavy
transactional system? Not a chance.

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-676-8870
sthomas@optionshouse.com

______________________________________________

See http://www.peak6.com/email_disclaimer/ for terms and conditions related to this email


Re: Reliability with RAID 10 SSD and Streaming Replication

From
David Boreham
Date:
On 5/22/2013 8:18 AM, Greg Smith wrote:
>
> They can easily hit that number.  Or they can do this:
>
> Device:     r/s    w/s  rMB/s  wMB/s avgrq-sz avgqu-sz  await svctm  %util
> sdd     2702.80  19.40  19.67   0.16    14.91   273.68  71.74  0.37 100.00
> sdd     2707.60  13.00  19.53   0.10    14.78   276.61  90.34  0.37 100.00
>
> That's an Intel 710 being crushed by a random read database server
> workload, unable to deliver even 3000 IOPS / 20MB/s.  I have hours of
> data like this from several servers.

This is interesting. Do you know what it is about the workload that
leads to the unusually low rps ?





Re: Reliability with RAID 10 SSD and Streaming Replication

From
Greg Smith
Date:
On 5/22/13 12:56 PM, Shaun Thomas wrote:
> Well, you may not be able to make that claim, but I can. While we don't
> use Intel SSDs, our first-gen FusionIO cards can deliver about 20k
> PostgreSQL TPS on our real-world data right off the device before
> caching effects start boosting the numbers.

I've seen FusionIO hit that 20K commit number, as well as hitting 75K
IOPS on random reads (600MB/s).  They are roughly 5 to 10X faster than
the Intel 320/710 drives.  There's a corresponding price hit though, and
having to provision PCI-E cards is a pain in some systems.

A claim that a FusionIO drive in particular is capable of 100X the
performance of a spinning drive, that I wouldn't dispute.  I even made
that claim myself with some benchmark numbers to back it up:
http://www.fusionio.com/blog/fusion-io-boosts-postgresql-performance/
That's not just a generic SSD anymore though.

> An 8-drive 15k RPM RAID-10 gave us about 1800 TPS back when we switched
> to FusionIO about two years ago. So, while Intel drives themselves may
> not be able to hit sustained 100x speeds over spindles, it's pretty
> clear that that's a firmware or implementation limitation.

1800 TPS to 20K TPS is just over a 10X speedup.

As for Intel vs. FusionIO, rather than implementation quality it's more
a question of what architecture you're willing to pay for.  If you test
a few models across Intel's product line, you can see there's a rough
size vs. speed correlation.  The larger units have more channels of
flash going at the same time.  FusionIO has architected things so that
there is a wide write path even on their smallest cards.  I got that 75K
IOPS number even out of their little 80GB card (since dropped from the
product line).

I can buy a good number of Intel DC S3700 drives for what a FusionIO
card costs though.

> The main "issue" is that sustained sequential scan speeds are
> generally less than an order of magnitude faster than drives. So as soon
> as you hit something that isn't limited by random IOPS, spindles get a
> chance to catch up.

I have some moderately fast SSD based transactional systems that are
still using traditional drives with battery-backed cache for the
sequential writes of the WAL volume, where the data volume is on Intel
710 disks.  WAL writes really burn through flash cells, too, so keeping
them on traditional drives can be cost effective in a few ways.  That
approach is lucky to hit 10K TPS though, so it can't compete against
what a PCI-E card like the FusionIO drives are capable of.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


Re: Reliability with RAID 10 SSD and Streaming Replication

From
Shaun Thomas
Date:
On 05/22/2013 01:06 PM, Greg Smith wrote:

> There's a corresponding price hit though, and
> having to provision PCI-E cards is a pain in some systems.

Oh, totally. Specialist devices like RAMSAN, FusionIO, Virident, or
Whiptail are hideously expensive, even compared to high-end SSDs. I was
just pointing out that the technical limitations of the underlying
chips (NVRAM) can be overcome or augmented in ways Intel isn't doing (yet).

> 1800 TPS to 20K TPS is just over a 10X speedup.

True. But in that case, it was a single device pitted against 8 very
high-end 15k RPM spindles. I'd need 80-100 drives in a massive SAN to
get similar numbers, and at that point, we're not really saving any
money and have a lot more failure points and maintenance.

I guess you get way more space, though. :)

> The larger units have more channels of flash going at the
> same time.  FusionIO has architected such that there is a wide write
> path even on their smallest cards.

Yep. And I've been watching these technologies like a hawk waiting for
the new chips and their performance profiles. Some of the newer chips
have performance multipliers even on a single die in the larger sizes.

> I can buy a good number of Intel DC S3700 drives for what a FusionIO
> card costs though.

I know. :(

But knowing the performance they can deliver, I often dream of a perfect
device comprised of several PCIe-based NVRAM cards in a hot-swap PCIe
enclosure (they exist!). Something like that in a 3U piece of gear would
absolutely annihilate even the largest SAN.

At the mere cost of a half million or so. :p

> I have some moderately fast SSD based transactional systems that are
> still using traditional drives with battery-backed cache for the
> sequential writes of the WAL volume, where the data volume is on
> Intel 710 disks.

That sounds like a very sane and recommendable approach, and
coincidentally the same we would use if we couldn't afford the FusionIO
drives.

I'm actually curious to see how ZFS, with its CoW profile and a bundle
of SSDs as a ZIL, would compare. It's still disk-based, but the
transparent SSD layer acting as a gigantic passive read and write cache
intrigues me. It seems like it would also make a good middle ground
between cost and performance.

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-676-8870
sthomas@optionshouse.com



Re: Reliability with RAID 10 SSD and Streaming Replication

From
Shaun Thomas
Date:
On 05/22/2013 12:31 PM, David Boreham wrote:

>> Device:     r/s    w/s  rMB/s  wMB/s avgrq-sz avgqu-sz  await svctm %util
>> sdd     2702.80  19.40  19.67   0.16    14.91   273.68  71.74 0.37 100.00
>> sdd     2707.60  13.00  19.53   0.10    14.78   276.61  90.34 0.37 100.00
>>
>> That's an Intel 710 being crushed by a random read database server
>> workload, unable to deliver even 3000 IOPS / 20MB/s.  I have hours of
>> data like this from several servers.
>
> This is interesting. Do you know what it is about the workload that
> leads to the unusually low rps ?

That read rate and that throughput suggest 8k reads. The queue size is
270+, which is pretty high for a single device, even when it's an SSD.
Some SSDs seem to break down on queue sizes over 4, and ~15-sector
requests spread across a read queue of 270 is pretty harsh. The drive
tested here basically fell over servicing a huge, diverse read queue,
which suggests a firmware issue.

Often this happens because the device was optimized for sequential reads
and posts lower random IOPS than is theoretically possible, so the vendor
can advertise higher headline numbers alongside consumer-grade disks.
They're Greg's disks though. :)
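
As a sanity check, the 8k-read interpretation falls straight out of the iostat columns quoted upthread. A quick sketch (function names are mine; avgrq-sz is reported in 512-byte sectors):

```python
# Deriving average read size from the sdd iostat sample quoted in this
# thread, two ways: from throughput/rate, and from avgrq-sz directly.
# Both land near PostgreSQL's default 8 KiB block size.

SECTOR_BYTES = 512

def read_size_kib_from_rates(rmb_per_s: float, reads_per_s: float) -> float:
    """Average read size in KiB derived from rMB/s and r/s."""
    return rmb_per_s * 1024.0 / reads_per_s

def read_size_kib_from_avgrq(avgrq_sz_sectors: float) -> float:
    """Average request size in KiB derived from avgrq-sz (512-byte sectors)."""
    return avgrq_sz_sectors * SECTOR_BYTES / 1024.0

# First sdd sample: 2702.80 r/s, 19.67 rMB/s, avgrq-sz 14.91
print(read_size_kib_from_rates(19.67, 2702.80))  # ~7.5 KiB per read
print(read_size_kib_from_avgrq(14.91))           # ~7.5 KiB per request
```

Both estimates agree, which is what you'd expect when the queue is dominated by uniform 8k random reads.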

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-676-8870
sthomas@optionshouse.com



Re: Reliability with RAID 10 SSD and Streaming Replication

From
"Joshua D. Drake"
Date:
On 05/22/2013 11:06 AM, Greg Smith wrote:

> I have some moderately fast SSD based transactional systems that are
> still using traditional drives with battery-backed cache for the
> sequential writes of the WAL volume, where the data volume is on Intel
> 710 disks.  WAL writes really burn through flash cells, too, so keeping
> them on traditional drives can be cost effective in a few ways.  That
> approach is lucky to hit 10K TPS though, so it can't compete against
> what a PCI-E card like the FusionIO drives are capable of.

Greg, can you elaborate on the SSD + Xlog issue? What type of burn
through are we talking about?

JD



--
Command Prompt, Inc. - http://www.commandprompt.com/  509-416-6579
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc
For my dreams of your image that blossoms
    a rose in the deeps of my heart. - W.B. Yeats


Re: Reliability with RAID 10 SSD and Streaming Replication

From
Greg Smith
Date:
On 5/22/13 3:06 PM, Joshua D. Drake wrote:
> Greg, can you elaborate on the SSD + Xlog issue? What type of burn
> through are we talking about?

You're burning through flash cells at a multiple of the total WAL write
volume.  The system I gave iostat snapshots from upthread (with the
Intel 710 hitting its limit) archives about 1TB of WAL each week.  The
actual amount of WAL written in terms of erased flash blocks is even
higher though, because sometimes the flash is hit with partial page
writes.  The write amplification of WAL is much worse than the main
database.

I gave a rough intro to this on the Intel drives at
http://blog.2ndquadrant.com/intel_ssds_lifetime_and_the_32/ and there's
a nice "Write endurance" table at
http://www.tomshardware.com/reviews/ssd-710-enterprise-x25-e,3038-2.html

The cheapest of the Intel SSDs I have here only guarantees 15TB of total
write endurance.  Eliminating >1TB of writes per week by moving the WAL
off SSD is a pretty significant change, even though the burn rate isn't
a simple linear thing--you won't burn the flash out in only 15 weeks.

The production server is actually using the higher grade 710 drives that
aim for 900TB instead.  But I do have standby servers using the low
grade stuff, so anything I can do to decrease SSD burn rate without
dropping performance is useful.  And only the top tier of transaction
rates will outrun a RAID1 pair of 15K drives dedicated to WAL.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


Re: Reliability with RAID 10 SSD and Streaming Replication

From
Merlin Moncure
Date:
On Wed, May 22, 2013 at 2:30 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> On 5/22/13 3:06 PM, Joshua D. Drake wrote:
>>
>> Greg, can you elaborate on the SSD + Xlog issue? What type of burn
>> through are we talking about?
>
>
> You're burning through flash cells at a multiple of the total WAL write
> volume.  The system I gave iostat snapshots from upthread (with the Intel
> 710 hitting its limit) archives about 1TB of WAL each week.  The actual
> amount of WAL written in terms of erased flash blocks is even higher though,
> because sometimes the flash is hit with partial page writes.  The write
> amplification of WAL is much worse than the main database.
>
> I gave a rough intro to this on the Intel drives at
> http://blog.2ndquadrant.com/intel_ssds_lifetime_and_the_32/ and there's a
> nice "Write endurance" table at
> http://www.tomshardware.com/reviews/ssd-710-enterprise-x25-e,3038-2.html
>
> The cheapest of the Intel SSDs I have here only guarantees 15TB of total
> write endurance.  Eliminating >1TB of writes per week by moving the WAL off
> SSD is a pretty significant change, even though the burn rate isn't a simple
> linear thing--you won't burn the flash out in only 15 weeks.

Certainly, the Intel 320 is not designed for 1TB/week workloads.

> The production server is actually using the higher grade 710 drives that aim
> for 900TB instead.  But I do have standby servers using the low grade stuff,
> so anything I can do to decrease SSD burn rate without dropping performance
> is useful.  And only the top tier of transaction rates will outrun a RAID1
> pair of 15K drives dedicated to WAL.

The S3700 is rated for 10 drive writes/day for 5 years. So for a 200GB
drive, that's 200GB * 10/day * 365 days * 5 years = 3.65 million
gigabytes, or ~3.65 petabytes.

At 1TB/week that would take roughly 70 years to burn through, divided by
whatever you assume for write amplification, and whatever extra penalty
you apply if you are shooting for a >5 year duty cycle (flash degrades
faster the older it is) -- *for a single 200GB device*.  Write endurance
is not a problem for this drive; in fact it's a very reasonable
assumption that the faster worst-case random performance is directly
related to reduced write amplification.  BTW, cost/PB of this drive is
less than half that of the 710 (which IMO was obsolete the day the S3700
hit the street).
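
That arithmetic is easy to sketch out (a rough sketch; function names are mine, and the endurance ratings are the ones quoted in this thread rather than values verified against spec sheets):

```python
# Write-endurance arithmetic from the discussion above: convert a
# drive-writes-per-day rating into total rated write volume, then see
# how long a given WAL volume takes to consume it.

def rated_endurance_gb(capacity_gb: float, drive_writes_per_day: float,
                       years: float) -> float:
    """Total rated write endurance in GB from a drive-writes-per-day rating."""
    return capacity_gb * drive_writes_per_day * 365 * years

def years_to_wearout(endurance_gb: float, gb_per_week: float,
                     write_amplification: float = 1.0) -> float:
    """Years until rated endurance is consumed at a given weekly write volume."""
    weeks = endurance_gb / (gb_per_week * write_amplification)
    return weeks / 52.0

s3700_200 = rated_endurance_gb(200, 10, 5)
print(s3700_200)  # 3650000 GB, i.e. ~3.65 PB
# 1 TB of WAL per week, ignoring write amplification:
print(years_to_wearout(s3700_200, 1000))  # ~70 years
```

Plugging in a realistic write-amplification factor (or the 320's 15TB rating from upthread) shrinks that figure accordingly.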

merlin


Re: Reliability with RAID 10 SSD and Streaming Replication

From
Shaun Thomas
Date:
On 05/22/2013 02:51 PM, Merlin Moncure wrote:

> The S3700 is rated for 10 drive writes/day for 5 years. So for a 200GB
> drive, that's 200GB * 10/day * 365 days * 5 years = 3.65 million
> gigabytes, or ~3.65 petabytes.

Nice. And on that note:

http://www.tomshardware.com/reviews/ssd-dc-s3700-raid-0-benchmarks,3480.html

They actually over-saturated the backplane with 24 of these drives in a
giant RAID-0, tipping the scales at around 3.1M IOPS. Not bad for
consumer-level drives. I'd love to see a RAID-10 of these.

I'm having a hard time coming up with a database workload that would run
into performance problems with a (relatively inexpensive) setup like this.

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-676-8870
sthomas@optionshouse.com



Re: Reliability with RAID 10 SSD and Streaming Replication

From
Greg Smith
Date:
On 5/22/13 3:51 PM, Merlin Moncure wrote:
> The S3700 is rated for 10 drive writes/day for 5 years. So for a 200GB drive,
> that's 200GB * 10/day * 365 days * 5 years = 3.65 million gigabytes, or ~3.65 petabytes.

Yes, they've improved on the 1.5PB that the 710 drives topped out at.
For that particular drive, this is unlikely to be a problem.  But I'm
not willing to toss out longevity issues as therefore irrelevant in all
cases.  Some flash still costs a lot more than Intel's SSDs do, like the
FusionIO products.  Chop even a few percent of the wear off the price
tag of a RAMSAN and you've saved some real money.

And there are some other products with interesting
price/performance/capacity combinations that are also sensitive to
wearout.  Seagate's hybrid drives have turned interesting now that they
cache writes safely for example.  There's no cheaper way to get 1TB with
flash write speeds for small commits than that drive right now.  (Test
results on that drive coming soon, along with my full DC S3700 review)

> btw,  cost/pb of this drive is less than half of
> the 710 (which IMO was obsolete the day the s3700 hit the street).

You bet, and I haven't recommended anyone buy a 710 since the
announcement.  However, "hit the street" is still an issue.  No one has
been able to keep DC S3700 drives in stock very well yet.  It took me
three tries through Newegg before my S3700 drive actually shipped.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


Re: Reliability with RAID 10 SSD and Streaming Replication

From
Merlin Moncure
Date:
On Wed, May 22, 2013 at 3:06 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> You bet, and I haven't recommended anyone buy a 710 since the announcement.
> However, "hit the street" is still an issue.  No one has been able to keep
> DC S3700 drives in stock very well yet.  It took me three tries through
> Newegg before my S3700 drive actually shipped.

Well, let's look at the facts:
*) >2x the write endurance of the 710 (500x the 320)
*) 2-10x the performance, depending on workload specifics
*) much better worst-case/average latency
*) half the cost of the 710!?

After obsoleting hard drives with the introduction of the 320/710,
Intel managed to obsolete their *own* entire lineup with the S3700
(with the exception of the PCIe devices and the ultra-low-cost
notebook $1/GB segment).  I'm amazed these drives were sold at that
price point: they could have been sold at 3-4x the current price and
still have found a willing market (note, please don't do this).
Presumably most of the inventory is being bought up by small channel
resellers for a quick profit.

Even by the fast-moving standards of the SSD world this product is an
absolute game changer and has ushered in the new era of fast storage
with a loud 'gong'. Oh, the major vendors will still keep their
rip-off going a little longer, selling their storage trays, RAID
controllers, entry/mid-level SANs, SAS HBAs, etc. at huge markup to
customers who don't need them (some will still need them, but the bar
suddenly just got spectacularly raised before you have to look into
enterprise gear).  CRTs were overtaken by LCD monitors in mid-2004 in
terms of sales; I'd say SSDs are at the late 2002/early 2003 point, at
least for new deployments.

merlin


Re: Reliability with RAID 10 SSD and Streaming Replication

From
CSS
Date:
On May 22, 2013, at 4:06 PM, Greg Smith wrote:

> And there are some other products with interesting price/performance/capacity
> combinations that are also sensitive to wearout.  Seagate's hybrid drives have
> turned interesting now that they cache writes safely for example.  There's no
> cheaper way to get 1TB with flash write speeds for small commits than that
> drive right now.  (Test results on that drive coming soon, along with my full
> DC S3700 review)

I am really looking forward to that.  Will you announce here or just post on the 2ndQuadrant blog?

Another "hybrid" solution is to run ZFS on some decent hard drives and then
put the ZFS intent log on SSDs.  With very synthetic benchmarks, the random
write performance is excellent.

All of these discussions about alternate storage media are great - everyone
has different needs, and there are certainly a number of deployments that can
"get away" with spending much less money by adding some solid state storage.
There's really an amazing number of options today…

Thanks,

Charles

Re: Reliability with RAID 10 SSD and Streaming Replication

From
"Joshua D. Drake"
Date:
On 05/22/2013 01:57 PM, Merlin Moncure wrote:
>
> On Wed, May 22, 2013 at 3:06 PM, Greg Smith <greg@2ndquadrant.com> wrote:
>> You bet, and I haven't recommended anyone buy a 710 since the announcement.
>> However, "hit the street" is still an issue.  No one has been able to keep
>> DC S3700 drives in stock very well yet.  It took me three tries through
>> Newegg before my S3700 drive actually shipped.
>
> Well, let's look at the facts:
> *) >2x write endurance vs 710 (500x 320)
> *) 2-10x performance depending on workload specifics
> *) much better worst case/average latency
> *) half the cost of the 710!?

I am curious how the 710 or S3700 stacks up against the new M500 from
Crucial. I know Intel is kind of the go-to for these things, but the M500
is power-off protected and rated for an endurance of 72TB total bytes
written (TBW), equal to 40GB per day for 5 years.

Granted, it isn't the fastest pig in the poke, but it sure seems like a
very reasonable drive for the price:

http://www.newegg.com/Product/Product.aspx?Item=20-148-695&ParentOnly=1&IsVirtualParent=1
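
For what it's worth, the quoted rating is internally consistent. A quick check using only the figures from the post above (the function name is mine):

```python
# Convert the M500's quoted 40 GB/day-for-5-years endurance rating into
# total terabytes written; it matches the advertised 72 TB TBW within
# rounding.

def daily_rating_to_tb_written(gb_per_day: float, years: float) -> float:
    """Total TB written implied by a GB/day endurance rating."""
    return gb_per_day * 365 * years / 1000.0

print(daily_rating_to_tb_written(40, 5))  # 73.0 TB, vs. the advertised 72 TB
```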

Sincerely,

Joshua D. Drake

--
Command Prompt, Inc. - http://www.commandprompt.com/  509-416-6579
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc
For my dreams of your image that blossoms
    a rose in the deeps of my heart. - W.B. Yeats


Re: Reliability with RAID 10 SSD and Streaming Replication

From
Merlin Moncure
Date:
On Wed, May 22, 2013 at 5:42 PM, Joshua D. Drake <jd@commandprompt.com> wrote:
> I am curious how the 710 or S3700 stacks up against the new M500 from
> Crucial? I know Intel is kind of the goto for these things but the m500 is
> power off protected and rated at: Endurance: 72TB total bytes written (TBW),
> equal to 40GB per day for 5 years .

I don't think the m500 is power safe (nor is any drive at the <$1/GB
price point).  This drive is positioned as a desktop class disk drive.
AFAIK, the s3700 strongly outclasses all competitors on price,
performance, or both.  Once you give up the enterprise features of
endurance and iops you have many options (the samsung 840 is another one).
Pretty soon these types of drives are going to be standard kit in
workstations (and we'll be back to the IDE era of corrupted data,
ha!).  I would recommend none of them for server class use; they are
inferior in terms of $/iop and $/GB written.

for server class drives, see:
hitachi ssd400m (10$/gb, slower!)
kingston e100,
etc.

merlin


Re: Reliability with RAID 10 SSD and Streaming Replication

From
"Joshua D. Drake"
Date:
On 05/22/2013 04:37 PM, Merlin Moncure wrote:
>
> On Wed, May 22, 2013 at 5:42 PM, Joshua D. Drake <jd@commandprompt.com> wrote:
>> I am curious how the 710 or S3700 stacks up against the new M500 from
>> Crucial? I know Intel is kind of the goto for these things but the m500 is
>> power off protected and rated at: Endurance: 72TB total bytes written (TBW),
>> equal to 40GB per day for 5 years .
>
> I don't think the m500 is power safe (nor is any drive at the <1$/gb
> price point).

According to the data sheet it is power safe.

http://investors.micron.com/releasedetail.cfm?ReleaseID=732650
http://www.micron.com/products/solid-state-storage/client-ssd/m500-ssd

Sincerely,

JD




--
Command Prompt, Inc. - http://www.commandprompt.com/  509-416-6579
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc
For my dreams of your image that blossoms
    a rose in the deeps of my heart. - W.B. Yeats


Re: Reliability with RAID 10 SSD and Streaming Replication

From
Mark Kirkwood
Date:
On 23/05/13 13:01, Joshua D. Drake wrote:
>
> On 05/22/2013 04:37 PM, Merlin Moncure wrote:
>>
>> On Wed, May 22, 2013 at 5:42 PM, Joshua D. Drake
>> <jd@commandprompt.com> wrote:
>>> I am curious how the 710 or S3700 stacks up against the new M500 from
>>> Crucial? I know Intel is kind of the goto for these things but the
>>> m500 is
>>> power off protected and rated at: Endurance: 72TB total bytes
>>> written (TBW),
>>> equal to 40GB per day for 5 years .
>>
>> I don't think the m500 is power safe (nor is any drive at the <1$/gb
>> price point).
>
> According the the data sheet it is power safe.
>
> http://investors.micron.com/releasedetail.cfm?ReleaseID=732650
> http://www.micron.com/products/solid-state-storage/client-ssd/m500-ssd
>
>

Yeah - they apparently have a capacitor on board.

Their write endurance is where they don't compare so favorably to the
S3700 (they are *much* cheaper mind you):

- M500 120GB drive: 40GB per day for 5 years
- S3700 100GB drive: 1000GB per day for 5 years

But great to see a more reasonably priced SSD with power off protection.

Cheers

Mark


Re: Reliability with RAID 10 SSD and Streaming Replication

From
Greg Smith
Date:
On 5/22/13 6:42 PM, Joshua D. Drake wrote:
> I am curious how the 710 or S3700 stacks up against the new M500 from
> Crucial? I know Intel is kind of the goto for these things but the m500
> is power off protected and rated at: Endurance: 72TB total bytes written
> (TBW), equal to 40GB per day for 5 years .

The M500 is fine on paper; I have it on my list of things to evaluate
when I can.  The general reliability of Crucial's consumer SSDs has
looked good recently.  I'm not going to recommend that one until I
actually see one work as expected, though.  I'm waiting for one to pass
by, or until I reach a new toy purchasing spree.

What makes me step very carefully here is watching what Intel went
through when they released their first supercap drive, the 320 series.
If you look at the nastiest of the firmware bugs they had, like the
infamous "8MB bug", a lot of them were related to the new clean shutdown
feature.  It's the type of firmware that takes some exposure to the real
world to flush out the bugs.  The last of the enthusiast SSD players who
tried to take this job on was OCZ with the Vertex 3 Pro, and they never
got that model quite right before abandoning it altogether.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


Re: Reliability with RAID 10 SSD and Streaming Replication

From
Greg Smith
Date:
On 5/22/13 4:57 PM, Merlin Moncure wrote:
> Oh, the major vendors will still keep their
> rip-off going on a little longer selling their storage trays, raid
> controllers, entry/mid level SANS, SAS HBAs etc at huge markup to
> customers who don't need them (some will still need them, but the bar
> suddenly just got spectacularly raised before you have to look into
> enterprise gear).

The angle to distinguish "enterprise" hardware is moving on to
error-related capabilities.  Soon we'll see SAS drives with 520-byte
sectors and checksumming, for example.

And while SATA drives have advanced a long way, they haven't caught up
with SAS for failure handling.  It's still far too easy for a single
crazy SATA device to force crippling bus resets for example.  Individual
SATA ports don't expect to share things with others, while SAS chains
have a much better protocol for handling things.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


Re: Reliability with RAID 10 SSD and Streaming Replication

From
Mark Kirkwood
Date:
On 23/05/13 13:32, Mark Kirkwood wrote:
> On 23/05/13 13:01, Joshua D. Drake wrote:
>>
>> On 05/22/2013 04:37 PM, Merlin Moncure wrote:
>>>
>>> On Wed, May 22, 2013 at 5:42 PM, Joshua D. Drake
>>> <jd@commandprompt.com> wrote:
>>>> I am curious how the 710 or S3700 stacks up against the new M500 from
>>>> Crucial? I know Intel is kind of the goto for these things but the
>>>> m500 is
>>>> power off protected and rated at: Endurance: 72TB total bytes
>>>> written (TBW),
>>>> equal to 40GB per day for 5 years .
>>>
>>> I don't think the m500 is power safe (nor is any drive at the <1$/gb
>>> price point).
>>
>> According the the data sheet it is power safe.
>>
>> http://investors.micron.com/releasedetail.cfm?ReleaseID=732650
>> http://www.micron.com/products/solid-state-storage/client-ssd/m500-ssd
>>
>>
>
> Yeah - they apparently have a capacitor on board.
>

Make that quite a few capacitors (top right corner):

http://regmedia.co.uk/2013/05/07/m500_4.jpg


Re: Reliability with RAID 10 SSD and Streaming Replication

From
Merlin Moncure
Date:


On Wednesday, May 22, 2013, Joshua D. Drake <jd@commandprompt.com> wrote:
>
> On 05/22/2013 04:37 PM, Merlin Moncure wrote:
>>
>> On Wed, May 22, 2013 at 5:42 PM, Joshua D. Drake <jd@commandprompt.com> wrote:
>>>
>>> I am curious how the 710 or S3700 stacks up against the new M500 from
>>> Crucial? I know Intel is kind of the goto for these things but the m500 is
>>> power off protected and rated at: Endurance: 72TB total bytes written (TBW),
>>> equal to 40GB per day for 5 years .
>>
>> I don't think the m500 is power safe (nor is any drive at the <1$/gb
>> price point).
>
> According the the data sheet it is power safe.
>
> http://investors.micron.com/releasedetail.cfm?ReleaseID=732650
> http://www.micron.com/products/solid-state-storage/client-ssd/m500-ssd

Wow, that seems like a pretty good deal then assuming it works and performs decently.

merlin

Re: Reliability with RAID 10 SSD and Streaming Replication

From
Greg Smith
Date:
On 5/22/13 10:04 PM, Mark Kirkwood wrote:
> Make that quite a few capacitors (top right corner):
> http://regmedia.co.uk/2013/05/07/m500_4.jpg

There are some more shots and descriptions of the internals in the
excellent review at
http://techreport.com/review/24666/crucial-m500-ssd-reviewed

That also highlights the big problem with this drive that's kept me from
buying one so far:

"Unlike rivals Intel and Samsung, Crucial doesn't provide utility
software with a built-in health indicator. The M500's payload of SMART
attributes doesn't contain any references to flash wear or bytes
written, either. Several of the SMART attributes are labeled
"Vendor-specific," but you'll need to guess what they track and read the
associated values using third-party software."

That's a serious problem for most business use of this sort of drive.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


Re: Reliability with RAID 10 SSD and Streaming Replication

From
Mark Kirkwood
Date:
On 23/05/13 14:22, Greg Smith wrote:
> On 5/22/13 10:04 PM, Mark Kirkwood wrote:
>> Make that quite a few capacitors (top right corner):
>> http://regmedia.co.uk/2013/05/07/m500_4.jpg
>
> There are some more shots and descriptions of the internals in the
> excellent review at
> http://techreport.com/review/24666/crucial-m500-ssd-reviewed
>
> That also highlights the big problem with this drive that's kept me
> from buying one so far:
>
> "Unlike rivals Intel and Samsung, Crucial doesn't provide utility
> software with a built-in health indicator. The M500's payload of SMART
> attributes doesn't contain any references to flash wear or bytes
> written, either. Several of the SMART attributes are labeled
> "Vendor-specific," but you'll need to guess what they track and read
> the associated values using third-party software."
>
> That's a serious problem for most business use of this sort of drive.
>

Agreed - I was thinking the same thing!

Cheers

Mark


Re: Reliability with RAID 10 SSD and Streaming Replication

From
Mark Kirkwood
Date:
On 23/05/13 14:26, Mark Kirkwood wrote:
> On 23/05/13 14:22, Greg Smith wrote:
>> On 5/22/13 10:04 PM, Mark Kirkwood wrote:
>>> Make that quite a few capacitors (top right corner):
>>> http://regmedia.co.uk/2013/05/07/m500_4.jpg
>>
>> There are some more shots and descriptions of the internals in the
>> excellent review at
>> http://techreport.com/review/24666/crucial-m500-ssd-reviewed
>>
>> That also highlights the big problem with this drive that's kept me
>> from buying one so far:
>>
>> "Unlike rivals Intel and Samsung, Crucial doesn't provide utility
>> software with a built-in health indicator. The M500's payload of
>> SMART attributes doesn't contain any references to flash wear or
>> bytes written, either. Several of the SMART attributes are labeled
>> "Vendor-specific," but you'll need to guess what they track and read
>> the associated values using third-party software."
>>
>> That's a serious problem for most business use of this sort of drive.
>>
>
> Agreed - I was thinking the same thing!
>
>

Having said that, there does seem to be a wear leveling counter in its
SMART attributes - but, yes, I'd like to see indicators closer to the
level of detail that Intel provides.

Cheers

Mark



Re: Reliability with RAID 10 SSD and Streaming Replication

From
"Joshua D. Drake"
Date:
On 05/22/2013 07:17 PM, Merlin Moncure wrote:

>  > According the the data sheet it is power safe.
>  >
>  > http://investors.micron.com/releasedetail.cfm?ReleaseID=732650
>  > http://www.micron.com/products/solid-state-storage/client-ssd/m500-ssd
>
> Wow, that seems like a pretty good deal then assuming it works and
> performs decently.

Yeah, that was my thinking. Sure, it isn't an S3700, but for the money
it is still faster than the comparable spindle configuration.

JD

>
> merlin



Re: Reliability with RAID 10 SSD and Streaming Replication

From
Andrea Suisani
Date:
On 05/22/2013 03:30 PM, Merlin Moncure wrote:
> On Tue, May 21, 2013 at 7:19 PM, Greg Smith <greg@2ndquadrant.com> wrote:
>> On 5/20/13 6:32 PM, Merlin Moncure wrote:

[cut]

>> The only really huge gain to be had using SSD is commit rate at a low client
>> count.  There you can easily do 5,000/second instead of a spinning disk that
>> is closer to 100, for less than what the battery-backed RAID card along
>> costs to speed up mechanical drives.  My test server's 100GB DC S3700 was
>> $250.  That's still not two orders of magnitude faster though.
>
> That's most certainly *not* the only gain to be had: random read rates
> of large databases (a very important metric for data analysis) can
> easily hit 20k tps.  So I'll stand by the figure. Another point: that
> 5000k commit raid is sustained, whereas a raid card will spectacularly
> degrade until the cache overflows; it's not fair to compare burst with
> sustained performance.  To hit 5000k sustained commit rate along with
> good random read performance, you'd need a very expensive storage
> system.   Right now I'm working (not by choice) with a teir-1 storage
> system (let's just say it rhymes with 'weefax') and I would trade it
> for direct attached SSD in a heartbeat.
>
> Also, note that 3rd party benchmarking is showing the 3700 completely
> smoking the 710 in database workloads (for example, see
> http://www.anandtech.com/show/6433/intel-ssd-dc-s3700-200gb-review/6).

[cut]

Sorry for interrupting, but on a related note I would like to know your
opinions on what the AnandTech review said about the S3700's poor
performance on "Oracle Swingbench", quoting the relevant part that you
can find here (*):

<quote>

[..] There are two components to the Swingbench test we're running here:
the database itself, and the redo log. The redo log stores all changes that
are made to the database, which allows the database to be reconstructed in
the event of a failure. In good DB design, these two would exist on separate
storage systems, but in order to increase IO we combined them both for this test.
Accesses to the DB end up being 8KB and random in nature, a definite strong suit
of the S3700 as we've already shown. The redo log however consists of a bunch
of 1KB - 1.5KB, QD1, sequential accesses. The S3700, like many of the newer
controllers we've tested, isn't optimized for low queue depth, sub-4KB, sequential
workloads like this. [..]

</quote>

Does this kind of scenario apply to PostgreSQL's WAL files?

Thanks
andrea


(*) http://www.anandtech.com/show/6433/intel-ssd-dc-s3700-200gb-review/5


Re: Reliability with RAID 10 SSD and Streaming Replication

From
Merlin Moncure
Date:
On Thu, May 23, 2013 at 1:56 AM, Andrea Suisani <sickpig@opinioni.net> wrote:
> On 05/22/2013 03:30 PM, Merlin Moncure wrote:
>>
>> On Tue, May 21, 2013 at 7:19 PM, Greg Smith <greg@2ndquadrant.com> wrote:
>>>
>>> On 5/20/13 6:32 PM, Merlin Moncure wrote:
>
>
> [cut]
>
>
>>> The only really huge gain to be had using SSD is commit rate at a low
>>> client
>>> count.  There you can easily do 5,000/second instead of a spinning disk
>>> that
>>> is closer to 100, for less than what the battery-backed RAID card along
>>> costs to speed up mechanical drives.  My test server's 100GB DC S3700 was
>>> $250.  That's still not two orders of magnitude faster though.
>>
>>
>> That's most certainly *not* the only gain to be had: random read rates
>> of large databases (a very important metric for data analysis) can
>> easily hit 20k tps.  So I'll stand by the figure. Another point: that
>> 5000k commit raid is sustained, whereas a raid card will spectacularly
>> degrade until the cache overflows; it's not fair to compare burst with
>> sustained performance.  To hit 5000k sustained commit rate along with
>> good random read performance, you'd need a very expensive storage
>> system.   Right now I'm working (not by choice) with a teir-1 storage
>> system (let's just say it rhymes with 'weefax') and I would trade it
>> for direct attached SSD in a heartbeat.
>>
>> Also, note that 3rd party benchmarking is showing the 3700 completely
>> smoking the 710 in database workloads (for example, see
>> http://www.anandtech.com/show/6433/intel-ssd-dc-s3700-200gb-review/6).
>
>
> [cut]
>
> Sorry for interrupting but on a related note I would like to know your
> opinions on what the anandtech review said about 3700 poor performance
> on "Oracle Swingbench", quoting the relevant part that you can find here (*)
>
> <quote>
>
> [..] There are two components to the Swingbench test we're running here:
> the database itself, and the redo log. The redo log stores all changes that
> are made to the database, which allows the database to be reconstructed in
> the event of a failure. In good DB design, these two would exist on separate
> storage systems, but in order to increase IO we combined them both for this
> test.
> Accesses to the DB end up being 8KB and random in nature, a definite strong
> suit
> of the S3700 as we've already shown. The redo log however consists of a
> bunch
> of 1KB - 1.5KB, QD1, sequential accesses. The S3700, like many of the newer
> controllers we've tested, isn't optimized for low queue depth, sub-4KB,
> sequential
> workloads like this. [..]
>
> </quote>
>
> Does this kind of scenario apply to postgresql wal files repo ?

huh -- I don't think so.  wal file segments are 8kb aligned, ditto
clog, etc.  In XLogWrite():

  /* OK to write the page(s) */
  from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
  nbytes = npages * (Size) XLOG_BLCKSZ;   /* <-- always whole WAL pages */
  errno = 0;
  if (write(openLogFile, from, nbytes) != nbytes)
  {

AFAICT, that's the only way xlog gets written out.  One thing I would
definitely advise, though, is to disable full page writes (the
full_page_writes setting) if it's enabled.  The s3700 is aligned on 8kb
blocks internally -- hm.

merlin


Re: Reliability with RAID 10 SSD and Streaming Replication

From
Andrea Suisani
Date:
On 05/23/2013 03:47 PM, Merlin Moncure wrote:

[cut]

>> <quote>
>>
>> [..] There are two components to the Swingbench test we're running here:
>> the database itself, and the redo log. The redo log stores all changes that
>> are made to the database, which allows the database to be reconstructed in
>> the event of a failure. In good DB design, these two would exist on separate
>> storage systems, but in order to increase IO we combined them both for this
>> test.
>> Accesses to the DB end up being 8KB and random in nature, a definite strong
>> suit
>> of the S3700 as we've already shown. The redo log however consists of a
>> bunch
>> of 1KB - 1.5KB, QD1, sequential accesses. The S3700, like many of the newer
>> controllers we've tested, isn't optimized for low queue depth, sub-4KB,
>> sequential
>> workloads like this. [..]
>>
>> </quote>
>>
>> Does this kind of scenario apply to postgresql wal files repo ?
>
> huh -- I don't think so.  wal file segments are 8kb aligned, ditto
> clog, etc.  In XLogWrite():
>
>    /* OK to write the page(s) */
>    from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
>    nbytes = npages * (Size) XLOG_BLCKSZ;  <--
>    errno = 0;
>    if (write(openLogFile, from, nbytes) != nbytes)
>    {
>
> AFICT, that's the only way you write out xlog.  One thing I would
> definitely advise though is to disable partial page writes if it's
> enabled.   s3700 is algined on 8kb blocks internally -- hm.

many thanks merlin for both the explanation and the good advice :)

andrea




Re: Reliability with RAID 10 SSD and Streaming Replication

From
Greg Smith
Date:
On 5/22/13 2:45 PM, Shaun Thomas wrote:
> That read rate and that throughput suggest 8k reads. The queue size is
> 270+, which is pretty high for a single device, even when it's an SSD.
> Some SSDs seem to break down on queue sizes over 4, and 15 sectors
> spread across a read queue of 270 is pretty hash. The drive tested here
> basically fell over on servicing a huge diverse read queue, which
> suggests a firmware issue.

That's basically it.  I don't know that I'd put the blame specifically
onto a firmware issue without further evidence that's the case though.
The last time I chased down a SSD performance issue like this it ended
up being a Linux scheduler bug.  One thing I plan to do for future SSD
tests is to try and replicate this issue better, starting by increasing
the number of clients to at least 300.

Related:  if anyone read my "Seeking PostgreSQL" talk last year, some of
my Intel 320 results there were understating the drive's worst-case
performance due to a testing setup error.  I have a blog entry talking
about what was wrong and how it slipped past me at
http://highperfpostgres.com/2013/05/seeking-revisited-intel-320-series-and-ncq/

With that loose end sorted, I'll be kicking off a brand new round of SSD
tests on a 24 core server here soon.  All those will appear on my blog.
  The 320 drive is returning as the bang for buck champ, along with a DC
S3700 and a Seagate 1TB Hybrid drive with NAND durable write cache.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com