Thread: Reliability with RAID 10 SSD and Streaming Replication
Hi all,
Our application is write-heavy, and I/O utilisation has been a problem for us for a while. We've decided to use a RAID 10 of 4x500GB Samsung 840 Pro drives for the master server. I'm aware of the write-cache issue on SSDs in case of power loss. However, our hosting provider doesn't offer any SSD models with a supercapacitor. To minimise risk, we will also set up another server with a SAS RAID 10 as a streaming-replication standby. For our application, a few seconds of data loss is acceptable.
My question is, would corrupted data files on the primary server affect the streaming standby? In other words, is this setup acceptable in terms of minimising the deficiencies of these SSDs?
Cheers,
Cuong
On Thu, May 16, 2013 at 9:46 AM, Cuong Hoang <climbingrose@gmail.com> wrote:
> My question is, would corrupted data files on the primary server affect
> the streaming standby? In other words, is this setup acceptable in terms
> of minimising the deficiencies of these SSDs?

Data corruption caused by a sudden power event on the master will not cross over. Basically, with this configuration you must switch over to the standby in that case. Corruption caused by other issues, say a faulty drive, will transfer over, however. The block checksum feature of 9.3 is a strategy to reduce the risk of that class of issue.

merlin
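[A minimal sketch of the 9.3 checksum feature Merlin mentions; checksums can only be enabled when the cluster is created, and the data directory path below is just a placeholder.]

# Enable block checksums at initdb time (9.3+); they cannot be turned on later.
initdb --data-checksums -D /var/lib/postgresql/9.3/main
# pg_controldata on that directory should then report a non-zero
# "Data page checksum version".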
On Thu, May 16, 2013 at 7:46 AM, Cuong Hoang <climbingrose@gmail.com> wrote:
> Our application is heavy write and IO utilisation has been the problem
> for us for a while. We've decided to use RAID 10 of 4x500GB Samsung 840
> Pro for the master server. [...] To minimise risk, we will also set up
> another RAID 10 SAS in streaming replication mode. For our application,
> a few seconds of data loss is acceptable.
>
> My question is, would corrupted data files on the primary server affect
> the streaming standby?
That seems rather scary to me for two reasons.
If the data center has a sudden power failure, why would it not take out both machines either simultaneously or in short succession? Can you verify that the hosting provider does not have them on the same UPS (or even worse, as two virtual machines on the same physical host)?
The other issue is that you'd have to make sure the master does not restart after a crash. If your init.d scripts just blindly start PostgreSQL, then after a sudden OS restart it will automatically enter recovery and then open as usual, even though it might be silently corrupt. At that point it will be generating WAL based on corrupt data (and returning incorrect query results), and propagating that to the standby. So you have to be paranoid: if the master ever crashes, it gets shot in the head and then reconstructed from the standby.
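[A rough sketch of keeping a crashed master from starting itself; the exact commands depend on the init system, and the service names below are assumptions for Debian- and RHEL-style setups.]

# Keep PostgreSQL out of the normal boot sequence so a crashed master
# cannot silently recover and resume generating WAL on its own:
update-rc.d postgresql disable      # Debian/Ubuntu sysvinit
# chkconfig postgresql off          # RHEL/CentOS equivalent (service name may differ)

# Start it manually, only after deciding the data directory is trustworthy:
service postgresql start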
Cheers,
Jeff
On Thu, May 16, 2013 at 1:34 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> If the data center has a sudden power failure, why would it not take out
> both machines either simultaneously or in short succession? Can you verify
> that the hosting provider does not have them on the same UPS (or even worse,
> as two virtual machines on the same physical host)?

I took it to mean that his standby's "RAID 10 SAS" meant a disk-drive-based standby. Agreed that the server should not be configured to autostart through init.d.

merlin
On Thu, May 16, 2013 at 11:46 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> I took it to mean that his standby's "RAID 10 SAS" meant a
> disk-drive-based standby.
I had not considered that. If the master can't keep up with IO using disk drives, wouldn't a replica using them probably fall infinitely far behind trying to keep up with the workload?
Maybe the best choice would just be to stick with the current set-up (one server, spinning rust) and turn off synchronous_commit, since he is already willing to take the loss of a few seconds of transactions.
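[A minimal postgresql.conf sketch of that trade-off; the values are illustrative, not tuned for this workload.]

# Trade a bounded window of recently committed transactions for much cheaper
# commit I/O; crash recovery still produces a consistent database.
synchronous_commit = off
#wal_writer_delay = 200ms   # default; bounds how much unflushed commit data can accumulate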
Cheers,
Jeff
Thank you for your advice, guys. We'll definitely disable the init.d script for PostgreSQL on the master. The standby host will be disk-based, so it will be less vulnerable to power loss.
I forgot to mention that we'll set up WAL-E to ship base backups and WAL continuously to Amazon S3 as another safety measure. Again, the loss of a few WAL segments would not be a big issue for us.
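[Roughly what the WAL-E hookup looks like; the envdir directory, paths, and the environment variables are assumptions taken from WAL-E's usual setup, not from this thread.]

# master postgresql.conf: ship every completed WAL segment to S3
wal_level = hot_standby         # a streaming standby is also in use
archive_mode = on
archive_command = 'envdir /etc/wal-e.d/env wal-e wal-push %p'
# /etc/wal-e.d/env would hold WALE_S3_PREFIX, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY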
Do you think that this setup will be acceptable for our purposes?
Thanks,
Cuong
Hi,

On 16.5.2013 16:46, Cuong Hoang wrote:
> Our application is heavy write and IO utilisation has been the problem
> for us for a while. We've decided to use RAID 10 of 4x500GB Samsung 840
> Pro for the master server.

What does "heavy write" mean in your case? Does that mean a lot of small transactions or a few large ones? What have you done to tune the server?

> To minimise risk, we will also set up another RAID 10 SAS in streaming
> replication mode. For our application, a few seconds of data loss is
> acceptable.

Streaming replication allows zero data loss if used in synchronous mode.

> My question is, would corrupted data files on the primary server affect
> the streaming standby? In other words, is this setup acceptable in terms
> of minimising the deficiencies of these SSDs?

It should be.

Have you considered using a UPS? That would make the SSDs about as reliable as SATA/SAS drives - the UPS may fail, but so may a BBU unit on the SAS controller.

Tomas
Hi Tomas,
We have a lot of small updates and some inserts. The database size is about 35GB including indexes and TOAST, and we expect it to keep growing to about 200GB. We usually have a burst of about 500k writes in about 5-10 minutes, which basically cripples I/O on the current servers. I've tried increasing checkpoint_segments, checkpoint_timeout etc. as recommended in the "PostgreSQL 9.0 High Performance" book. However, it seems like our server just can't handle the current load.
Here are the server specs:
Dual E5620, 32GB RAM, 4x1TB SAS 15k in RAID10
Here are some core PostgreSQL configs:
shared_buffers = 2GB # min 128kB
work_mem = 64MB # min 64kB
maintenance_work_mem = 1GB # min 1MB
wal_buffers = 16MB
checkpoint_segments = 128
checkpoint_timeout = 30min
checkpoint_completion_target = 0.7
Thanks,
Cuong
On 17/05/13 12:06, Tomas Vondra wrote:
> Streaming replication allows zero data loss if used in synchronous mode.

I'm not sure synchronous replication is really an option here, as it will slow the master down to spinning-disk I/O speeds unless the standby is configured with SSDs as well - which probably defeats the purpose of this setup.

On the other hand, if the system is so loaded that a pure SAS (spinning drive) solution can't keep up, then the standby lag may get to be way more than a few seconds... which means look out for huge data loss.

I'd be inclined to apply more leverage to the hosting provider to source SSDs suitable for your needs, or change hosting providers.

Regards

Mark
On Thu, May 16, 2013 at 7:46 AM, Cuong Hoang <climbingrose@gmail.com> wrote:
> For our application, a few seconds of data loss is acceptable.

If a few seconds of data loss is acceptable, I would seriously look at the synchronous_commit setting and think about turning that off rather than risk silent corruption with non-enterprise SSDs.

http://www.postgresql.org/docs/9.2/interactive/runtime-config-wal.html#GUC-SYNCHRONOUS-COMMIT

"Unlike fsync, setting this parameter to off does not create any risk of database inconsistency: an operating system or database crash might result in some recent allegedly-committed transactions being lost, but the database state will be just the same as if those transactions had been aborted cleanly. So, turning synchronous_commit off can be a useful alternative when performance is more important than exact certainty about the durability of a transaction."

With a default wal_writer_delay setting of 200ms, you will only be at risk of losing at most 600ms of transactions in the event of an unexpected crash or power loss, but write performance should go up a huge amount, especially if they are a lot of small writes as you describe.

-Dave
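[Worth noting alongside David's suggestion: synchronous_commit can also be relaxed per session or per transaction rather than globally, so only the high-volume writes give up durability. A sketch; the table name is made up for illustration.]

-- Only this transaction's commit may be lost on a crash;
-- everything else keeps the cluster-wide durability setting.
BEGIN;
SET LOCAL synchronous_commit TO off;
UPDATE page_hits SET hit_count = hit_count + 1 WHERE page_id = 42;  -- hypothetical table
COMMIT;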
On Fri, May 17, 2013 at 1:34 AM, David Rees <drees76@gmail.com> wrote:
> If a few seconds of data loss is acceptable, I would seriously look at
> the synchronous_commit setting and think about turning that off rather
> than risk silent corruption with non-enterprise SSDs.

That is not going to help. Since the drives lie about fsync, upon a power event you must assume the database is corrupt. I think his proposed configuration is the best bet (although I would strongly consider putting SSD on the standby as well).

Personally, I think non-SSD drives are obsolete for database purposes and will not recommend them for any configuration. Ideally though, the OP would be using the S3700 and we wouldn't be having this conversation.

merlin
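[One way to sanity-check whether a drive honours fsync is pg_test_fsync from contrib; implausibly high rates (thousands of flushes per second from a consumer SATA drive) usually mean the write cache is lying. The path below is a placeholder.]

# Run against a file on the device that will hold pg_xlog:
pg_test_fsync -f /ssd/pg_xlog/fsync_test.out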
On Fri, May 17, 2013 at 8:17 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> If a few seconds of data loss is acceptable, I would seriously look at
>> the synchronous_commit setting and think about turning that off rather
>> than risk silent corruption with non-enterprise SSDs.
>
> That is not going to help.

Whoops -- misread your post, heh (you were suggesting to use classic hard drives). Yeah, that might work, but it only buys you so much, particularly if there is a lot of random activity in the heap.

merlin
On 17.5.2013 03:34, Mark Kirkwood wrote:
> I'm not sure synchronous replication is really an option here, as it will
> slow the master down to spinning-disk I/O speeds unless the standby is
> configured with SSDs as well - which probably defeats the purpose of
> this setup.

The master waits for reception of the data, not for writing them to the disks. The standby will have to write them eventually (and that might cause issues), but I'm not really sure it's that simple.

> On the other hand, if the system is so loaded that a pure SAS (spinning
> drive) solution can't keep up, then the standby lag may get to be way
> more than a few seconds... which means look out for huge data loss.

Don't forget the slave does not perform all the I/O (searching for the row etc.). It's difficult to say how much this will save, though.

Tomas
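[For reference, the synchronous mode being discussed is just configuration; a minimal sketch, with the host name and application name invented for illustration.]

# master postgresql.conf
synchronous_standby_names = 'sas_standby'   # matches the standby's application_name
#synchronous_commit = remote_write          # 9.2+: wait for the standby to receive,
                                            # not flush, each commit

# standby recovery.conf
standby_mode = 'on'
primary_conninfo = 'host=master.example.com application_name=sas_standby'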
Do you really need a running standby for fast failover? What about doing plain WAL archiving? I'd definitely consider that, because even if you set up a SAS-based replica, you can't use it for production as it does not handle the load.

I think you could set up WAL archiving and in case of a crash just use the base backup and replay the WAL from the archive. This means the SAS-based system is purely for WAL archiving, i.e. it performs only sequential writes, which should not be a big deal.

The recovery will be performed on the SSD system, which should handle it fine. If you need faster recovery, you may perform it incrementally on the SAS system (it will take some time, but it won't influence the master). You might do that daily or something like that.

The only problem with this is that it is file based, and could mean lag (up to 16MB or archive_timeout). But this should not be a problem if you place the WAL on SAS drives with a controller. If you use RAID, you should be perfectly fine.

So this is what I'd suggest:

1) use SSD for data files, SAS RAID1 for WAL on the master
2) set up WAL archiving (base backup + archive on the SAS system)
3) update the base backup daily (incremental recovery)
4) in case of crash, keep WAL from the archive and pg_xlog on the SAS
   RAID (on the master)

Tomas
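[A rough sketch of the archiving Tomas describes; the archive host name and paths are invented for illustration, and rsync is just one way to do the copy.]

# master postgresql.conf
archive_mode = on
archive_command = 'rsync -a %p sas-archive:/wal_archive/%f'
archive_timeout = 60        # bound the lag from a partially filled segment

# daily base backup onto the SAS box (run from the archive host):
pg_basebackup -h master.example.com -D /base_backup/$(date +%Y%m%d)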
Thanks for the suggestion, Tomas. We're about to set up WAL backups to Amazon S3, which I think should cover all of our bases. At least for the moment, the SAS-based standby seems to keep up with the master, because replication is its sole purpose - we're not sending queries to the hot standby. We'll also consider dropping the hot standby in favour of plain WAL archiving as you suggested, but for now I guess we should stick with streaming replication because the slave is still keeping up with the master.
Btw, after switching to SSD, performance improves vastly. IO utilisation drops from 100% to 6% in peak periods. That's an order of magnitude faster!
Cheers,
Cuong
On 5/16/13 8:06 PM, Tomas Vondra wrote:
> Have you considered using a UPS? That would make the SSDs about as
> reliable as SATA/SAS drives - the UPS may fail, but so may a BBU unit on
> the SAS controller.

That's not true at all. Any decent RAID controller will have an option to stop write-back caching when the battery is bad. Things will slow badly when that happens, but there is zero data risk from a short-term BBU failure. The only serious risk with a good BBU setup is that you'll have a power failure lasting so long that the battery runs down before the cache can be flushed to disk.

--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
On 5/16/13 7:52 PM, Cuong Hoang wrote:
> The standby host will be disk-based so it will be less vulnerable to
> power loss.

If it can keep up with replay from the faster master, that sounds like a decent backup. Make sure you set up all write caches very carefully on that system, because it's going to be your best hope to come back up quickly after a real crash.

Any vendor that pushes Samsung 840 drives for database use should be ashamed of themselves. Those drives are turning into the new incarnation of what we saw with the Intel X25-E/X25-M: they're very popular, but any system built with them will corrupt itself on the first failure. I expect a new spike in people needing data recovery help, after losing their Samsung 840 based servers, to start soon.

> I forgot to mention that we'll set up WAL-E
> <https://github.com/wal-e/wal-e> to ship base backups and WAL
> continuously to Amazon S3 as another safety measure.

That's a useful plan. Just make sure you ship new base backups fairly often. If you have to fall back to that copy of the data, you'll need to replay everything that's happened since the last base backup. That can easily result in a week of downtime if you're only shipping backups once per month, for example.

--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
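[Greg's point about base backup frequency can be as simple as a cron entry for WAL-E's backup-push; the schedule, user, and data directory below are placeholders.]

# /etc/cron.d/wal-e -- push a fresh base backup nightly so recovery never
# has to replay more than roughly a day of archived WAL.
0 2 * * * postgres envdir /etc/wal-e.d/env wal-e backup-push /var/lib/postgresql/9.2/main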
On 20.5.2013 05:00, Greg Smith wrote:
> That's not true at all. Any decent RAID controller will have an option
> to stop write-back caching when the battery is bad.

That's true, no doubt about that. What I was trying to say is that a controller with BBU (or an SSD with proper write cache protection) is about as safe as a UPS when it comes to power outages. Assuming both are properly configured / watched / checked. Sure, there are scenarios where a UPS is not going to help (e.g. a PSU failure), so a controller with BBU is better from this point of view. I've seen crashes with both options (BBU / UPS), both because of misconfiguration and because of hardware issues.

BTW I don't know what controller we are talking about here - it might be as crappy as the SSD drives. What I was thinking about in this case is using two SSD-based systems with UPSes. That'd allow fast failover (which may not be possible with the SAS-based replica, as it does not handle the load).

But yes, I do agree that the provider should be ashamed for not providing reliable SSDs in the first place. Getting reliable SSDs should be the first option - all these suggestions are really just workarounds of this rather simple issue.

Tomas
On Mon, May 20, 2013 at 3:57 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
> But yes, I do agree that the provider should be ashamed for not
> providing reliable SSDs in the first place. Getting reliable SSDs should
> be the first option - all these suggestions are really just workarounds
> of this rather simple issue.

Absolutely. Reliable SSD should be the first and only option. They are significantly more expensive (more than 2x) but are worth it. When it comes to databases, particularly in the open source postgres world, hard drives are completely obsolete. SSDs are a couple of orders of magnitude faster, and this (while still slow in computer terms) is fast enough to put storage into the modern era for anyone who is smart enough to connect a SATA cable. While everyone likes to obsess over super scalable architectures, technology has finally advanced to the point where your typical SMB system can be handled by a single device.

merlin
On 5/20/13 6:32 PM, Merlin Moncure wrote:
> When it comes to databases, particularly in the open source postgres
> world, hard drives are completely obsolete. SSDs are a couple of
> orders of magnitude faster, and this (while still slow in computer
> terms) is fast enough to put storage into the modern era for anyone
> who is smart enough to connect a SATA cable.

You're skirting the edge of vendor Kool-Aid here. I'm working on a very detailed benchmark vs. real world piece centered on Intel's 710 models, one of the few reliable drives on the market. (Yes, I have a DC S3700 too, just not as much data yet.) While in theory these drives will hit a two orders of magnitude speed improvement, and I have benchmarks where that's the case, in practice I've seen them deliver less than 5X better too. You get one guess which I'd consider more likely to happen on a difficult database server workload.

The only really huge gain to be had using SSD is commit rate at a low client count. There you can easily do 5,000/second instead of a spinning disk that is closer to 100, for less than what the battery-backed RAID card alone costs to speed up mechanical drives. My test server's 100GB DC S3700 was $250. That's still not two orders of magnitude faster though.

--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
On Tue, May 21, 2013 at 7:19 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> The only really huge gain to be had using SSD is commit rate at a low
> client count. There you can easily do 5,000/second instead of a spinning
> disk that is closer to 100, for less than what the battery-backed RAID
> card alone costs to speed up mechanical drives. My test server's 100GB
> DC S3700 was $250. That's still not two orders of magnitude faster though.

That's most certainly *not* the only gain to be had: random read rates on large databases (a very important metric for data analysis) can easily hit 20k tps. So I'll stand by the figure. Another point: that 5,000/second commit rate is sustained, whereas a RAID card will spectacularly degrade once the cache overflows; it's not fair to compare burst with sustained performance. To hit a 5,000/second sustained commit rate along with good random read performance, you'd need a very expensive storage system. Right now I'm working (not by choice) with a tier-1 storage system (let's just say it rhymes with 'weefax') and I would trade it for direct attached SSD in a heartbeat.

Also, note that 3rd party benchmarking shows the S3700 completely smoking the 710 in database workloads (for example, see http://www.anandtech.com/show/6433/intel-ssd-dc-s3700-200gb-review/6).

Anyways, SSD installation in the post-capacitor era has been 100.0% correlated in my experience (admittedly, around a dozen or so systems) with removal of storage as the primary performance bottleneck, and I'll stand by that. I'm not claiming to work with extremely high transaction rate systems, but then again neither are most of the people reading this list. Disk drives are obsolete for database installations.

merlin
On 5/22/13 9:30 AM, Merlin Moncure wrote:
> That's most certainly *not* the only gain to be had: random read rates
> on large databases (a very important metric for data analysis) can
> easily hit 20k tps. So I'll stand by the figure.

They can easily hit that number. Or they can do this:

Device:  r/s      w/s    rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
sdd      2702.80  19.40  19.67  0.16   14.91     273.68    71.74  0.37   100.00
sdd      2707.60  13.00  19.53  0.10   14.78     276.61    90.34  0.37   100.00

That's an Intel 710 being crushed by a random read database server workload, unable to deliver even 3000 IOPS / 20MB/s. I have hours of data like this from several servers. Yes, the DC S3700 drives are at least twice as fast on average, but I haven't had one for long enough to see what its worst case really looks like yet.

Here's a mechanical drive hitting its limits on the same server as the above:

Device:  r/s     w/s     rMB/s  wMB/s  avgrq-sz  avgqu-sz  await   svctm  %util
sdb      100.80  220.60  1.06   1.79   18.16     228.78    724.11  3.11   100.00
sdb      119.20  220.40  1.09   1.77   17.22     228.36    677.46  2.94   100.00

Giving around 3MB/s. I am quite happy saying the SSD is delivering about a single order of magnitude improvement, in both throughput and latency. But that's it, and a single order of magnitude improvement is sometimes not good enough to solve all storage issues.

If all you care about is speed, the main situation where I've found there to still be value in "tier 1 storage" is extremely write-heavy workloads. The best write numbers I've seen out of Postgres are still going into a monster EMC unit, simply because the unit I was working with had 16GB of durable cache. Yes, that only supports burst speeds, but 16GB absorbs a whole lot of writes before it fills. Write re-ordering and combining can accelerate traditional disk quite a bit when it's across a really large horizon like that.

> Anyways, SSD installation in the post-capacitor era has been 100.0%
> correlated in my experience (admittedly, around a dozen or so systems)
> with removal of storage as the primary performance bottleneck, and
> I'll stand by that.

I wish it were that easy for everyone, but that's simply not true. Are there lots of systems where SSD makes storage look almost free it's so fast? Sure. But presuming all systems will look like that is optimistic, and it sets unreasonable expectations.

--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
On Wed, May 22, 2013 at 9:18 AM, Greg Smith <greg@2ndquadrant.com> wrote:
> They can easily hit that number. Or they can do this:
>
> Device:  r/s      w/s    rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
> sdd      2702.80  19.40  19.67  0.16   14.91     273.68    71.74  0.37   100.00
> sdd      2707.60  13.00  19.53  0.10   14.78     276.61    90.34  0.37   100.00

Yup -- I've seen this too... the high transaction rates quickly fall over when there is concurrent writing (but for bulk 100% read OLAP queries I see the higher figure more often than not). Even so, it's a huge difference over 100.

Unfortunately, I don't have an S3700 to test with, but based on everything I've seen it looks like it's a mostly solved problem (for example, see here: http://www.storagereview.com/intel_ssd_dc_s3700_series_enterprise_ssd_review). Tests that drive the 710 to <3k iops were not able to take the S3700 down under 10k at any queue depth. Take a good look at the 8k preconditioning curve latency chart -- everything you need to know is right there; it's a completely different controller and offers much better worst case performance.

merlin
On 5/22/13 11:05 AM, Merlin Moncure wrote:
> Unfortunately, I don't have an S3700 to test with, but based on
> everything I've seen it looks like it's a mostly solved problem (for
> example, see here:
> http://www.storagereview.com/intel_ssd_dc_s3700_series_enterprise_ssd_review).
> Tests that drive the 710 to <3k iops were not able to take the S3700
> down under 10k at any queue depth.

I have two weeks of real-world data from DC S3700 units in production and a pile of synthetic test results. The S3700 drives are at least 2X as fast as the 710 models, and there are synthetic tests where it's closer to 10X. On a 5,000 IOPS workload that crushed a pair of 710 units, the new drives are only hitting 50% utilization now. Does that make the worst case 10K? Maybe. I can't just extrapolate from the 50% figures and predict the throughput I'll see at 100% though, so I'm still waiting for more data before I feel comfortable saying exactly what the worst case looks like.

--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
On 05/22/2013 08:30 AM, Merlin Moncure wrote:
> I'm not claiming to work with extremely high transaction rate systems,
> but then again neither are most of the people reading this list.
> Disk drives are obsolete for database installations.

Well, you may not be able to make that claim, but I can. While we don't use Intel SSDs, our first-gen FusionIO cards can deliver about 20k PostgreSQL TPS of our real-world data right off the device before caching effects start boosting the numbers. These days, devices like this make our current batch look like rusty old hulks in comparison, so the gap is just widening. Hard drives stand no chance at all. An 8-drive 15k RPM RAID-10 gave us about 1800 TPS back when we switched to FusionIO about two years ago.

So, while Intel drives themselves may not be able to hit sustained 100x speeds over spindles, it's pretty clear that that's a firmware or implementation limitation. The main "issue" is that sustained sequential scan speeds are generally less than an order of magnitude faster than drives. So as soon as you hit something that isn't limited by random IOPS, spindles get a chance to catch up. But those situations are few and far between in a heavy transactional setting.

Having used NVRAM/SSDs, I could never go back so long as the budget allows us to procure them. A data warehouse? Maybe spindles still have a place there. A heavy transactional system? Not a chance.

--
Shaun Thomas OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-676-8870 sthomas@optionshouse.com
On 5/22/2013 8:18 AM, Greg Smith wrote:
> That's an Intel 710 being crushed by a random read database server
> workload, unable to deliver even 3000 IOPS / 20MB/s. I have hours of
> data like this from several servers.

This is interesting. Do you know what it is about the workload that leads to the unusually low rps?
On 5/22/13 12:56 PM, Shaun Thomas wrote:
> Well, you may not be able to make that claim, but I can. While we don't
> use Intel SSDs, our first-gen FusionIO cards can deliver about 20k
> PostgreSQL TPS of our real-world data right off the device before
> caching effects start boosting the numbers.

I've seen FusionIO hit that 20K commit number, as well as hitting 75K IOPS on random reads (600MB/s). They are roughly 5 to 10X faster than the Intel 320/710 drives. There's a corresponding price hit though, and having to provision PCI-E cards is a pain in some systems. A claim that a FusionIO drive in particular is capable of 100X the performance of a spinning drive, that I wouldn't dispute. I even made that claim myself with some benchmark numbers to back it up: http://www.fusionio.com/blog/fusion-io-boosts-postgresql-performance/ That's not just a generic SSD anymore though.

> An 8-drive 15k RPM RAID-10 gave us about 1800 TPS back when we switched
> to FusionIO about two years ago. So, while Intel drives themselves may
> not be able to hit sustained 100x speeds over spindles, it's pretty
> clear that that's a firmware or implementation limitation.

1800 TPS to 20K TPS is just over a 10X speedup. As for Intel vs. FusionIO, rather than implementation quality it's more a question of what architecture you're willing to pay for. If you test a few models across Intel's product line, you can see there's a rough size vs. speed correlation: the larger units have more channels of flash going at the same time. FusionIO has architected things such that there is a wide write path even on their smallest cards. That 75K IOPS number I got even out of their little 80GB card (since dropped from the product line). I can buy a good number of Intel DC S3700 drives for what a FusionIO card costs though.

> The main "issue" is that sustained sequential scan speeds are generally
> less than an order of magnitude faster than drives. So as soon as you
> hit something that isn't limited by random IOPS, spindles get a chance
> to catch up.

I have some moderately fast SSD based transactional systems that are still using traditional drives with battery-backed cache for the sequential writes of the WAL volume, where the data volume is on Intel 710 disks. WAL writes really burn through flash cells, too, so keeping them on traditional drives can be cost effective in a few ways. That approach is lucky to hit 10K TPS though, so it can't compete against what a PCI-E card like the FusionIO drives are capable of.

--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
On 05/22/2013 01:06 PM, Greg Smith wrote:
> There's a corresponding price hit though, and having to provision PCI-E
> cards is a pain in some systems.

Oh, totally. Specialist devices like RAMSAN, FusionIO, Virident, or Whiptail are hideously expensive, even compared to high-end SSDs. I was just pointing out that the technical limitations of the underlying chips (NVRAM) can be overcome or augmented in ways Intel isn't doing (yet).

> 1800 TPS to 20K TPS is just over a 10X speedup.

True. But in that case, it was a single device pitted against 8 very high-end 15k RPM spindles. I'd need 80-100 drives in a massive SAN to get similar numbers, and at that point we're not really saving any money and have a lot more failure points and maintenance. I guess you get way more space, though. :)

> The larger units have more channels of flash going at the same time.
> FusionIO has architected things such that there is a wide write path
> even on their smallest cards.

Yep. And I've been watching these technologies like a hawk, waiting for the new chips and their performance profiles. Some of the newer chips have performance multipliers even on a single die in the larger sizes.

> I can buy a good number of Intel DC S3700 drives for what a FusionIO
> card costs though.

I know. :( But knowing the performance they can deliver, I often dream of a perfect device comprised of several PCIe-based NVRAM cards in a hot-swap PCIe enclosure (they exist!). Something like that in a 3U piece of gear would absolutely annihilate even the largest SAN. At the mere cost of a half million or so. :p

> I have some moderately fast SSD based transactional systems that are
> still using traditional drives with battery-backed cache for the
> sequential writes of the WAL volume, where the data volume is on
> Intel 710 disks.

That sounds like a very sane and recommendable approach, and coincidentally the same one we would use if we couldn't afford the FusionIO drives. I'm actually curious to see how ZFS, with its CoW profile and a bundle of SSDs as a ZIL, would compare. It's still disk-based, but the transparent SSD layer acting as a gigantic passive read and write cache intrigues me. It seems like it would also make a good middle ground concerning cost vs. performance.

--
Shaun Thomas OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-676-8870 sthomas@optionshouse.com
On 05/22/2013 12:31 PM, David Boreham wrote:
> This is interesting. Do you know what it is about the workload that
> leads to the unusually low rps?

That read rate and that throughput suggest 8k reads. The queue size is 270+, which is pretty high for a single device, even when it's an SSD. Some SSDs seem to break down on queue sizes over 4, and 15 sectors spread across a read queue of 270 is pretty harsh. The drive tested here basically fell over on servicing a huge, diverse read queue, which suggests a firmware issue. Often this is because the device was optimized for sequential reads and posts lower IOPS than is theoretically possible, so the vendor can advertise higher numbers alongside consumer-grade disks. They're Greg's disks though. :)

--
Shaun Thomas OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-676-8870 sthomas@optionshouse.com
On 05/22/2013 11:06 AM, Greg Smith wrote:
> I have some moderately fast SSD based transactional systems that are
> still using traditional drives with battery-backed cache for the
> sequential writes of the WAL volume, where the data volume is on Intel
> 710 disks. WAL writes really burn through flash cells, too, so keeping
> them on traditional drives can be cost effective in a few ways. That
> approach is lucky to hit 10K TPS though, so it can't compete against
> what a PCI-E card like the FusionIO drives are capable of.

Greg, can you elaborate on the SSD + xlog issue? What type of burn through are we talking about?

JD

--
Command Prompt, Inc. - http://www.commandprompt.com/ 509-416-6579
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc
For my dreams of your image that blossoms a rose in the deeps of my heart. - W.B. Yeats
On 5/22/13 3:06 PM, Joshua D. Drake wrote:
> Greg, can you elaborate on the SSD + xlog issue? What type of burn
> through are we talking about?

You're burning through flash cells at a multiple of the total WAL write volume. The system I gave iostat snapshots from upthread (with the Intel 710 hitting its limit) archives about 1TB of WAL each week. The actual amount of WAL written in terms of erased flash blocks is even higher though, because sometimes the flash is hit with partial page writes. The write amplification of WAL is much worse than the main database.

I gave a rough intro to this on the Intel drives at http://blog.2ndquadrant.com/intel_ssds_lifetime_and_the_32/ and there's a nice "Write endurance" table at http://www.tomshardware.com/reviews/ssd-710-enterprise-x25-e,3038-2.html

The cheapest of the Intel SSDs I have here only guarantees 15TB of total write endurance. Eliminating >1TB of writes per week by moving the WAL off SSD is a pretty significant change, even though the burn rate isn't a simple linear thing--you won't burn the flash out in only 15 weeks. The production server is actually using the higher grade 710 drives that aim for 900TB instead. But I do have standby servers using the low grade stuff, so anything I can do to decrease SSD burn rate without dropping performance is useful. And only the top tier of transaction rates will outrun a RAID1 pair of 15K drives dedicated to WAL.

--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
On Wed, May 22, 2013 at 2:30 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> The cheapest of the Intel SSDs I have here only guarantees 15TB of total
> write endurance. Eliminating >1TB of writes per week by moving the WAL
> off SSD is a pretty significant change, even though the burn rate isn't
> a simple linear thing--you won't burn the flash out in only 15 weeks.

Certainly, the Intel 320 is not designed for 1TB/week workloads.

> The production server is actually using the higher grade 710 drives
> that aim for 900TB instead. But I do have standby servers using the low
> grade stuff, so anything I can do to decrease SSD burn rate without
> dropping performance is useful. And only the top tier of transaction
> rates will outrun a RAID1 pair of 15K drives dedicated to WAL.

The S3700 is rated for 10 drive writes/day for 5 years. So, for the 200GB drive, that's 200GB * 10/day * 365 days * 5, which is 3.65 million gigabytes or ~3.5 petabytes. 1TB/week would take about 67 years to burn through / whatever you assume for write amplification / whatever extra penalty you apply if you are shooting for a >5 year duty cycle (flash degrades faster the older it is) *for a single 200GB device*. Write endurance is not a problem for this drive; in fact it's a very reasonable assumption that the faster worst case random performance is directly related to reduced write amplification. BTW, cost/PB of this drive is less than half that of the 710 (which IMO was obsolete the day the S3700 hit the street).

merlin
On 05/22/2013 02:51 PM, Merlin Moncure wrote:
> The S3700 is rated for 10 drive writes/day for 5 years. So, for the
> 200GB drive, that's 200GB * 10/day * 365 days * 5, which is 3.65 million
> gigabytes or ~3.5 petabytes.

Nice. And on that note: http://www.tomshardware.com/reviews/ssd-dc-s3700-raid-0-benchmarks,3480.html

They actually over-saturated the backplane with 24 of these drives in a giant RAID-0, tipping the scales at around 3.1M IOPS. Not bad for consumer-level drives. I'd love to see a RAID-10 of these. I'm having a hard time coming up with a database workload that would run into performance problems with a (relatively inexpensive) setup like this.

--
Shaun Thomas OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-676-8870 sthomas@optionshouse.com
On 5/22/13 3:51 PM, Merlin Moncure wrote:
> The S3700 is rated for 10 drive writes/day for 5 years. So, for the
> 200GB drive, that's 200GB * 10/day * 365 days * 5, which is 3.65 million
> gigabytes or ~3.5 petabytes.

Yes, they've improved on the 1.5PB that the 710 drives topped out at. For that particular drive, this is unlikely to be a problem. But I'm not willing to toss out longevity issues as therefore irrelevant in all cases. Some flash still costs a lot more than Intel's SSDs do, like the FusionIO products. Chop even a few percent of the wear out of the price tag on a RAMSAN and you've saved some real money. And there are some other products with interesting price/performance/capacity combinations that are also sensitive to wearout. Seagate's hybrid drives have turned interesting now that they cache writes safely, for example. There's no cheaper way to get 1TB with flash write speeds for small commits than that drive right now. (Test results on that drive coming soon, along with my full DC S3700 review.)

> btw, cost/PB of this drive is less than half that of the 710 (which IMO
> was obsolete the day the S3700 hit the street).

You bet, and I haven't recommended anyone buy a 710 since the announcement. However, "hit the street" is still an issue. No one has been able to keep DC S3700 drives in stock very well yet. It took me three tries through Newegg before my S3700 drive actually shipped.

--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
On Wed, May 22, 2013 at 3:06 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> You bet, and I haven't recommended anyone buy a 710 since the announcement.
> However, "hit the street" is still an issue. No one has been able to keep
> DC S3700 drives in stock very well yet. It took me three tries through
> Newegg before my S3700 drive actually shipped.

Well, let's look at the facts:
*) >2x write endurance vs 710 (500x 320)
*) 2-10x performance depending on workload specifics
*) much better worst case/average latency
*) half the cost of the 710!?

After obsoleting hard drives with the introduction of the 320/710, Intel managed to obsolete their *own* entire lineup with the S3700 (with the exception of the PCIe devices and the ultra low cost notebook $1/GB segment). I'm amazed these drives were sold at that price point: they could have been sold at 3-4x the current price and still have had a willing market (note, please don't do this). Presumably most of the inventory is being bought up by small channel resellers for a quick profit.

Even by the fast moving standards of the SSD world, this product is an absolute game changer and has ushered in the new era of fast storage with a loud 'gong'. Oh, the major vendors will still keep their rip-off going a little longer, selling their storage trays, RAID controllers, entry/mid level SANs, SAS HBAs etc. at huge markup to customers who don't need them (some will still need them, but the bar suddenly just got spectacularly raised before you have to look into enterprise gear). CRT was overtaken by LCD monitors in mid 2004 in terms of sales: for SSDs vs. hard drives I'd say it's late 2002/early 2003, at least for new deployments.

merlin
On May 22, 2013, at 4:06 PM, Greg Smith wrote: > And there are some other products with interesting price/performance/capacity combinations that are also sensitive to wearout. Seagate's hybrid drives have turned interesting now that they cache writes safely for example. There's no cheaper way to get 1TB with flash write speeds for small commits than that drive right now. (Test results on that drive coming soon, along with my full DC S3700 review) I am really looking forward to that. Will you announce here or just post on the 2ndQuadrant blog? Another "hybrid" solution is to run ZFS on some decent hard drives and then put the ZFS intent log on SSDs. With very synthetic benchmarks, the random write performance is excellent. All of these discussions about alternate storage media are great - everyone has different needs and there are certainly a number of deployments that can "get away" with spending much less money by adding some solid state storage. There's really an amazing number of options today… Thanks, Charles
On 05/22/2013 01:57 PM, Merlin Moncure wrote: > > On Wed, May 22, 2013 at 3:06 PM, Greg Smith <greg@2ndquadrant.com> wrote: >> You bet, and I haven't recommended anyone buy a 710 since the announcement. >> However, "hit the street" is still an issue. No one has been able to keep >> DC S3700 drives in stock very well yet. It took me three tries through >> Newegg before my S3700 drive actually shipped. > > Well, let's look a the facts: > *) >2x write endurance vs 710 (500x 320) > *) 2-10x performance depending on workload specifics > *) much better worst case/average latency > *) half the cost of the 710!? I am curious how the 710 or S3700 stacks up against the new M500 from Crucial? I know Intel is kind of the go-to for these things but the m500 is power off protected and rated at: Endurance: 72TB total bytes written (TBW), equal to 40GB per day for 5 years. Granted it isn't the fastest pig in the poke but it sure seems like a very reasonable drive for the price: http://www.newegg.com/Product/Product.aspx?Item=20-148-695&ParentOnly=1&IsVirtualParent=1 Sincerely, Joshua D. Drake -- Command Prompt, Inc. - http://www.commandprompt.com/ 509-416-6579 PostgreSQL Support, Training, Professional Services and Development High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc For my dreams of your image that blossoms a rose in the deeps of my heart. - W.B. Yeats
On Wed, May 22, 2013 at 5:42 PM, Joshua D. Drake <jd@commandprompt.com> wrote: > I am curious how the 710 or S3700 stacks up against the new M500 from > Crucial? I know Intel is kind of the goto for these things but the m500 is > power off protected and rated at: Endurance: 72TB total bytes written (TBW), > equal to 40GB per day for 5 years . I don't think the m500 is power safe (nor is any drive at the <1$/gb price point). This drive is positioned as a desktop class disk drive. AFAIK, the s3700 strongly outclasses all competitors on price, performance, or both. Once you give up enterprise features of endurance and iops you have many options (samsung 840 is another one). Pretty soon these types of drives are going to be standard kit in workstations (and we'll be back to the IDE era of corrupted data, ha!). I would recommend none of them for server class use; they are inferior in terms of $/iop and $/gb written. for server class drives, see: hitachi ssd400m (10$/gb, slower!), kingston e100, etc. merlin
On 05/22/2013 04:37 PM, Merlin Moncure wrote: > > On Wed, May 22, 2013 at 5:42 PM, Joshua D. Drake <jd@commandprompt.com> wrote: >> I am curious how the 710 or S3700 stacks up against the new M500 from >> Crucial? I know Intel is kind of the goto for these things but the m500 is >> power off protected and rated at: Endurance: 72TB total bytes written (TBW), >> equal to 40GB per day for 5 years . > > I don't think the m500 is power safe (nor is any drive at the <1$/gb > price point). According to the data sheet it is power safe. http://investors.micron.com/releasedetail.cfm?ReleaseID=732650 http://www.micron.com/products/solid-state-storage/client-ssd/m500-ssd Sincerely, JD -- Command Prompt, Inc. - http://www.commandprompt.com/ 509-416-6579 PostgreSQL Support, Training, Professional Services and Development High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc For my dreams of your image that blossoms a rose in the deeps of my heart. - W.B. Yeats
On 23/05/13 13:01, Joshua D. Drake wrote: > > On 05/22/2013 04:37 PM, Merlin Moncure wrote: >> >> On Wed, May 22, 2013 at 5:42 PM, Joshua D. Drake >> <jd@commandprompt.com> wrote: >>> I am curious how the 710 or S3700 stacks up against the new M500 from >>> Crucial? I know Intel is kind of the goto for these things but the >>> m500 is >>> power off protected and rated at: Endurance: 72TB total bytes >>> written (TBW), >>> equal to 40GB per day for 5 years . >> >> I don't think the m500 is power safe (nor is any drive at the <1$/gb >> price point). > > According the the data sheet it is power safe. > > http://investors.micron.com/releasedetail.cfm?ReleaseID=732650 > http://www.micron.com/products/solid-state-storage/client-ssd/m500-ssd > > Yeah - they apparently have a capacitor on board. Their write endurance is where they don't compare so favorably to the S3700 (they are *much* cheaper mind you): - M500 120GB drive: 40GB per day for 5 years - S3700 100GB drive: 1000GB per day for 5 years But great to see more reasonably priced SSD with power off protection. Cheers Mark
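Going the other way, a TBW rating converts into a daily write budget, or into an expected lifetime for a given workload. Below is a small sketch using the ratings quoted just above (72 TBW for the 120GB M500, 1000GB/day for 5 years for the 100GB S3700); the 50GB/day workload is a made-up example, not a measured figure.

    /* Turn TBW ratings into GB/day budgets and into lifetime at an assumed
     * workload.  Ratings are the ones quoted in this thread; the 50GB/day
     * figure is a hypothetical workload, not a measurement. */
    #include <stdio.h>

    static void report(const char *name, double tbw, double workload_gb_day)
    {
        double days = 365.0 * 5.0;                      /* 5 year warranty period */
        printf("%s: %.0f GB/day budget, ~%.1f years at %.0f GB/day\n",
               name, tbw * 1000.0 / days,
               tbw * 1000.0 / workload_gb_day / 365.0, workload_gb_day);
    }

    int main(void)
    {
        report("M500 120GB ", 72.0, 50.0);                       /* ~40 GB/day, ~4 years */
        report("S3700 100GB", 1000.0 * 365.0 * 5.0 / 1000.0, 50.0);  /* 1825 TBW */
        return 0;
    }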
On 5/22/13 6:42 PM, Joshua D. Drake wrote: > I am curious how the 710 or S3700 stacks up against the new M500 from > Crucial? I know Intel is kind of the goto for these things but the m500 > is power off protected and rated at: Endurance: 72TB total bytes written > (TBW), equal to 40GB per day for 5 years . The M500 is fine on paper; I have that one on my list of things to evaluate when I can. The general reliability of Crucial's consumer SSDs has looked good recently. I'm not going to recommend that one until I actually see one work as expected though. I'm waiting for one to pass by, or until I go on a new toy purchasing spree. What makes me step very carefully here is watching what Intel went through when they released their first supercap drive, the 320 series. If you look at the nastiest of the firmware bugs they had, like the infamous "8MB bug", a lot of them were related to the new clean shutdown feature. It's the type of firmware that takes some exposure to the real world to flush out the bugs. The last of the enthusiast SSD players who tried to take this job on was OCZ with the Vertex 3 Pro, and they never got that model quite right before abandoning it altogether. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
On 5/22/13 4:57 PM, Merlin Moncure wrote: > Oh, the major vendors will still keep their > rip-off going on a little longer selling their storage trays, raid > controllers, entry/mid level SANS, SAS HBAs etc at huge markup to > customers who don't need them (some will still need them, but the bar > suddenly just got spectacularly raised before you have to look into > enterprise gear). The angle to distinguish "enterprise" hardware is moving on to error related capabilities. Soon we'll see SAS drives with the 520 byte sectors and checksumming for example. And while SATA drives have advanced a long way, they haven't caught up with SAS for failure handling. It's still far too easy for a single crazy SATA device to force crippling bus resets for example. Individual SATA ports don't expect to share things with others, while SAS chains have a much better protocol for handling things. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
On 23/05/13 13:32, Mark Kirkwood wrote: > On 23/05/13 13:01, Joshua D. Drake wrote: >> >> On 05/22/2013 04:37 PM, Merlin Moncure wrote: >>> >>> On Wed, May 22, 2013 at 5:42 PM, Joshua D. Drake >>> <jd@commandprompt.com> wrote: >>>> I am curious how the 710 or S3700 stacks up against the new M500 from >>>> Crucial? I know Intel is kind of the goto for these things but the >>>> m500 is >>>> power off protected and rated at: Endurance: 72TB total bytes >>>> written (TBW), >>>> equal to 40GB per day for 5 years . >>> >>> I don't think the m500 is power safe (nor is any drive at the <1$/gb >>> price point). >> >> According the the data sheet it is power safe. >> >> http://investors.micron.com/releasedetail.cfm?ReleaseID=732650 >> http://www.micron.com/products/solid-state-storage/client-ssd/m500-ssd >> >> > > Yeah - they apparently have a capacitor on board. > Make that quite a few capacitors (top right corner): http://regmedia.co.uk/2013/05/07/m500_4.jpg
On Wednesday, May 22, 2013, Joshua D. Drake <jd@commandprompt.com> wrote:
>
> On 05/22/2013 04:37 PM, Merlin Moncure wrote:
>>
>> On Wed, May 22, 2013 at 5:42 PM, Joshua D. Drake <jd@commandprompt.com> wrote:
>>>
>>> I am curious how the 710 or S3700 stacks up against the new M500 from
>>> Crucial? I know Intel is kind of the goto for these things but the m500 is
>>> power off protected and rated at: Endurance: 72TB total bytes written (TBW),
>>> equal to 40GB per day for 5 years .
>>
>> I don't think the m500 is power safe (nor is any drive at the <1$/gb
>> price point).
>
> According the the data sheet it is power safe.
>
> http://investors.micron.com/releasedetail.cfm?ReleaseID=732650
> http://www.micron.com/products/solid-state-storage/client-ssd/m500-ssd
Wow, that seems like a pretty good deal then assuming it works and performs decently.
merlin
On 5/22/13 10:04 PM, Mark Kirkwood wrote: > Make that quite a few capacitors (top right corner): > http://regmedia.co.uk/2013/05/07/m500_4.jpg There are some more shots and descriptions of the internals in the excellent review at http://techreport.com/review/24666/crucial-m500-ssd-reviewed That also highlights the big problem with this drive that's kept me from buying one so far: "Unlike rivals Intel and Samsung, Crucial doesn't provide utility software with a built-in health indicator. The M500's payload of SMART attributes doesn't contain any references to flash wear or bytes written, either. Several of the SMART attributes are labeled "Vendor-specific," but you'll need to guess what they track and read the associated values using third-party software." That's a serious problem for most business use of this sort of drive. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
On 23/05/13 14:22, Greg Smith wrote: > On 5/22/13 10:04 PM, Mark Kirkwood wrote: >> Make that quite a few capacitors (top right corner): >> http://regmedia.co.uk/2013/05/07/m500_4.jpg > > There are some more shots and descriptions of the internals in the > excellent review at > http://techreport.com/review/24666/crucial-m500-ssd-reviewed > > That also highlights the big problem with this drive that's kept me > from buying one so far: > > "Unlike rivals Intel and Samsung, Crucial doesn't provide utility > software with a built-in health indicator. The M500's payload of SMART > attributes doesn't contain any references to flash wear or bytes > written, either. Several of the SMART attributes are labeled > "Vendor-specific," but you'll need to guess what they track and read > the associated values using third-party software." > > That's a serious problem for most business use of this sort of drive. > Agreed - I was thinking the same thing! Cheers Mark
On 23/05/13 14:26, Mark Kirkwood wrote: > On 23/05/13 14:22, Greg Smith wrote: >> On 5/22/13 10:04 PM, Mark Kirkwood wrote: >>> Make that quite a few capacitors (top right corner): >>> http://regmedia.co.uk/2013/05/07/m500_4.jpg >> >> There are some more shots and descriptions of the internals in the >> excellent review at >> http://techreport.com/review/24666/crucial-m500-ssd-reviewed >> >> That also highlights the big problem with this drive that's kept me >> from buying one so far: >> >> "Unlike rivals Intel and Samsung, Crucial doesn't provide utility >> software with a built-in health indicator. The M500's payload of >> SMART attributes doesn't contain any references to flash wear or >> bytes written, either. Several of the SMART attributes are labeled >> "Vendor-specific," but you'll need to guess what they track and read >> the associated values using third-party software." >> >> That's a serious problem for most business use of this sort of drive. >> > > Agreed - I was thinking the same thing! > > Having said that, there does seem to be a wear leveling counter in its SMART attributes - but, yes - I'd like to see indicators more similar to the level of detail that Intel provides. Cheers Mark
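Since the complaint here is that you need third-party software to read those counters, a rough C sketch of that kind of check follows: it shells out to smartmontools and prints any wear-related attribute lines it finds. It assumes smartctl is installed, that the drive exposes something along the lines of Intel's "Media_Wearout_Indicator" or a generic wear-leveling counter, and the /dev/sda device path is only an example.

    /* Rough sketch: list wear-related SMART attributes via smartmontools.
     * Assumes smartctl is installed; the device path and attribute names
     * are examples and vary by vendor (the M500 may not expose them all). */
    #define _POSIX_C_SOURCE 200809L
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        FILE *p = popen("smartctl -A /dev/sda", "r");   /* device path is an assumption */
        if (p == NULL) {
            perror("popen");
            return 1;
        }
        char line[512];
        while (fgets(line, sizeof line, p) != NULL) {
            if (strstr(line, "Wearout") || strstr(line, "Wear_Leveling") ||
                strstr(line, "Total_LBAs_Written"))
                fputs(line, stdout);                    /* print any matches found */
        }
        pclose(p);
        return 0;
    }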
On 05/22/2013 07:17 PM, Merlin Moncure wrote: > > According the the data sheet it is power safe. > > > > http://investors.micron.com/releasedetail.cfm?ReleaseID=732650 > > http://www.micron.com/products/solid-state-storage/client-ssd/m500-ssd > > Wow, that seems like a pretty good deal then assuming it works and > performs decently. Yeah that was my thinking. Sure it isn't an S3700 but for the money it is still faster than the comparable spindle configuration. JD > > merlin
On 05/22/2013 03:30 PM, Merlin Moncure wrote: > On Tue, May 21, 2013 at 7:19 PM, Greg Smith <greg@2ndquadrant.com> wrote: >> On 5/20/13 6:32 PM, Merlin Moncure wrote: [cut] >> The only really huge gain to be had using SSD is commit rate at a low client >> count. There you can easily do 5,000/second instead of a spinning disk that >> is closer to 100, for less than what the battery-backed RAID card along >> costs to speed up mechanical drives. My test server's 100GB DC S3700 was >> $250. That's still not two orders of magnitude faster though. > > That's most certainly *not* the only gain to be had: random read rates > of large databases (a very important metric for data analysis) can > easily hit 20k tps. So I'll stand by the figure. Another point: that > 5000k commit raid is sustained, whereas a raid card will spectacularly > degrade until the cache overflows; it's not fair to compare burst with > sustained performance. To hit 5000k sustained commit rate along with > good random read performance, you'd need a very expensive storage > system. Right now I'm working (not by choice) with a teir-1 storage > system (let's just say it rhymes with 'weefax') and I would trade it > for direct attached SSD in a heartbeat. > > Also, note that 3rd party benchmarking is showing the 3700 completely > smoking the 710 in database workloads (for example, see > http://www.anandtech.com/show/6433/intel-ssd-dc-s3700-200gb-review/6). [cut] Sorry for interrupting but on a related note I would like to know your opinions on what the anandtech review said about 3700 poor performance on "Oracle Swingbench", quoting the relevant part that you can find here (*) <quote> [..] There are two components to the Swingbench test we're running here: the database itself, and the redo log. The redo log stores all changes that are made to the database, which allows the database to be reconstructed in the event of a failure. In good DB design, these two would exist on separate storage systems, but in order to increase IO we combined them both for this test. Accesses to the DB end up being 8KB and random in nature, a definite strong suit of the S3700 as we've already shown. The redo log however consists of a bunch of 1KB - 1.5KB, QD1, sequential accesses. The S3700, like many of the newer controllers we've tested, isn't optimized for low queue depth, sub-4KB, sequential workloads like this. [..] </quote> Does this kind of scenario apply to postgresql wal files repo ? Thanks andrea (*) http://www.anandtech.com/show/6433/intel-ssd-dc-s3700-200gb-review/5
On Thu, May 23, 2013 at 1:56 AM, Andrea Suisani <sickpig@opinioni.net> wrote: > On 05/22/2013 03:30 PM, Merlin Moncure wrote: >> >> On Tue, May 21, 2013 at 7:19 PM, Greg Smith <greg@2ndquadrant.com> wrote: >>> >>> On 5/20/13 6:32 PM, Merlin Moncure wrote: > > > [cut] > > >>> The only really huge gain to be had using SSD is commit rate at a low >>> client >>> count. There you can easily do 5,000/second instead of a spinning disk >>> that >>> is closer to 100, for less than what the battery-backed RAID card along >>> costs to speed up mechanical drives. My test server's 100GB DC S3700 was >>> $250. That's still not two orders of magnitude faster though. >> >> >> That's most certainly *not* the only gain to be had: random read rates >> of large databases (a very important metric for data analysis) can >> easily hit 20k tps. So I'll stand by the figure. Another point: that >> 5000k commit raid is sustained, whereas a raid card will spectacularly >> degrade until the cache overflows; it's not fair to compare burst with >> sustained performance. To hit 5000k sustained commit rate along with >> good random read performance, you'd need a very expensive storage >> system. Right now I'm working (not by choice) with a teir-1 storage >> system (let's just say it rhymes with 'weefax') and I would trade it >> for direct attached SSD in a heartbeat. >> >> Also, note that 3rd party benchmarking is showing the 3700 completely >> smoking the 710 in database workloads (for example, see >> http://www.anandtech.com/show/6433/intel-ssd-dc-s3700-200gb-review/6). > > > [cut] > > Sorry for interrupting but on a related note I would like to know your > opinions on what the anandtech review said about 3700 poor performance > on "Oracle Swingbench", quoting the relevant part that you can find here (*) > > <quote> > > [..] There are two components to the Swingbench test we're running here: > the database itself, and the redo log. The redo log stores all changes that > are made to the database, which allows the database to be reconstructed in > the event of a failure. In good DB design, these two would exist on separate > storage systems, but in order to increase IO we combined them both for this > test. > Accesses to the DB end up being 8KB and random in nature, a definite strong > suit > of the S3700 as we've already shown. The redo log however consists of a > bunch > of 1KB - 1.5KB, QD1, sequential accesses. The S3700, like many of the newer > controllers we've tested, isn't optimized for low queue depth, sub-4KB, > sequential > workloads like this. [..] > > </quote> > > Does this kind of scenario apply to postgresql wal files repo ? huh -- I don't think so. wal file segments are 8kb aligned, ditto clog, etc. In XLogWrite(): /* OK to write the page(s) */ from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ; nbytes = npages * (Size) XLOG_BLCKSZ; <-- errno = 0; if (write(openLogFile, from, nbytes) != nbytes) { AFAICT, that's the only way you write out xlog. One thing I would definitely advise though is to disable full_page_writes if it's enabled. s3700 is aligned on 8kb blocks internally -- hm. merlin
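To make the alignment point concrete, here is a minimal sketch of the sizing rule visible in the XLogWrite() excerpt above, assuming the default 8kB XLOG_BLCKSZ: whatever amount of WAL needs flushing, the write covers whole pages, so even a tiny commit record goes out as an 8kB-aligned, 8kB-multiple write rather than the sub-4kB sequential pattern the review describes. This is an illustration only, not the backend's actual buffering logic.

    /* Illustration of WAL write sizing per the XLogWrite() excerpt above:
     * nbytes = npages * XLOG_BLCKSZ, so writes are always whole 8kB pages.
     * A sketch, not actual PostgreSQL code. */
    #include <stdio.h>

    #define XLOG_BLCKSZ 8192    /* compile-time default; see SHOW wal_block_size */

    static size_t wal_write_bytes(size_t pending_wal_bytes)
    {
        size_t npages = (pending_wal_bytes + XLOG_BLCKSZ - 1) / XLOG_BLCKSZ;
        return npages * (size_t) XLOG_BLCKSZ;
    }

    int main(void)
    {
        printf("~100 byte commit record -> %zu byte write\n", wal_write_bytes(100));
        printf("12000 bytes of WAL      -> %zu byte write\n", wal_write_bytes(12000));
        return 0;
    }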
On 05/23/2013 03:47 PM, Merlin Moncure wrote: [cut] >> <quote> >> >> [..] There are two components to the Swingbench test we're running here: >> the database itself, and the redo log. The redo log stores all changes that >> are made to the database, which allows the database to be reconstructed in >> the event of a failure. In good DB design, these two would exist on separate >> storage systems, but in order to increase IO we combined them both for this >> test. >> Accesses to the DB end up being 8KB and random in nature, a definite strong >> suit >> of the S3700 as we've already shown. The redo log however consists of a >> bunch >> of 1KB - 1.5KB, QD1, sequential accesses. The S3700, like many of the newer >> controllers we've tested, isn't optimized for low queue depth, sub-4KB, >> sequential >> workloads like this. [..] >> >> </quote> >> >> Does this kind of scenario apply to postgresql wal files repo ? > > huh -- I don't think so. wal file segments are 8kb aligned, ditto > clog, etc. In XLogWrite(): > > /* OK to write the page(s) */ > from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ; > nbytes = npages * (Size) XLOG_BLCKSZ; <-- > errno = 0; > if (write(openLogFile, from, nbytes) != nbytes) > { > > AFICT, that's the only way you write out xlog. One thing I would > definitely advise though is to disable partial page writes if it's > enabled. s3700 is algined on 8kb blocks internally -- hm. many thanks merlin for both the explanation and the good advice :) andrea
On 5/22/13 2:45 PM, Shaun Thomas wrote: > That read rate and that throughput suggest 8k reads. The queue size is > 270+, which is pretty high for a single device, even when it's an SSD. > Some SSDs seem to break down on queue sizes over 4, and 15 sectors > spread across a read queue of 270 is pretty hash. The drive tested here > basically fell over on servicing a huge diverse read queue, which > suggests a firmware issue. That's basically it. I don't know that I'd put the blame specifically onto a firmware issue without further evidence that's the case though. The last time I chased down a SSD performance issue like this it ended up being a Linux scheduler bug. One thing I plan to do for future SSD tests is to try and replicate this issue better, starting by increasing the number of clients to at least 300. Related: if anyone read my "Seeking PostgreSQL" talk last year, some of my Intel 320 results there were understating the drive's worst-case performance due to a testing setup error. I have a blog entry talking about what was wrong and how it slipped past me at http://highperfpostgres.com/2013/05/seeking-revisited-intel-320-series-and-ncq/ With that loose end sorted, I'll be kicking off a brand new round of SSD tests on a 24 core server here soon. All those will appear on my blog. The 320 drive is returning as the bang for buck champ, along with a DC S3700 and a Seagate 1TB Hybrid drive with NAND durable write cache. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com