Thread: Performances issues with SSD volume ?

Performances issues with SSD volume ?

From
"Graeme B. Bell"
Date:

> After the change, I had the following behavior, and I don't understand
> why : everything seems to work fine (load is ~6/7, when it was
> previously ~25/30 on the HDD server), so the SSD server is faster than
> the HDD one, and my apps run faster too, but after some time (can be 5
> minutes or 2 hours), the load average increases suddenly (can reach 150
> !) and does not decrease, so postgres and my application are almost
> unusable. (even small requests are in statement timeout)

=====

A braindump of ideas that might be worth investigating


1. Cheaper SSDs tend to have high burst performance but poorer sustained write performance.
Your SSD may actually have slower underlying performance that is having trouble keeping up.
Check online for reviews of your drive's sustained performance. It shouldn't be lower than an HDD's, of course...

2. Your SSD, as it fills up, has to do more and more work to manage wear-levelling and garbage-collection of cells that
are being re-used, and more and more cell wipes and rewrites. Usually there is a reserve of cells that can be used
immediately, but then the controller has to start doing wipes and performance is crippled on some cheaper drives. It may
help your SSD to set your RAID partitions to use only 90% of the capacity of a fresh drive. The extra 10% significantly
reduces the complexity for the SSD controller to manage the disk and provides a greater pool of cells that can be used
for burst write activity and wear-levelling. Basically, cheap SSDs hate being 100% full.
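One rough way to act on point 2, sketched below under stated assumptions: the device name /dev/sdX and the 90% figure are illustrative placeholders, and the actual `parted` call is deliberately left commented out so nothing destructive runs.

```shell
# Sketch: compute a 90% partition end for a fresh SSD, leaving the last
# ~10% unpartitioned as spare working area for the controller.
# /dev/sdX is a placeholder - verify the device before partitioning anything.

ninety_percent_mib() {
    # $1 = total size in MiB; prints 90% of it, rounded down
    echo $(( $1 * 90 / 100 ))
}

DISK=/dev/sdX
if [ -b "$DISK" ]; then
    TOTAL_MIB=$(( $(lsblk -bdno SIZE "$DISK") / 1024 / 1024 ))
    END_MIB=$(ninety_percent_mib "$TOTAL_MIB")
    echo "Would partition $DISK from 1MiB to ${END_MIB}MiB of ${TOTAL_MIB}MiB"
    # parted "$DISK" mklabel gpt mkpart primary 1MiB "${END_MIB}MiB"
fi
```

This only helps on a fresh (or securely erased) drive; space that has already been written once has to be TRIMmed before the controller can reuse it.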

3. Kernel as mentioned, as new as possible. 3.18+ is best.

4. Remember to read up on readahead, filesystem 4k alignment, and IO scheduler choice.

5. Look for any other tasks that are running. For example, we saw local SSDs suffer under one of the schedulers when
bulk copying of hundreds of GB of data was occurring. There wasn't enough IO left for the random reads/writes to maintain
their performance. The scheduler was giving all the IO to the copying routine. May have been NOOP or CFQ; it wasn't
deadline.

5b. Keep a note of when the problem occurs, precisely. Use a script to monitor load if necessary. Think about what is
happening at those times. Check your logs.
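A minimal load logger along the lines of 5b might look like this. The log path and 10-second interval are arbitrary choices, and the sampling loop itself is left commented out so the snippet is side-effect free when sourced:

```shell
#!/bin/sh
# Append a timestamped 1-minute load average to a file, so the exact
# moment of each spike can later be lined up against the postgres logs.
LOGFILE=${LOGFILE:-/var/tmp/loadavg.log}

sample_load() {
    # prints e.g. "2015-05-21 15:52:03 6.42"
    read load1 _ < /proc/loadavg
    echo "$(date '+%Y-%m-%d %H:%M:%S') $load1"
}

# To run it, uncomment the loop and leave it going in screen/tmux:
# while :; do sample_load >> "$LOGFILE"; sleep 10; done
```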

6. If you have a BBU, perhaps the slowness is occurring there. Enable direct mode on your raid card and disable
read-ahead caching on the raid controller. Check the battery and maybe even the battery controller too. We had a
battery/controller that, it turned out, reported 'working fine' 99% of the time, then randomly decided it wasn't fine
before recovering soon after (it wasn't recharging or self-testing; it was just broken, it seemed, and had weird
overheating behaviour). Replacing them fixed the problem.

7. Firmware update your SSDs to the latest versions. Try this last, after everything else. You can take disks out of
RAID one at a time to do this, presuming you have a hotspare or spare. Remember to check the array has fully resilvered
before you put the reflashed drive back in again. Could be a bad firmware.

8. As someone said, could be memory rather than disk related. Check your NUMA settings (I think I keep NUMA off) /
interleaving on.

9. Random thought. Could it be something funny, like you have code that is using lots of locks and your apps or DB is
hitting deadlock at some level of activity? Check pg_locks to make sure. Would be strange if you had this now but not
with the HDDs, though. But maybe the SSD is letting more things start running together, and the lock problem is emerging
at some level of activity.
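For a quick look at point 9 during a spike, something like the following against pg_locks and pg_stat_activity gives a first impression. The connection options (-U postgres -d postgres) are assumptions to adjust for your setup, and the `waiting` column assumes a 9.1-9.5 era server as in this thread:

```shell
# Count granted vs. waiting locks; a growing "f" (not granted) count
# during the spike points at lock contention rather than raw IO.
psql -U postgres -d postgres -Atc \
    "SELECT granted, count(*) FROM pg_locks GROUP BY granted;"

# Which backends are stuck waiting, and for how long:
psql -U postgres -d postgres -c \
    "SELECT pid, waiting, now() - query_start AS runtime, query
       FROM pg_stat_activity WHERE waiting ORDER BY runtime DESC;"
```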

10. Check your VM settings to make sure that you're not in a situation where lots of data is waiting to be cleared all
at once, e.g. dirty_background_bytes, dirty_bytes, and then stalling all other IO.
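For reference, the knobs in point 10 can be inspected like this; the commented values are purely illustrative of the "flush early, in small bounded batches" approach, not recommendations for any particular machine:

```shell
# Show the current write-back thresholds (bytes settings override ratios):
sysctl vm.dirty_background_bytes vm.dirty_background_ratio \
       vm.dirty_bytes vm.dirty_ratio

# /etc/sysctl.conf-style fragment (illustrative numbers only):
# vm.dirty_background_bytes = 67108864   # start background flushing at 64 MB
# vm.dirty_bytes = 1073741824            # stall writers only beyond 1 GB
```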

11. Check your crontabs and cron.d to make sure nothing else is running which might lock tables or nuke performance.
Backup routines, manual vacuums/statistics, etc.

12. Make sure you don't have wal_buffers set at some crazy high level so that a transaction commit is causing insane
amounts of data to get cleared out of cache synchronously.

13. Just out of curiosity are you using 2-stage/synchronous commit to the slave? Maybe the problem is on the slave.

14. Have you tested what happens if you go back to the HDDs, e.g. does the problem persist or go away? Maybe it's
coincidence it arrived with the SSDs.

Graeme Bell

Re: Performances issues with SSD volume ?

From
"Graeme B. Bell"
Date:
On 21 May 2015, at 15:52, <pgsql-admin-owner@postgresql.org> wrote:

> I mean, if I see ssd are 100% full, how can I figure out why their
> behavior changes ?

I would say "don't try to figure it out, simply accept that no affordably priced SSD on the market will perform well
with 95-100% of its capacity used".
If you really want to know why it happens, google some documents about wear-levelling, the process of writing and
erasing flash cells, etc.
Basically, erasing is a slow process, and internal fragmentation inside cells makes SSDs slow. Your SSD has to do a LOT more
work when it's full of data than when it's empty; basically it has no spare 'working area'.

If your SSDs are 100% full, accept that your system is going to perform OK for a short while, then it will perform
extremely badly for a long time.
You must add more space.

Also, I suggest you should buy disks that have a high sustained performance when full (often this is achieved by having
hidden spare space they don't tell you about, or better controllers).

You should aim to have SSDs no more than 50-70% full in normal use, peaking at 80%ish occasionally, if you want to get
good consistent DB performance out of them.
And even then, cheap consumer SSDs will still exhibit slowdowns in performance during heavy periods of writes.

Look at this data:

http://www.thessdreview.com/our-reviews/samsung-850-evo-ssd-review-1tb-differing-series-controllers-compared/4/

As you can see, most consumer SSDs have a problem with sustained performance. Once full, at a certain level of activity
they struggle to keep up, and the steady-state performance is very poor compared to the advertised numbers (e.g.
60MB/second).

Some are worse than others.

Take a look here:

http://www.storagereview.com/samsung_ssd_845dc_evo_review

Look at the graph "preconditioning curve - 4K 100% Write [Max latency]"

You can see some SSDs have latency that shoots up to e.g. 900ms for a 4K write! That's almost 1 second. And average
writes as bad as 31ms.
Whereas the models with better controllers and more reserved space achieve average and worst case results 10x better.

Expensive SSDs are great. Cheap SSDs are great-ish for consumer loads and moderate production loads, but they vary
between 'bad' and 'ok' for heavy production loads.

Graeme Bell





Re: Performances issues with SSD volume ?

From
"Graeme B. Bell"
Date:
> No, I had read some megacli related docs about SSD, and the advice was
> to put writethrough on disks. (see
> http://wiki.mikejung.biz/LSI#Configure_LSI_Card_for_SSD_RAID), last section.
> Disks are already in "No Write Cache if Bad BBU" mode. (wrote on
> splitted line on my extract)


====

The advice in that link is for maximum performance, e.g. for fileservers where people are dumping documents, temporary
working space, and so on.

It is not advice for DB systems, which have unique demands in terms of persistence of data
writes.

For postgres, if you use WT with SSDs that are not tested as having data-in-flight protection via capacitors, YOU WILL
GET A CORRUPTED DB WITH WRITETHROUGH (the first time the power is cut). It is quite likely you will not be able to
recover that DB, except from backups. Potentially, the consequences of corrupted data could affect your slave, depending
on which version of postgres you're using.

Don't believe me? Test it yourself with that diskchecker.pl program I linked in the other post. You will see the
corruption happening to your data even when the disk assures you that it is safely stored.
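From memory, the diskchecker.pl procedure looks roughly like this; the host name, file name and size below are placeholders, so check the script's own usage text before relying on it:

```shell
# On a SECOND machine (one that survives the power cut), start a listener:
./diskchecker.pl -l

# On the machine under test, write test data, reporting to that listener:
./diskchecker.pl -s otherhost create test_file 500

# Now cut the power to the machine under test (really pull the plug),
# reboot it, and verify what actually reached stable storage:
./diskchecker.pl -s otherhost verify test_file
```

If `verify` reports blocks the disk had already acknowledged as written, the drive is lying about fsync and is not safe for a database.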

Do not use WT if you value your data or your uptime.

It may still be acceptable to use writethrough if you can accept a DB becoming SILENTLY CORRUPT after a power-cut
reboot. In some use cases that's OK (we have an experimental machine with software raid0 that we just reclone
occasionally from another machine, set for maximum performance; who cares if that db gets corrupt?).

Also, relating to your problem: the issue of unexplained load spikes that don't go away is something I've heard can
happen with corrupt dbs. Can anyone else contribute anecdotes?

So, I'm sorry to bring bad news, but there is a possibility your DB is already corrupt because of your previous use of
WT. 

Graeme.

Re: Performances issues with SSD volume ?

From
Glyn Astill
Date:
----- Original Message -----

> From: Graeme B. Bell <grb@skogoglandskap.no>
> To: "pgsql-admin@postgresql.org" <pgsql-admin@postgresql.org>
> Cc: "tsimon@neteven.com" <tsimon@neteven.com>
> Sent: Friday, 22 May 2015, 13:27
> Subject: Re: [ADMIN] Performances issues with SSD volume ?
>
>>  No, I had read some megacli related docs about SSD, and the advice was
>>  to put writethrough on disks. (see
>>  http://wiki.mikejung.biz/LSI#Configure_LSI_Card_for_SSD_RAID), last
> section.
>>  Disks are already in "No Write Cache if Bad BBU" mode. (wrote on
>>  splitted line on my extract)
>
>
> ====
>
> The advice in that link is for maximum performance e.g. for fileservers where
> people are dumping documents, temporary working space, and so on.
>
> It is not advice for maximum performance of DB systems which have unique demands
> in terms of persistence of data writes.
>
> For postgres, if you use WT with SSDs that are not tested as having
> data-in-flight protection via capacitors, YOU WILL GET A CORRUPTED DB WITH
> WRITETHROUGH (the first time the power is cut). It is quite likely you will not
> be able to recover that DB, except from backups. Potentially, the consequences
> of corrupted data could affect your slave depending on which version of postgres
> you're using.
If the cache on the SSD isn't safe then nothing you do elsewhere will protect the data.  The only thing you could try
is to disable the cache on the SSD itself, which would have severe performance and longevity penalties, as each write
would have to hit a full erase block.

Regardless; in this conversation there's no need for your doom, as Thomas has said he's using Intel S3500s, which have
supercaps, and from personal experience they are safe; it's news to me if they're not.


Re: Performances issues with SSD volume ?

From
Thomas SIMON
Date:
Hi Graeme, thanks for your 2 complete replies.

On 19/05/2015 16:26, Graeme B. Bell wrote:

After the change, I had the following behavior, and I don't understand 
why : everything seems to work fine (load is ~6/7, when it was 
previously ~25/30 on the HDD server), so the SSD server is faster than 
the HDD one, and my apps run faster too, but after some time (can be 5 
minutes or 2 hours), the load average increases suddenly (can reach 150 
!) and does not decrease, so postgres and my application are almost 
unusable. (even small requests are in statement timeout)
=====

A braindump of ideas that might be worth investigating 


1. Cheaper SSDs tend to have high burst performance and poorer sustained write performance.
Your SSD may actually have a slower underlying performance that is having trouble keeping up.
Check online for reviews of your drives sustained performance. Shouldn't be lower than an HDD of course... 

2. Your SSD, as it fills up, has to do more and more work to manage wear-levelling and garbage-collection of cells that are being re-used, and more and more cell wipes and rewrites. Usually there is a reserve of cells that can be used immediately but then the controller has to start doing wipes and performance is crippled on some cheaper drives. It may help your SSD to set your RAID partitions to use only 90% of the capacity of a fresh drive. The extra 10% significantly reduces the complexity for the SSD controller to manage the disk and provides a greater pool of cells that can be used for burst write activity and wear-levelling. Basically, cheap SSDs hate being 100% full.  

As you said in your other reply, Intel SSDs seem to be good disks.

I've got FastPath on my raid controller, that's why I set up WT on the SSDs.

megacli -ELF -ControllerFeatures -a0
                                    
Activated Advanced Software Options
---------------------------
Advanced Software Option          : MegaRAID FastPath
Mode             : Secured
Time Remaining   : Unlimited
...

So it should be OK for WT ?
4. Remember to readup on readahead, filesystem 4k alignment, and IO scheduler choice.
Scheduler choice is noop now. 4k alignment OK.

5. Look for any other tasks that are running. For example, we saw local SSDs suffer under one of the schedulers when bulk copying of hundreds of GB of data was occuring. There wasn't enough IO left for the random reads/writes to maintain their performance. The scheduler was giving all the IO to the copying routine. May have been NOOP or CFQ, it wasn't deadline.
My server is dedicated to postgres, no other tasks running.
6. If you have a BBU, perhaps the slowness is occuring there. Enable direct mode on your raid card and disable read-ahead caching on the raid controller. Check the battery and maybe even the battery controller too. We had a battery/controller that it turned out reported 'working fine' 99% of the time then randomly decided it wasn't fine, before recovering soon after (it wasn't recharging or self-testing, it was just broken, it seemed, and had weird overheating behaviour). Replacing them fixed the problem.  
My current config is the following one on the SSD raid volume:
Current Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU
I've found no errors like the one you had in your log dump.

7. Firmware update your SSDs to the latest versions. Try this last after everything else. You can take disks out of RAID one at a time to do this, presuming you have a hotspare or spare. Remember to check the array has fully resilvered before you put the reflashed drive back in again.  Could be a bad firmware.

8. As someone said, could be memory rather than disk related. Check your NUMA settings ( I think I keep NUMA off) / interleaving on. 
NUMA seems to be enabled. Not sure how to disable it; I've set vm.zone_reclaim_mode = 0 but it seems to be

numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 20 21 22 23 24 25 26 27 28 29
node 0 size: 128966 MB
node 0 free: 899 MB
node 1 cpus: 10 11 12 13 14 15 16 17 18 19 30 31 32 33 34 35 36 37 38 39
node 1 size: 129021 MB
node 1 free: 890 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10

I now use interleaving on with "numactl --interleave=all /etc/init.d/postgresql start"


9. Random thought. Could it be something funny like you have code that is using lots of locks and your apps or DB is hitting deadlock at some level of activity? Check pg_locks to make sure. Would be strange if you had this now but not with the HDDs though. But maybe the SSD is letting more things start running together, and the lock problem is emerging at some level of activity. 
That is indeed a possibility. I will check next time I do the switch.

10. Check your VM settings to make sure that you're not in a stuation where lots of data is waiting to be cleared all at once. e.g. dirty_background_bytes, dirty_bytes, and then stalling all other IO. 
Here are my parameters for this.
They seem to be low values, but I don't know if they are good or how to tune them.

vm.dirty_background_bytes = 8388608
vm.dirty_background_ratio = 0
vm.dirty_bytes = 67108864
vm.dirty_ratio = 0



11. Check your crontabs and cron.d to make sure nothing else is running which might lock tables or nuke performance. Backup routines, manual vacuums/statistics, etc. 
I've nothing here.

12. Make sure you don't have wal_buffers set at some crazy high level so that a transaction commit is causing insane amounts of data to get cleared out of cache synchronously. 
wal_buffers is now set to -1; in 9.3+ versions the setting is automatic.
13. Just out of curiosity are you using 2-stage/synchronous commit to the slave? Maybe the problem is on the slave. 
No, I'm using hot standby in asynchronous mode.

14. Have you tested what happens if you go back to the HDDs, e.g. does the problem persist or go away? Maybe it's coincidence it arrived with the SSDs. 
The problem, at these proportions, does not appear anymore when I go back to the HDD server. Behavior was lower performance, but stable over time.


Graeme Bell