Performances issues with SSD volume ? - Mailing list pgsql-admin

From Graeme B. Bell
Subject Performances issues with SSD volume ?
Date
Msg-id B6C7CB3A-4102-419D-A930-D78AD9811964@skogoglandskap.no
List pgsql-admin

> After the change, I had the following behavior, and I don't understand
> why : everything seems to work fine (load is ~6/7, when it was
> previously ~25/30 on the HDD server), so the SSD server is faster than
> the HDD one, and my apps run faster too, but after some time (can be 5
> minutes or 2 hours), the load average increases suddenly (can reach 150
> !) and does not decrease, so postgres and my application are almost
> unusable. (even small requests are in statement timeout)

=====

A braindump of ideas that might be worth investigating:


1. Cheaper SSDs tend to have high burst performance but poorer sustained write performance.
Your SSD's underlying sustained performance may actually be slower than its burst numbers suggest,
and it may be having trouble keeping up. Check online for reviews of your drive's sustained performance.
It shouldn't be lower than an HDD's, of course...

2. Your SSD, as it fills up, has to do more and more work to manage wear-levelling and garbage-collection of cells that
are being re-used, and more and more cell wipes and rewrites. Usually there is a reserve of cells that can be used
immediately, but then the controller has to start doing wipes and performance is crippled on some cheaper drives. It may
help your SSD to set your RAID partitions to use only 90% of the capacity of a fresh drive. The extra 10% significantly
reduces the complexity for the SSD controller to manage the disk and provides a greater pool of cells that can be used
for burst write activity and wear-levelling. Basically, cheap SSDs hate being 100% full.
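To make the over-provisioning idea concrete, here is a sketch (device name is a placeholder, sizes illustrative; do this on a fresh or securely-erased drive before building the array, not on one holding data):

```shell
# Find the drive's total size in bytes (/dev/sdX is a placeholder)
blockdev --getsize64 /dev/sdX
# Partition only ~90% of it, leaving the rest unallocated so the
# controller always has spare cells for GC and wear-levelling.
parted /dev/sdX mkpart primary 1MiB 90%
```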

3. Kernel: as mentioned, as new as possible. 3.18+ is best.

4. Remember to read up on readahead, filesystem 4k alignment, and IO scheduler choice.
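For reference, the readahead and scheduler knobs live here (device names are placeholders; values are examples, not recommendations, and need root):

```shell
# Current readahead, in 512-byte sectors
blockdev --getra /dev/sdX
# Set readahead (e.g. 8192 sectors = 4MB)
blockdev --setra 8192 /dev/sdX
# Active IO scheduler is shown in brackets
cat /sys/block/sdX/queue/scheduler
# Switch scheduler, e.g. to deadline
echo deadline > /sys/block/sdX/queue/scheduler
```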

5. Look for any other tasks that are running. For example, we saw local SSDs suffer under one of the schedulers when
bulk copying of hundreds of GB of data was occurring. There wasn't enough IO left for the random reads/writes to
maintain their performance. The scheduler was giving all the IO to the copying routine. It may have been NOOP or CFQ;
it wasn't deadline.

5b. Keep a note of when the problem occurs precisely. Use a script to monitor load if necessary. Think about what is
happening at those times. Check your logs.
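A minimal load-logging script along these lines (the log path is a placeholder; adjust to taste):

```shell
#!/bin/sh
# Sketch of a load logger: appends one timestamped sample of the
# 1-minute load average per call, for correlating spikes with logs later.
LOGFILE=${LOGFILE:-/tmp/loadavg.log}

log_load() {
    # The first field of /proc/loadavg is the 1-minute load average.
    printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" \
        "$(cut -d' ' -f1 /proc/loadavg)" >> "$LOGFILE"
}

log_load
# To sample once a minute, run instead:
#   while true; do log_load; sleep 60; done
```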

6. If you have a BBU, perhaps the slowness is occurring there. Enable direct mode on your RAID card and disable
read-ahead caching on the RAID controller. Check the battery and maybe even the battery controller too. We had a
battery/controller that, it turned out, reported 'working fine' 99% of the time, then randomly decided it wasn't fine,
before recovering soon after (it wasn't recharging or self-testing; it was just broken, it seemed, and had weird
overheating behaviour). Replacing them fixed the problem.

7. Firmware-update your SSDs to the latest versions. Try this last, after everything else. You can take disks out of
RAID one at a time to do this, presuming you have a hotspare or spare. Remember to check the array has fully resilvered
before you put the reflashed drive back in again. Could be a bad firmware.

8. As someone said, could be memory rather than disk related. Check your NUMA settings (I think I keep NUMA off) /
interleaving on.
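One common check and tweak in this area (not from the original post; `numactl` is a separate package, and the sysctl needs root):

```shell
# Inspect the NUMA layout: node count, per-node memory, distances
numactl --hardware
# Many database admins disable zone reclaim: 0 means allocate from
# remote nodes rather than aggressively reclaiming local pages.
sysctl -w vm.zone_reclaim_mode=0
```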

9. Random thought. Could it be something funny like you have code that is using lots of locks, and your apps or DB is
hitting deadlock at some level of activity? Check pg_locks to make sure. It would be strange if you had this now but not
with the HDDs, though. But maybe the SSD is letting more things start running together, and the lock problem is emerging
at some level of activity.
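A quick way to spot lock waits via pg_locks (assumes local psql access; user and database names are placeholders):

```shell
# Any rows here are lock requests currently waiting on another backend.
psql -U postgres -d yourdb -c \
  "SELECT locktype, relation::regclass AS relation, mode, pid, granted
   FROM pg_locks WHERE NOT granted;"
```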

10. Check your VM settings to make sure that you're not in a situation where lots of data is waiting to be cleared all
at once (e.g. dirty_background_bytes, dirty_bytes) and then stalling all other IO.
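The relevant sysctls look like this (the values below are purely illustrative, not a recommendation, and need root):

```shell
# Current writeback thresholds
sysctl vm.dirty_background_bytes vm.dirty_bytes
# Example: start background writeback at 64MB of dirty data and
# hard-stall writers at 512MB, so flushes stay small and frequent
# instead of arriving as one giant stall.
sysctl -w vm.dirty_background_bytes=67108864
sysctl -w vm.dirty_bytes=536870912
```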

11. Check your crontabs and cron.d to make sure nothing else is running which might lock tables or nuke performance.
Backup routines, manual vacuums/statistics, etc.
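A sweep of the usual cron locations might look like this (paths vary by distro; listing other users' crontabs needs root):

```shell
# System-wide cron entries
cat /etc/crontab
ls /etc/cron.d /etc/cron.hourly /etc/cron.daily
# Per-user crontabs, prefixed with the owning user
for u in $(cut -d: -f1 /etc/passwd); do
    crontab -l -u "$u" 2>/dev/null | sed "s/^/$u: /"
done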

12. Make sure you don't have wal_buffers set at some crazy high level, so that a transaction commit is causing insane
amounts of data to get cleared out of cache synchronously.
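Checking it is quick (assumes psql access to the server):

```shell
# Show the current setting. Since 9.1 the default of -1 auto-sizes
# wal_buffers to 1/32 of shared_buffers, capped at one 16MB WAL segment.
psql -c "SHOW wal_buffers;"
```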

13. Just out of curiosity are you using 2-stage/synchronous commit to the slave? Maybe the problem is on the slave.

14. Have you tested what happens if you go back to the HDDs, e.g. does the problem persist or go away? Maybe it's
coincidence that it arrived with the SSDs.

Graeme Bell
