Performances issues with SSD volume ? - Mailing list pgsql-admin
From | Graeme B. Bell |
---|---|
Subject | Performances issues with SSD volume ? |
Date | |
Msg-id | B6C7CB3A-4102-419D-A930-D78AD9811964@skogoglandskap.no |
Responses | Re: Performances issues with SSD volume ? |
List | pgsql-admin |
> After the change, I had the following behavior, and I don't understand
> why: everything seems to work fine (load is ~6/7, when it was
> previously ~25/30 on the HDD server), so the SSD server is faster than
> the HDD one, and my apps run faster too, but after some time (can be 5
> minutes or 2 hours), the load average increases suddenly (can reach
> 150!) and does not decrease, so postgres and my application are almost
> unusable. (even small requests are in statement timeout)

=====

A braindump of ideas that might be worth investigating:

1. Cheaper SSDs tend to have high burst performance and poorer sustained write performance. Your SSD may actually have a slower underlying performance that is having trouble keeping up. Check online for reviews of your drive's sustained performance. It shouldn't be lower than an HDD, of course...

2. As your SSD fills up, it has to do more and more work to manage wear-levelling and garbage collection of cells that are being re-used, and more and more cell wipes and rewrites. Usually there is a reserve of cells that can be used immediately, but once that is exhausted the controller has to start doing wipes, and performance is crippled on some cheaper drives. It may help your SSD to set your RAID partitions to use only 90% of the capacity of a fresh drive. The extra 10% significantly reduces the complexity for the SSD controller to manage the disk and provides a greater pool of cells that can be used for burst write activity and wear-levelling. Basically, cheap SSDs hate being 100% full.

3. Kernel, as mentioned: as new as possible. 3.18+ is best.

4. Remember to read up on readahead, filesystem 4k alignment, and IO scheduler choice.

5. Look for any other tasks that are running. For example, we saw local SSDs suffer under one of the schedulers when bulk copying of hundreds of GB of data was occurring. There wasn't enough IO left for the random reads/writes to maintain their performance. The scheduler was giving all the IO to the copying routine. It may have been NOOP or CFQ; it wasn't deadline.

5b. Keep a note of precisely when the problem occurs. Use a script to monitor load if necessary. Think about what is happening at those times. Check your logs.

6. If you have a BBU, perhaps the slowness is occurring there. Enable direct mode on your RAID card and disable read-ahead caching on the RAID controller. Check the battery and maybe even the battery controller too. We had a battery/controller that, it turned out, reported 'working fine' 99% of the time, then randomly decided it wasn't fine, before recovering soon after (it wasn't recharging or self-testing; it was just broken, it seemed, and had weird overheating behaviour). Replacing them fixed the problem.

7. Update the firmware on your SSDs to the latest version. Try this last, after everything else. You can take disks out of the RAID one at a time to do this, presuming you have a hot spare or spare. Remember to check the array has fully resilvered before you put the reflashed drive back in again. Could be a bad firmware.

8. As someone said, it could be memory related rather than disk related. Check your NUMA settings (I think I keep NUMA off / interleaving on).

9. Random thought. Could it be something funny, like you have code that is using lots of locks and your apps or DB are hitting deadlock at some level of activity? Check pg_locks to make sure. It would be strange if you had this now but not with the HDDs, though. But maybe the SSD is letting more things start running together, and the lock problem is emerging at some level of activity.

10. Check your VM settings (e.g. dirty_background_bytes, dirty_bytes) to make sure that you're not in a situation where lots of dirty data is waiting to be flushed all at once, stalling all other IO while it is written out.

11. Check your crontabs and cron.d to make sure nothing else is running which might lock tables or nuke performance: backup routines, manual vacuums/statistics, etc.

12. Make sure you don't have wal_buffers set at some crazy high level, so that a transaction commit is causing insane amounts of data to get cleared out of cache synchronously.

13. Just out of curiosity, are you using 2-stage/synchronous commit to the slave? Maybe the problem is on the slave.

14. Have you tested what happens if you go back to the HDDs, e.g. does the problem persist or go away? Maybe it's coincidence that it arrived with the SSDs.

Graeme Bell
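
PS. For point 5b, a load-logging script could look roughly like the sketch below. It assumes Python 3 on a Unix-like system (os.getloadavg is not available on Windows). The function names, the log path, the 10-second interval and the spike threshold of 50 are illustrative choices of mine, not anything taken from the setup discussed in this thread.

```python
import os
import time
from datetime import datetime, timezone


def format_sample(load1, load5, load15, when):
    """Render one log line: ISO timestamp plus the 1/5/15-minute load averages."""
    stamp = when.isoformat(timespec="seconds")
    return f"{stamp} load1={load1:.2f} load5={load5:.2f} load15={load15:.2f}"


def is_spike(load1, threshold):
    """True when the 1-minute load average has reached the (assumed) threshold."""
    return load1 >= threshold


def monitor(path="load.log", interval_s=10.0, threshold=50.0, max_samples=None):
    """Append timestamped load samples to `path`.

    Runs until interrupted when max_samples is None, otherwise stops after
    that many samples. Returns the number of samples written.
    """
    taken = 0
    with open(path, "a", encoding="utf-8") as log:
        while max_samples is None or taken < max_samples:
            load1, load5, load15 = os.getloadavg()
            line = format_sample(load1, load5, load15, datetime.now(timezone.utc))
            if is_spike(load1, threshold):
                # Mark the line so spikes are easy to grep for later.
                line += " SPIKE"
            log.write(line + "\n")
            log.flush()  # keep the log usable even if the box becomes unresponsive
            taken += 1
            if max_samples is None or taken < max_samples:
                time.sleep(interval_s)
    return taken
```

Calling monitor("load.log") in a background session and later grepping the file for SPIKE gives timestamps that can be lined up against the postgres logs, cron schedules and backup windows mentioned above.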