Re: Performances issues with SSD volume ? - Mailing list pgsql-admin

From Thomas SIMON
Subject Re: Performances issues with SSD volume ?
Date
Msg-id 555DCBA6.1050607@neteven.com
In response to Re: Performances issues with SSD volume ?  (Glyn Astill <glynastill@yahoo.co.uk>)
Responses Re: Performances issues with SSD volume ?
List pgsql-admin
On 20/05/2015 18:50, Glyn Astill wrote:
>
>> From: Thomas SIMON <tsimon@neteven.com>
>> To: glynastill@yahoo.co.uk
>> Cc: "pgsql-admin@postgresql.org" <pgsql-admin@postgresql.org>
>> Sent: Wednesday, 20 May 2015, 16:41
>> Subject: Re: [ADMIN] Performances issues with SSD volume ?
>>
>> Hi Glyn,
>>
>> I'll try to answer these points.
>>
>> I've run some benchmarks, and indeed 3.2 is not helping, not helping at all.
>> I changed to 3.14 and the gap is quite big!
>> With the pgbench RW test: 3.2 --> 4200 TPS; 3.14 --> 6900 TPS under the same
>> conditions.
>> With the pgbench RO test: 3.2 --> 37000 TPS; 3.14 --> 95000 TPS, same
>> conditions too.
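For reference, these numbers correspond to standard pgbench read/write and
read-only runs, roughly like this (the scale factor and client counts below
are placeholders, not the exact values behind the figures above):

    pgbench -i -s 300 bench              # initialise the test database
    pgbench -c 32 -j 8 -T 300 bench      # read/write (default TPC-B-like) test
    pgbench -c 32 -j 8 -T 300 -S bench   # read-only (SELECT-only) test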
>
> That's a start then.
>
>> So things should be better now, but when the server was in production, even
>> with the bad kernel, performance was already quite good before it quickly
>> decreased.
>> So I also think I have another configuration problem.
>>
>> You say you're IO bound, so some output from sar / iostat / dstat and
>> pg_stat_activity etc before and during the issue would be of use.
>>
>> -> My server is not in production right now, so it is difficult to
>> replay the production load and get some useful metrics.
>> The best way I've found is to replay traffic from the logs with pgreplay.
>> I hoped that the server would break down again while replaying this traffic,
>> but it never happens... another thing I can't understand.
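For reference, the replay itself is roughly this (host, port and log file name
are placeholders); the logging settings shown further down (log_line_prefix =
'%m|%u|%d|%c|', log_statement = 'all', log_connections/log_disconnections) are
what pgreplay needs to parse the log, and -s can speed the replay up to push
the server harder:

    pgreplay -h 127.0.0.1 -p 5432 -s 2 postgresql-2015-05-XX.log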
>>
>> Below is my dstat output when I replay this traffic (and so when the server
>> runs normally).
>> Unfortunately I no longer have any output from when the server's performance
>> decreased.
> It's a shame we can't get any insight into activity on the server during the issues.
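If the slowdown does show up again, output like the following captured during
the incident (standard iostat / dstat / sar invocations, /dev/sdc being the SSD
volume listed below) would show whether the array is saturating:

    iostat -dmx 5 /dev/sdc
    dstat -tcdrm --disk-util 5
    sar -d -p 5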
>>
>> Other things you asked
>>
>>       System memory size : 256 GB
>>       SSD model numbers and how many : 4 SSD disks; RAID 10; model
>> INTEL SSDSC2BB480G4
>>       RAID controller : MegaRAID SAS 2208
>>       Partition alignments and stripe sizes : see fdisk below
>>       Kernel options : the config file is here :
>> ftp://ftp.ovh.net/made-in-ovh/bzImage/3.14.43/config-3.14.43-xxxx-std-ipv6-64
>>       Filesystem used and mount options : ext4, see mtab below
>>       IO scheduler : noop [deadline] cfq for my SSD RAID volume
>>       PostgreSQL version and configuration : 9.3.5
>>
>> max_connections=1800
>> shared_buffers=8GB
>> temp_buffers=32MB
>> work_mem=100MB
>> maintenance_work_mem=12GB
>> bgwriter_lru_maxpages=200
>> effective_io_concurrency=4
>> wal_level=hot_standby
>> wal_sync_method=fdatasync
>> wal_writer_delay=2000ms
>> commit_delay=1000
>> checkpoint_segments=80
>> checkpoint_timeout=15min
>> checkpoint_completion_target=0.7
>> archive_command='rsync ....'
>> max_wal_senders=10
>> wal_keep_segments=38600
>> vacuum_defer_cleanup_age=100
>> hot_standby = on
>> max_standby_archive_delay = 5min
>> max_standby_streaming_delay = 5min
>> hot_standby_feedback = on
>> random_page_cost = 1.0
>> effective_cache_size = 240GB
>> log_min_error_statement = warning
>> log_min_duration_statement = 0
>> log_checkpoints = on
>> log_connections = on
>> log_disconnections = on
>> log_line_prefix = '%m|%u|%d|%c|'
>> log_lock_waits = on
>> log_statement = 'all'
>> log_timezone = 'localtime'
>> track_activities = on
>> track_functions = pl
>> track_activity_query_size = 8192
>> autovacuum_max_workers = 5
>> autovacuum_naptime = 30s
>> autovacuum_vacuum_threshold = 40
>> autovacuum_analyze_threshold = 20
>> autovacuum_vacuum_scale_factor = 0.10
>> autovacuum_analyze_scale_factor = 0.10
>> autovacuum_vacuum_cost_delay = 5ms
>> default_transaction_isolation = 'read committed'
>> max_locks_per_transaction = 128
>>
>>
>>
>>       Connection pool sizing (pgpool2)
>> num_init_children = 1790
>> max_pool = 1
>
> 1800 is quite a lot of connections, and with max_pool=1 in pgpool you're effectively just using pgpool as a proxy (as
> I recall, my memory is a little fuzzy on pgpool now).  Unless your app is stateful in some way or has unique users for
> each of those 1800 connections you should lower the quantity of active connections.  A general starting point is usually
> cpu cores * 2, so you could up max_pool and divide num_init_children by the same amount.
>
> Hard to say what you need to do without knowing what exactly you're doing though.  What's the nature of the app(s)?
Yes, we just use it as a proxy for now.
We have approximately 100 different active users, each opening a varying
number of connections (Twisted + Zope apps).
The result is ~900 idle connections for ~60 active connections, but
sometimes (when stopping/starting prod) we need almost double that number
because some Twisted services don't close their connections immediately.
But this is the current (working) configuration, and I don't think
my disk performance problem is related to this.
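For reference, the arithmetic behind that "cpu cores * 2" starting point, with
a core count that is purely an assumption for illustration:

    # backends pgpool can open   ~= num_init_children * max_pool
    # target for active backends ~= 2 * CPU cores, e.g. 2 * 32 = 64
    num_init_children = 64
    max_pool = 1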
>
>> I also add megacli parameters :
>>
>> Virtual Drive: 2 (Target Id: 2)
>> Name                :datassd
>> RAID Level          : Primary-1, Secondary-0, RAID Level Qualifier-0
>> Size                : 893.25 GB
>> Sector Size         : 512
>> Is VD emulated      : Yes
>> Mirror Data         : 893.25 GB
>> State               : Optimal
>> Strip Size          : 256 KB
>> Number Of Drives per span:2
>> Span Depth          : 2
>> Default Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write
>> Cache if Bad BBU
>> Current Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write
>> Cache if Bad BBU
>> Default Access Policy: Read/Write
>> Current Access Policy: Read/Write
>> Disk Cache Policy   : Enabled
>> Encryption Type     : None
>> Bad Blocks Exist: No
>> PI type: No PI
>>
>> Is VD Cached: No
>
> Not using your raid controller's write cache then?  Not sure just how important that is with SSDs these days, but if
> you've got a BBU set it to "WriteBack". Also change "Cache if Bad BBU" to "No Write Cache if Bad BBU" if you do that.
No, I had read some megacli-related docs about SSDs, and the advice was
to use WriteThrough on the disks (see
http://wiki.mikejung.biz/LSI#Configure_LSI_Card_for_SSD_RAID, last section).
The disks are already in "No Write Cache if Bad BBU" mode (it's written on a
split line in my extract above).
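For reference, checking (and, if we ever revisit WriteBack, changing) the cache
policy on that virtual drive is along these lines; exact syntax varies a little
between MegaCli versions, so treat this as a sketch for target ID 2 on adapter 0:

    MegaCli -LDGetProp -Cache -L2 -a0            # show current cache policy
    MegaCli -LDSetProp WB -L2 -a0                # switch to WriteBack
    MegaCli -LDSetProp NoCachedBadBBU -L2 -a0    # no write cache if the BBU fails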
>
>
>> Other outputs :
>>       fdisk -l
>>
>> Disk /dev/sdc: 959.1 GB, 959119884288 bytes
>> 255 heads, 63 sectors/track, 116606 cylinders, total 1873281024 sectors
>> Units = sectors of 1 * 512 = 512 bytes
>> Sector size (logical/physical): 512 bytes / 4096 bytes
>> I/O size (minimum/optimal): 4096 bytes / 4096 bytes
>> Disk identifier: 0x00000000
>>
>> Disk /dev/mapper/vg_datassd-lv_datassd: 751.6 GB, 751619276800 bytes
>> 255 heads, 63 sectors/track, 91379 cylinders, total 1468006400 sectors
>> Units = sectors of 1 * 512 = 512 bytes
>> Sector size (logical/physical): 512 bytes / 4096 bytes
>> I/O size (minimum/optimal): 4096 bytes / 4096 bytes
>> Disk identifier: 0x00000000
>>
>>
>>       cat /etc/mtab
>> /dev/mapper/vg_datassd-lv_datassd /datassd ext4
>> rw,relatime,discard,nobarrier,data=ordered 0 0
>> (I added the nobarrier option)
>>
>>
>>       cat /sys/block/sdc/queue/scheduler
>> noop [deadline] cfq
>>
>
> You could swap relatime for noatime,nodiratime.
I'll swap to noatime, thanks.
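The fstab line would then look roughly like this (keeping the existing
discard/nobarrier options), and a remount applies it without a reboot:

    /dev/mapper/vg_datassd-lv_datassd /datassd ext4 noatime,nodiratime,discard,nobarrier,data=ordered 0 0

    mount -o remount,noatime,nodiratime /datassd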
>
>>       sysctl kernel | grep sched
>> kernel.sched_child_runs_first = 0
>> kernel.sched_rr_timeslice_ms = 25
>> kernel.sched_rt_period_us = 1000000
>> kernel.sched_rt_runtime_us = 950000
>>
>> I've read some advice about changing kernel.sched_autogroup_enabled=0
>> and kernel.sched_migration_cost_ns=5000000, but these parameters are not
>> recognized by my kernel. So I don't know what to do with that...
> sched_migration_cost_ns would be called sched_migration_cost in your old 3.2 kernel, not sure why
> sched_autogroup_enabled wouldn't be recognized though.
I've found in the config file that "CONFIG_SCHED_AUTOGROUP is not set" for
this kernel.
So I guess that's the same thing as if it were enabled=0, right?
I haven't found any parameter related to migration_cost in the config file.
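A quick way to double-check what the running kernel actually exposes (the
config file path is a placeholder for wherever the downloaded config lives):

    sysctl -a 2>/dev/null | grep -E 'sched_(autogroup|migration)'
    grep SCHED_AUTOGROUP /path/to/config-3.14.43
    # if the knobs were present, the usual settings would be:
    #   sysctl -w kernel.sched_autogroup_enabled=0
    #   sysctl -w kernel.sched_migration_cost_ns=5000000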


>
>>       sysctl vm
>> vm.admin_reserve_kbytes = 8192
>> vm.block_dump = 0
>> vm.dirty_background_bytes = 8388608
>> vm.dirty_background_ratio = 0
>> vm.dirty_bytes = 67108864
>> vm.dirty_expire_centisecs = 3000
>> vm.dirty_ratio = 0
>> vm.dirty_writeback_centisecs = 500
>> vm.drop_caches = 3
>> vm.laptop_mode = 0
>> vm.legacy_va_layout = 0
>> vm.lowmem_reserve_ratio = 256    256    32
>> vm.max_map_count = 65530
>> vm.memory_failure_early_kill = 0
>> vm.memory_failure_recovery = 1
>> vm.min_free_kbytes = 65008
>> vm.min_slab_ratio = 5
>> vm.min_unmapped_ratio = 1
>> vm.mmap_min_addr = 4096
>> vm.nr_pdflush_threads = 0
>> vm.numa_zonelist_order = default
>> vm.oom_dump_tasks = 1
>> vm.oom_kill_allocating_task = 0
>> vm.overcommit_kbytes = 0
>> vm.overcommit_memory = 2
>> vm.overcommit_ratio = 50
>> vm.page-cluster = 3
>> vm.panic_on_oom = 0
>> vm.percpu_pagelist_fraction = 0
>> vm.scan_unevictable_pages = 0
>> vm.stat_interval = 1
>> vm.swappiness = 60
>> vm.user_reserve_kbytes = 131072
>> vm.vfs_cache_pressure = 100
>> vm.zone_reclaim_mode = 0
>>
>>
>>       select * from pg_stat_activity
>> I've got hundreds of entries for that when I'm in production, and I can't
>> paste them here due to confidentiality.
>> Per day, it is usually around 50 million queries (35% selects, 55% updates,
>> 5% inserts).
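A summarised view avoids the confidentiality problem, e.g. (9.3 still has the
boolean "waiting" column rather than wait events):

    SELECT state, waiting, count(*)
    FROM pg_stat_activity
    GROUP BY state, waiting
    ORDER BY count(*) DESC;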
>>
>>
>>       lspci | grep -E 'RAID|SCSI|IDE|SATA'
>> 00:1f.2 SATA controller: Intel Corporation C600/X79 series chipset
>> 6-Port SATA AHCI Controller (rev 06)
>> 02:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 2208
>> [Thunderbolt] (rev 05)
>> 07:00.0 Serial Attached SCSI controller: Intel Corporation C602 chipset
>> 4-Port SATA Storage Control Unit (rev 06)
>>
>> Thanks
>>
>> Thomas
>>
>>
>> On 18/05/2015 16:29, Glyn Astill wrote:
>>>>   From: Koray Eyidoğan <korayey@gmail.com>
>>>>   To: Thomas SIMON <tsimon@neteven.com>
>>>>   Cc: pgsql-admin@postgresql.org
>>>>   Sent: Monday, 18 May 2015, 14:51
>>>>   Subject: Re: [ADMIN] Performances issues with SSD volume ?
>>>>
>>>>
>>>>
>>>>   Hi Thomas,
>>>>
>>>>
>>>>   3.2 kernel may be #1 cause of your I/O load problem:
>>>>
>>>>
>>>>
>>>>   http://www.databasesoup.com/2014/09/why-you-need-to-avoid-linux-kernel-32.html
>>>>
>>>>   https://medium.com/postgresql-talk/benchmarking-postgresql-with-different-linux-kernel-versions-on-ubuntu-lts-e61d57b70dd4
>>>>
>>>>
>>>>   Have a nice day.
>>>>
>>>>
>>>>   Koray
>>>   Likely 3.2 kernel isn't helping, but I think we need much more
>>>   information before jumping to conclusions.
>>>   You say you're IO bound, so some output from sar / iostat / dstat and
>>>   pg_stat_activity etc before and during the issue would be of use.
>>>   Also:
>>>        System memory size
>>>
>>>        SSD Model numbers and how many
>>>        Raid controller
>>>        Partition alignments and stripe sizes
>>>        Kernel options
>>>        Filesystem used and mount options
>>>        IO Scheduler
>>>        Postgresql version and configuration
>>>        Connection pool sizing
>>>
>>>
>>>   Perhaps you could thow us the output of some of these:
>>>
>>>        fdisk -l
>>>        cat /etc/mtab
>>>        cat /sys/block/<ssd device>/queue/scheduler
>>>        sysctl kernel | grep sched
>>>        sysctl vm
>>>        select * from pg_stat_activity
>>>        select name, setting from pg_settings
>>>        lspci | grep -E 'RAID|SCSI|IDE|SATA'
>>
>>


