Re: Performances issues with SSD volume ? - Mailing list pgsql-admin

From Thomas SIMON
Subject Re: Performances issues with SSD volume ?
Date
Msg-id 555DE32B.4000706@neteven.com
Whole thread Raw
In response to Re: Performances issues with SSD volume ?  (Glyn Astill <glynastill@yahoo.co.uk>)
Responses Re: Performances issues with SSD volume ?
List pgsql-admin
On 21/05/2015 14:57, Glyn Astill wrote:
>> From: Thomas SIMON <tsimon@neteven.com>
>> To: Glyn Astill <glynastill@yahoo.co.uk>
>> Cc: "pgsql-admin@postgresql.org" <pgsql-admin@postgresql.org>
>> Sent: Thursday, 21 May 2015, 13:12
>> Subject: Re: [ADMIN] Performances issues with SSD volume ?
>>
>> On 20/05/2015 18:50, Glyn Astill wrote:
>>>>   From: Thomas SIMON <tsimon@neteven.com>
>>>>   To: glynastill@yahoo.co.uk
>>>>   Cc: "pgsql-admin@postgresql.org"
>> <pgsql-admin@postgresql.org>
>>>>   Sent: Wednesday, 20 May 2015, 16:41
>>>>   Subject: Re: [ADMIN] Performances issues with SSD volume ?
>>>>
>>>>   Hi Glyn,
>>>>
>>>>   I'll try to answer these points.
>>>>
>>>>   I've run some benchmarks, and indeed 3.2 is not helping, not helping at all.
>>>>   I changed to 3.14 and the gap is quite big!
>>>>   With the pgbench RW test: 3.2 --> 4200 TPS; 3.14 --> 6900 TPS under the
>>>>   same conditions.
>>>>   With the pgbench RO test: 3.2 --> 37000 TPS; 3.14 --> 95000 TPS, same
>>>>   conditions too.
>>>   That's a start then.
>>>
>>>>   So it should be better now, but when the server was in production, even
>>>>   with the bad kernel, performance was already quite good before it
>>>>   quickly decreased.
>>>>   So I too think I have another configuration problem.
>>>>
>>>>   You say you're IO bound, so some output from sar / iostat / dstat and
>>>>   pg_stat_activity etc before and during the issue would be of use.
>>>>
>>>>   -> My server is not in production right now, so it is difficult to
>>>>   replay the production load and gather useful metrics.
>>>>   The best way I've found is to replay traffic from the logs with pgreplay.
>>>>   I hoped the server would break down again while replaying this traffic,
>>>>   but it never happened ... Another thing I can't understand ...
>>>>
>>>>   Below is my dstat output when I replay this traffic (and so when the
>>>>   server runs normally).
>>>>   I unfortunately have no outputs from when the server's performance
>>>>   decreased.
>>>   It's a shame we can't get any insight into activity on the server
>>>   during the issues.
>>>>   Other things you asked
>>>>
>>>>         System memory size : 256 GB
>>>>         SSD model numbers and how many : 4 SSD disks ; RAID 10 ; model
>>>>   INTEL SSDSC2BB480G4
>>>>         RAID controller : MegaRAID SAS 2208
>>>>         Partition alignments and stripe sizes : see fdisk below
>>>>         Kernel options : the config file is here :
>>>>
>> ftp://ftp.ovh.net/made-in-ovh/bzImage/3.14.43/config-3.14.43-xxxx-std-ipv6-64
>>>>         Filesystem used and mount options : ext4, see mtab below
>>>>         IO scheduler : noop [deadline] cfq for my SSD RAID volume
>>>>         PostgreSQL version and configuration : 9.3.5
>>>>
>>>>   max_connections=1800
>>>>   shared_buffers=8GB
>>>>   temp_buffers=32MB
>>>>   work_mem=100MB
>>>>   maintenance_work_mem=12GB
>>>>   bgwriter_lru_maxpages=200
>>>>   effective_io_concurrency=4
>>>>   wal_level=hot_standby
>>>>   wal_sync_method=fdatasync
>>>>   wal_writer_delay=2000ms
>>>>   commit_delay=1000
>>>>   checkpoint_segments=80
>>>>   checkpoint_timeout=15min
>>>>   checkpoint_completion_target=0.7
>>>>   archive_command='rsync ....'
>>>>   max_wal_senders=10
>>>>   wal_keep_segments=38600
>>>>   vacuum_defer_cleanup_age=100
>>>>   hot_standby = on
>>>>   max_standby_archive_delay = 5min
>>>>   max_standby_streaming_delay = 5min
>>>>   hot_standby_feedback = on
>>>>   random_page_cost = 1.0
>>>>   effective_cache_size = 240GB
>>>>   log_min_error_statement = warning
>>>>   log_min_duration_statement = 0
>>>>   log_checkpoints = on
>>>>   log_connections = on
>>>>   log_disconnections = on
>>>>   log_line_prefix = '%m|%u|%d|%c|'
>>>>   log_lock_waits = on
>>>>   log_statement = 'all'
>>>>   log_timezone = 'localtime'
>>>>   track_activities = on
>>>>   track_functions = pl
>>>>   track_activity_query_size = 8192
>>>>   autovacuum_max_workers = 5
>>>>   autovacuum_naptime = 30s
>>>>   autovacuum_vacuum_threshold = 40
>>>>   autovacuum_analyze_threshold = 20
>>>>   autovacuum_vacuum_scale_factor = 0.10
>>>>   autovacuum_analyze_scale_factor = 0.10
>>>>   autovacuum_vacuum_cost_delay = 5ms
>>>>   default_transaction_isolation = 'read committed'
>>>>   max_locks_per_transaction = 128
>>>>
>>>>
>>>>
>>>>         Connection pool sizing (pgpool2)
>>>>   num_init_children = 1790
>>>>   max_pool = 1
>>>   1800 is quite a lot of connections, and with max_pool=1 in pgpool
>>>   you're effectively just using pgpool as a proxy (as I recall, my memory
>>>   is a little fuzzy on pgpool now).  Unless your app is stateful in some
>>>   way or has unique users for each of those 1800 connections you should
>>>   lower the quantity of active connections.  A general starting point is
>>>   usually cpu cores * 2, so you could up max_pool and divide
>>>   num_init_children by the same amount.
>>>   Hard to say what you need to do without knowing what exactly you're
>>>   doing though.  What's the nature of the app(s)?
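>>>   [As a sketch of that arithmetic, shrinking num_init_children while
>>>   raising max_pool by the same factor might look like the following; the
>>>   core count is an assumed value, as the thread never states it.]

```shell
# Hypothetical resizing of the pgpool settings per the "cpu cores * 2"
# rule of thumb above; 32 cores is an assumption.
CORES=32
ACTIVE=$((CORES * 2))                 # target ~64 active backends
FACTOR=$((1790 / ACTIVE))             # 27: divide num_init_children by this
echo "num_init_children = $((1790 / FACTOR))"   # prints 66
echo "max_pool = $FACTOR"                       # prints 27
```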
>> Yes, we just use it as a proxy for now.
>> We have approximately 100 different active users, each making a varying
>> number of connections (twisted + zope apps).
>> The result is ~ 900 idle connections for ~ 60 active connections, but
>> sometimes (when stopping/starting prod) we need almost double the
>> connections, because some twisted services don't close their connections
>> immediately.
>> But this is the current (working) configuration, and I don't think
>> my disk performance issue is related to it.
> I think at this point you could do with going back and trying to reproduce
> the issue, then trace back up to pg_stat_activity to see what activity could
> be causing the disk i/o.  I assume you've tried to reproduce the disk issues
> with a simple disk benchmark like bonnie++?
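[A pg_stat_activity snapshot of the kind suggested above might look like the
sketch below on 9.3, run repeatedly while reproducing the load; the query is
illustrative only.]

```sql
-- Non-idle backends, slowest first (PostgreSQL 9.3 columns; 9.3 still
-- has the boolean "waiting" rather than the later wait_event columns).
SELECT pid, state, waiting,
       now() - query_start AS runtime,
       query
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY runtime DESC;
```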
Yes, I think the same thing. I will probably do this early tomorrow
morning.
I tried to reproduce the disk issues with different stress tests like
bonnie++, fio, and tsung, and I used a more realistic scenario with pgreplay
to replay my production traffic from the postgresql logfile.
However, I'm not sure how to diagnose the performance issues.
I mean, if I see the SSDs at 100% utilization, how can I figure out why
their behavior changes?
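[One common way to look past "the SSDs are at 100%" is extended iostat output
while the replay runs; the sketch below assumes the sysstat package is
installed.]

```shell
# Extended per-device stats every second while replaying traffic.
iostat -x 1
# Things to watch:
#   %util    - device saturation; pegged at 100% means the device is busy
#   await    - average per-request latency; rising await at 100% util
#              points at the device rather than at PostgreSQL
#   avgqu-sz - request queue depth building up ahead of the device
```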

I'm asking myself another question, about master/slave configuration.
For my test, I will set up my SSD server as a slave of the HDD server.
After that, I will promote it to master.
In case I still have performance issues and must roll back, am I
necessarily forced to completely rebuild my new slave (the HDD server) with
pg_basebackup (and wait some hours for the files to transfer), or can I
demote the old master directly to a slave without spending time on a
rebuild (as the files should be the same on both servers)?
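[For what it's worth: on 9.3, which predates pg_rewind (added in 9.5), the
promoted slave starts a new timeline, and the old master's files generally
stop being identical as soon as it processes any further writes, so the safe
rollback path is usually a fresh pg_basebackup or a checksummed rsync. If the
old master truly never diverged, re-attaching it as a standby would only need
a recovery.conf such as the sketch below; hostname and user are placeholders.]

```
# Hypothetical recovery.conf to re-attach the old master as a standby
# of the promoted SSD server ("ssd-server" is a placeholder hostname).
standby_mode = 'on'
primary_conninfo = 'host=ssd-server port=5432 user=replicator'
recovery_target_timeline = 'latest'   # follow the new master's timeline
```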

Thanks
Thomas

