Re: Performances issues with SSD volume ? - Mailing list pgsql-admin
From | Thomas SIMON
---|---
Subject | Re: Performances issues with SSD volume ?
Date |
Msg-id | 555DE32B.4000706@neteven.com
In response to | Re: Performances issues with SSD volume ? (Glyn Astill <glynastill@yahoo.co.uk>)
Responses | Re: Performances issues with SSD volume ?
List | pgsql-admin
On 21/05/2015 14:57, Glyn Astill wrote:
>> From: Thomas SIMON <tsimon@neteven.com>
>> To: Glyn Astill <glynastill@yahoo.co.uk>
>> Cc: "pgsql-admin@postgresql.org" <pgsql-admin@postgresql.org>
>> Sent: Thursday, 21 May 2015, 13:12
>> Subject: Re: [ADMIN] Performances issues with SSD volume ?
>>
>> On 20/05/2015 18:50, Glyn Astill wrote:
>>>> From: Thomas SIMON <tsimon@neteven.com>
>>>> To: glynastill@yahoo.co.uk
>>>> Cc: "pgsql-admin@postgresql.org" <pgsql-admin@postgresql.org>
>>>> Sent: Wednesday, 20 May 2015, 16:41
>>>> Subject: Re: [ADMIN] Performances issues with SSD volume ?
>>>>
>>>> Hi Glyn,
>>>>
>>>> I'll try to answer these points.
>>>>
>>>> I've run some benchmarks, and indeed kernel 3.2 is not helping, not helping at all.
>>>> I switched to 3.14 and the gap is quite big!
>>>> With the pgbench RW test: 3.2 --> 4200 TPS; 3.14 --> 6900 TPS under the same conditions.
>>>> With the pgbench RO test: 3.2 --> 37000 TPS; 3.14 --> 95000 TPS, same conditions too.
>>> That's a start then.
>>>
>>>> So things should be better now; but when the server was in production, even with the bad kernel, performance was already quite good before it quickly dropped.
>>>> So I too think I have another configuration problem.
>>>>
>>>> You say you're IO bound, so some output from sar / iostat / dstat and pg_stat_activity etc before and during the issue would be of use.
>>>>
>>>> -> My server is not in production right now, so it is difficult to replay the production load and gather useful metrics.
>>>> The best way I've found is to replay traffic from the logs with pgreplay.
>>>> I hoped the server would break down again while replaying this traffic, but it never happens... another thing I can't understand.
>>>>
>>>> Below is my dstat output while I replay this traffic (i.e. while the server runs normally).
>>>> Unfortunately I have no output from when the server's performance degraded.
>>> It's a shame we can't get any insight into activity on the server during the issues.
>>>> Other things you asked:
>>>>
>>>> System memory size: 256 GB
>>>> SSD model numbers and how many: 4 SSD disks; RAID 10; model INTEL SSDSC2BB480G4
>>>> RAID controller: MegaRAID SAS 2208
>>>> Partition alignments and stripe sizes: see fdisk below
>>>> Kernel options: the config file is here: ftp://ftp.ovh.net/made-in-ovh/bzImage/3.14.43/config-3.14.43-xxxx-std-ipv6-64
>>>> Filesystem used and mount options: ext4, see mtab below
>>>> IO scheduler: noop [deadline] cfq for my SSD RAID volume
>>>> PostgreSQL version and configuration: 9.3.5
>>>>
>>>> max_connections=1800
>>>> shared_buffers=8GB
>>>> temp_buffers=32MB
>>>> work_mem=100MB
>>>> maintenance_work_mem=12GB
>>>> bgwriter_lru_maxpages=200
>>>> effective_io_concurrency=4
>>>> wal_level=hot_standby
>>>> wal_sync_method=fdatasync
>>>> wal_writer_delay=2000ms
>>>> commit_delay=1000
>>>> checkpoint_segments=80
>>>> checkpoint_timeout=15min
>>>> checkpoint_completion_target=0.7
>>>> archive_command='rsync ....'
>>>> max_wal_senders=10
>>>> wal_keep_segments=38600
>>>> vacuum_defer_cleanup_age=100
>>>> hot_standby = on
>>>> max_standby_archive_delay = 5min
>>>> max_standby_streaming_delay = 5min
>>>> hot_standby_feedback = on
>>>> random_page_cost = 1.0
>>>> effective_cache_size = 240GB
>>>> log_min_error_statement = warning
>>>> log_min_duration_statement = 0
>>>> log_checkpoints = on
>>>> log_connections = on
>>>> log_disconnections = on
>>>> log_line_prefix = '%m|%u|%d|%c|'
>>>> log_lock_waits = on
>>>> log_statement = 'all'
>>>> log_timezone = 'localtime'
>>>> track_activities = on
>>>> track_functions = pl
>>>> track_activity_query_size = 8192
>>>> autovacuum_max_workers = 5
>>>> autovacuum_naptime = 30s
>>>> autovacuum_vacuum_threshold = 40
>>>> autovacuum_analyze_threshold = 20
>>>> autovacuum_vacuum_scale_factor = 0.10
>>>> autovacuum_analyze_scale_factor = 0.10
>>>> autovacuum_vacuum_cost_delay = 5ms
>>>> default_transaction_isolation = 'read committed'
>>>> max_locks_per_transaction = 128
>>>>
>>>>
>>>> Connection pool sizing (pgpool2):
>>>> num_init_children = 1790
>>>> max_pool = 1
>>> 1800 is quite a lot of connections, and with max_pool=1 in pgpool you're effectively just using pgpool as a proxy (as I recall, my memory is a little fuzzy on pgpool now). Unless your app is stateful in some way or has unique users for each of those 1800 connections, you should lower the number of active connections. A general starting point is usually CPU cores * 2, so you could raise max_pool and divide num_init_children by the same amount.
>>> Hard to say what you need to do without knowing what exactly you're doing though. What's the nature of the app(s)?
>> Yes, we just use it as a proxy for now.
>> We have approximately 100 different active users, each opening a varying number of connections (twisted + zope apps).
>> The result is ~900 idle connections for ~60 active connections, but sometimes (when stopping/starting prod) we need almost double that, because some twisted services don't close their connections immediately.
>> But this is the current (working) configuration, and I don't think my disk performance problem is related to this.
> I think at this point you could do with going back and trying to reproduce the issue, then trace back up to pg_stat_activity to see what activity could be causing the disk I/O. I assume you've tried to reproduce the disk issues with a simple disk benchmark like bonnie++?

Yes, I think the same thing. I will probably do this early tomorrow morning.
I have tried to reproduce the disk issues with different stress tests like bonnie++, fio and tsung, and I use a more realistic scenario with pgreplay to replay my production traffic from the PostgreSQL logfile.
However, I'm not sure how to diagnose the performance issues. I mean, if I see the SSDs at 100%, how can I figure out why their behaviour changes?

I'm asking myself another question, about the master/slave configuration.
For my test, I will set up my SSD server as a slave of the HDD server, and then promote it to master.
If I still have performance issues and must roll back, am I necessarily forced to rebuild my new slave (the HDD server) completely with pg_basebackup (and wait several hours for the files to transfer), or can I re-attach the old master directly as a slave without waiting for a rebuild (since the files should be the same on both servers)?

Thanks
Thomas
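
As a rough illustration of the pgbench comparison mentioned earlier in the thread, the runs could look like the following; the scale factor, client count and duration here are illustrative assumptions, not the values actually used.

    # initialise a test database (scale factor is an assumption)
    pgbench -i -s 100 bench
    # read/write test
    pgbench -c 32 -j 8 -T 300 bench
    # read-only test (built-in SELECT-only script)
    pgbench -c 32 -j 8 -T 300 -S bench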
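
On the question of seeing what is driving the disks when they look saturated: one rough approach is to correlate device-level statistics with the active backends at that moment. A minimal sketch, assuming shell and psql access to the box ("mydb" is a placeholder) and the 9.3 pg_stat_activity layout (boolean "waiting" column, no wait events):

    # device-level view: utilisation, await and queue size every 5 seconds
    iostat -x 5
    # what the backends are doing at the same time
    psql -d mydb -c "SELECT pid, usename, state, waiting, now() - query_start AS runtime, left(query, 80) AS query FROM pg_stat_activity WHERE state <> 'idle' ORDER BY runtime DESC;"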
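
For the test setup described above (the SSD server replicating from the HDD server before being promoted), a minimal 9.3-style recovery.conf on the standby would look roughly like this; the hostname and replication user are placeholders:

    standby_mode = 'on'
    primary_conninfo = 'host=hdd-master port=5432 user=replicator'
    recovery_target_timeline = 'latest'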