Re: Performances issues with SSD volume ? - Mailing list pgsql-admin

From Glyn Astill
Subject Re: Performances issues with SSD volume ?
Date
Msg-id 1300059800.5810274.1432213032449.JavaMail.yahoo@mail.yahoo.com
Whole thread Raw
In response to Re: Performances issues with SSD volume ?  (Thomas SIMON <tsimon@neteven.com>)
Responses Re: Performances issues with SSD volume ?
List pgsql-admin
> From: Thomas SIMON <tsimon@neteven.com>

> To: Glyn Astill <glynastill@yahoo.co.uk>
> Cc: "pgsql-admin@postgresql.org" <pgsql-admin@postgresql.org>
> Sent: Thursday, 21 May 2015, 13:12
> Subject: Re: [ADMIN] Performances issues with SSD volume ?
>
> Le 20/05/2015 18:50, Glyn Astill a écrit :
>>
>>>  From: Thomas SIMON <tsimon@neteven.com>
>>>  To: glynastill@yahoo.co.uk
>>>  Cc: "pgsql-admin@postgresql.org"
> <pgsql-admin@postgresql.org>
>>>  Sent: Wednesday, 20 May 2015, 16:41
>>>  Subject: Re: [ADMIN] Performances issues with SSD volume ?
>>>
>>>  Hi Glyn,
>>>
>>>  I'll try to answer this points.
>>>
>>>  I've made some benchs, and indeed 3.2 not helping. not helping at
> all.
>>>  I changed to 3.14 and gap is quite big !
>>>  With pgbench RW test, 3.2 --> 4200 TPS ; 3.14 --> 6900 TPS in
> same
>>>  conditions
>>>  With pgbench RO test, 3.2 --> 37000 TPS ; 3.14 --> 95000 TPS,
> same
>>>  conditions too.
>>
>>  That's a start then.
>>
>>>  It should so be better, but when server was in production, and ever
> with
>>>  bad kernel, performances was already quite good before they quickly
>>>  decreased.
>>>  So i think too I have another configuration problem.
>>>
>>>  You say you're IO bound, so some output from sar / iostat / dstat
> and
>>>  pg_stat_activity etc before and during the issue would be of use.
>>>
>>>  -> My server is not in production right now, so it is difficult to
>>>  replay production load and have some useful metrics.
>>>  The best way I've found is to replay trafic from logs with
> pgreplay.
>>>  I hoped that the server falls back by replaying this traffic, but it
>>>  never happens ... Another thing I can't understand ...
>>>
>>>  Below is my dstat output when I replay this traffic (and so when server
>>>  runs normally)
>>>  I have unfortunately no more outputs when server's performances
> decreased.
>>  It's a shame we can't get any insight into activity on the server
> during the issues.
>>>
>>>  Other things you asked
>>>
>>>        System memory size : 256 Go
>>>        SSD Model numbers and how many : 4 SSd disks ; RAID 10 ; model
>>>  INTEL SSDSC2BB480G4
>>>        Raid controller : MegaRAID SAS 2208
>>>        Partition alignments and stripe sizes : see fdisk delow
>>>        Kernel options : the config file is here :
>>>
> ftp://ftp.ovh.net/made-in-ovh/bzImage/3.14.43/config-3.14.43-xxxx-std-ipv6-64
>>>        Filesystem used and mount options : ext4, see mtab below
>>>        IO Scheduler : noop [deadline] cfq for my ssd raid volume
>>>        Postgresql version and configuration : 9.3.5
>>>
>>>  max_connections=1800
>>>  shared_buffers=8GB
>>>  temp_buffers=32MB
>>>  work_mem=100MB
>>>  maintenance_work_mem=12GB
>>>  bgwriter_lru_maxpages=200
>>>  effective_io_concurrency=4
>>>  wal_level=hot_standby
>>>  wal_sync_method=fdatasync
>>>  wal_writer_delay=2000ms
>>>  commit_delay=1000
>>>  checkpoint_segments=80
>>>  checkpoint_timeout=15min
>>>  checkpoint_completion_target=0.7
>>>  archive_command='rsync ....'
>>>  max_wal_senders=10
>>>  wal_keep_segments=38600
>>>  vacuum_defer_cleanup_age=100
>>>  hot_standby = on
>>>  max_standby_archive_delay = 5min
>>>  max_standby_streaming_delay = 5min
>>>  hot_standby_feedback = on
>>>  random_page_cost = 1.0
>>>  effective_cache_size = 240GB
>>>  log_min_error_statement = warning
>>>  log_min_duration_statement = 0
>>>  log_checkpoints = on
>>>  log_connections = on
>>>  log_disconnections = on
>>>  log_line_prefix = '%m|%u|%d|%c|'
>>>  log_lock_waits = on
>>>  log_statement = 'all'
>>>  log_timezone = 'localtime'
>>>  track_activities = on
>>>  track_functions = pl
>>>  track_activity_query_size = 8192
>>>  autovacuum_max_workers = 5
>>>  autovacuum_naptime = 30s
>>>  autovacuum_vacuum_threshold = 40
>>>  autovacuum_analyze_threshold = 20
>>>  autovacuum_vacuum_scale_factor = 0.10
>>>  autovacuum_analyze_scale_factor = 0.10
>>>  autovacuum_vacuum_cost_delay = 5ms
>>>  default_transaction_isolation = 'read committed'
>>>  max_locks_per_transaction = 128
>>>
>>>
>>>
>>>        Connection pool sizing (pgpool2)
>>>  num_init_children = 1790
>>>  max_pool = 1
>>
>>  1800 is quite a lot of connections, and with max_pool=1 in pgpool
> you're effectively just using pgpool as a proxy (as I recall, my memory is a
> little fuzzy on pgpool now).  Unless your app is stateful in some way or has
> unique users for each of those 1800 connections you should lower the quantity of
> active connections.  A general starting point is usually cpu cores * 2, so you
> could up max_pool and divide num_init_children by the same amount.
>>

>>  Hard to say what you need to do without knowing what exactly you're

>> doing though.  What's the nature of the app(s)?
>
> Yes, we just use it as a proxy for now.
> We have approximately 100 different active users, doing for all of then
> various number of connexions (twisted + zope apps)
> result is ~ 900 idle connexions for ~ 60 active connexions, but
> sometimes (when stopping/starting prod), we need almost double of
> connexion because some twisted services don't stop their connexions
> immediatly.
> But this is the actual (working) configuration, and I don't think think
> my performance disk is related to this.

I think at this point you could do with going back and trying to reproduce the issue, then trace back up to
pg_stat_activityto see what activity could be causing the disk i/o.  I assume you've tried to reproduce the disk issues
witha simple disk benchmark like bonnie++? 

pgsql-admin by date:

Previous
From: Glyn Astill
Date:
Subject: Re: Performances issues with SSD volume ?
Next
From: Thomas SIMON
Date:
Subject: Re: Performances issues with SSD volume ?