Re: PG performance issues related to storage I/O waits - Mailing list pgsql-performance

From Tasos Petalas
Subject Re: PG performance issues related to storage I/O waits
Date
Msg-id CAJtGb8rCU1jgsHF0yJyj+_904YWfZvpt+kSc71Qd6VgWyQTwNQ@mail.gmail.com
In response to Re: PG performance issues related to storage I/O waits  (Tomas Vondra <tv@fuzzy.cz>)
List pgsql-performance



On Mon, Aug 5, 2013 at 11:28 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
Hi,

On 5.8.2013 17:55, Tasos Petalas wrote:
>
>     Seems most of the I/O is caused by SELECT backend processes (READ),
>     whereas (WRITE) requests of the wal writer and checkpointer processes do
>     not appear as top I/O processes (correct me if I am wrong)
>
> E.g. check the following heavy-write process that reports 0% I/O ...!
>
>  14:09:40   769 be/4 enterpri    0.00 B/s   33.65 M/s  0.00 %  0.00 %
> postgres: wal writer process

That's because the WAL writer does sequential I/O (writes), which is a
perfect match for SAS drives.

OTOH the queries do a lot of random reads, which is a terrible match for
spinners.

>         That however still doesn't say which processes are responsible
>         for that.
>         Is that background writer, backends running queries or what? The
>         iotop
>         should give you answer to this (or at least a hint).
>
>
> It seems most of the I/O is reported by backends running heavy concurrent
> SELECT queries (see the iotop attachment in the previous email)

Yes, that seems to be the case.
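
As a cross-check, iotop can be limited to the processes that are actually
generating I/O (a small sketch; the sampling interval and the grep filter are
only suggestions, adjust to taste):

  # show only processes currently doing I/O, three batch samples 10 s apart
  iotop -o -b -n 3 -d 10 | grep -E "postgres|TID"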

>         Also, how are these volumes defined? Do they use distinct sets
>         of disks?
>         How many disks are used for each volume?
>
>
> These are LUNs from the SAN (we have 16 dedicated 2.5'' SAS disks in a
> RAID-10 topology in the storage array)

I do understand these are LUNs from the SAN. I was asking whether there
are separate sets of disks for the data directory (which you mentioned
to be RAID-10) and pg_archives (which you mentioned to be RAID-5).

Although I doubt it'd be possible to use the same disk for two LUNs.

Sorry, I didn't get your question right. Yes, there are separate disk sets for RAID-10 (data) and RAID-5 (WAL archives).

>         > Yes we are using 15K SAS disks in RAID 10. (253-2 dev refers
>         to sar
>         > output for disks)
>
>         OK, so the pg_archives is probably for xlog archive, right?
>
> NO.
> /pg_archives is the target mount point where we copy the archive logs to
> (archive_command = 'test ! -f /pg_archives/%f && cp %p /pg_archives/%f')

... which is exactly what WAL archive is. That's why the GUC is called
archive_command.

Again, I misunderstood your question. I wrongly thought you were asking about a separate LUN for WAL (i.e. pg_xlog on a separate device), not about the WAL archives.
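
For reference, the archiving setup can be double-checked straight from the
server (a quick sketch, assuming a local psql connection with sufficient
privileges):

  # confirm that /pg_archives really is the WAL archive target
  psql -c "SHOW archive_mode;"
  psql -c "SHOW archive_command;"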

>         I've checked the conf, and I think you should really consider
>         increasing
>         checkpoint_segments - it's set to 3 (= 64MB) but I think
>         something like
>         32 (=512MB) or even more would be more appropriate.
>
> We use EDB dynatune. The actual settings can be found in the file
> (Ticket.Usual.Info.27.07.13.txt) from the initial e-mail --> check the "show all;"
> section.
> Current checkpoint_segments is set to 64

OK, I'm not familiar with dynatune, and I got confused by the
postgresql.conf that you sent. 64 seems fine to me.

Understood. EDB dynatune is a feature that ships with the EDB Postgres versions and is supposed to take care of most of the PG configuration parameters (found in postgresql.conf) automatically, adjusting them at run time (you can always override them).

"Show all" command in psql promt gives you the actual values at any given time.


>         I see you've enabled log_checkpoints - can you check your logs
>         how often
>         the checkpoints happen?
>
>
> This is the output of the checkpoints during peak hours (avg. every 2-5
> minutes)
>
> 2013-08-02 14:00:20 UTC [767]: [19752]: [0]LOG:  checkpoint complete:
> wrote 55926 buffers (5.3%); 0 transaction log file(s) added, 0 removed,
> 41 recycled; write=220.619 s, sync=
> 5.443 s, total=226.152 s; sync files=220, longest=1.433 s, average=0.024 s
> 2013-08-02 14:05:14 UTC [767]: [19754]: [0]LOG:  checkpoint complete:
> wrote 109628 buffers (10.5%); 0 transaction log file(s) added, 0
> removed, 31 recycled; write=209.714 s, syn
> c=9.513 s, total=219.252 s; sync files=222, longest=3.472 s, average=0.042 s

Meh, seems OK to me. My earlier concern was based on the incorrect number of
checkpoint segments ...
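
For the record, once log_checkpoints is on, checkpoint frequency is easy to
eyeball straight from the server log (a rough sketch; the log path below is an
assumption, adjust it to your logging setup):

  # list the completion timestamps of all checkpoints to see how far apart they are
  grep -h "checkpoint complete" $PGDATA/pg_log/*.log | awk '{print $1, $2}'
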
>
>
>
>         Also, can you check pg_stat_bgwriter view? I'd bet the value in
>         checkpoints_timed is very low, compared to checkpoints_req. Or even
>         better, get the values from this view before / after running the
>         batch jobs.
>
> Results during load:
> checkpoints_timed : 12432 , checkpoints_req = 3058

Again, seems fine.

 
Updated values for pg_stat_bgwriter after batch activity (off-peak):
 checkpoints_timed : 12580, checkpoints_req : 3070

I don't see any significant difference here.
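
For anyone repeating this measurement, it helps to snapshot the counters once
right before and once right after the batch window and diff the two rows; the
extra buffer columns help separate checkpoint writes from backend writes. A
minimal sketch:

  # run once before and once after the batch jobs, then compare the two rows
  psql -c "SELECT now(),
                  checkpoints_timed, checkpoints_req,
                  buffers_checkpoint, buffers_clean, buffers_backend
             FROM pg_stat_bgwriter"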
 
>         In the afternoon it's a different story - for 253-2 it looks
>         like this:
>
>                           DEV       tps  rd_sec/s  wr_sec/s   await    %util
>         15:50:01     dev253-2   4742.91  33828.98  29156.17   84.84   105.14
>         16:00:01     dev253-2   2781.05  12737.41  18878.52   19.24    80.53
>         16:10:01     dev253-2   3661.51  20950.64  23758.96   36.86    99.03
>         16:20:01     dev253-2   5011.45  32454.33  31895.05   72.75   102.38
>         16:30:01     dev253-2   2638.08  14661.23  17853.16   25.24    75.64
>         16:40:01     dev253-2   1988.95   5764.73  14190.12   45.05    58.80
>         16:50:01     dev253-2   2185.15  88296.73  11806.38    7.46    84.37
>         17:00:01     dev253-2   2031.19  12835.56  12997.34    8.90    82.62
>         17:10:01     dev253-2   4009.24  34288.71  23974.92   38.07   103.01
>         17:20:01     dev253-2   3605.86  26107.83  22457.41   45.76    90.90
>         17:30:01     dev253-2   2550.47   7496.85  18267.07   19.10    65.87
>
>
> This is when the actual problem arises.

Well, then I think it's mostly about the SELECT queries.

>         What I think you could/should do:
>
>         * move pg_xlog to a separate device (not simply a volume on the SAN,
>           sharing disks with the other volumes - that won't give you
>         anything)
>
> Unfortunately we cannot do so at the moment (all available SAN
> resources are assigned to the pg_data directory of the server)
>
>         I'd expect these changes to improve the afternoon peak, as it's
>         doing
>         about 50% writes. However I would not expect this to improve the
>         morning
>         peak, because that's doing a lot of reads (not writes).
>
> Afternoon peak is what we need to troubleshoot (will check if we can
> assign pg_xlog to a different LUN - not an option currently)

OK, understood. It's difficult to predict the gain and given the iotop
output it might even cause harm.
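
For completeness, should a separate LUN become available later, the usual way
to relocate pg_xlog is a symlink out of the data directory (a rough sketch, to
be run with the cluster shut down; /new_lun/pg_xlog is a hypothetical mount
point):

  # with the server stopped, move the WAL directory to the new device and symlink it back
  mv $PGDATA/pg_xlog /new_lun/pg_xlog
  ln -s /new_lun/pg_xlog $PGDATA/pg_xlog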

>
> Will SSDs improve write performance? We are thinking of moving in
> this direction.

It'll certainly improve the random I/O in general, which is the main
issue with SELECT queries. Sequential read/write improvement probably
won't be that significant.

Tomas


