Re: PG performance issues related to storage I/O waits - Mailing list pgsql-performance
From: Tasos Petalas
Subject: Re: PG performance issues related to storage I/O waits
Msg-id: CAJtGb8rCU1jgsHF0yJyj+_904YWfZvpt+kSc71Qd6VgWyQTwNQ@mail.gmail.com
In response to: Re: PG performance issues related to storage I/O waits (Tomas Vondra <tv@fuzzy.cz>)
List: pgsql-performance
On Mon, Aug 5, 2013 at 11:28 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
Hi,
On 5.8.2013 17:55, Tasos Petalas wrote:
>
> Seems most of the I/O is caused by SELECT backend processes (READ),
> whereas (WRITE) requests of wal writer and checkpointer processes do
> not appear as top IO processes (correct me if I am wrong)
>
> E.g. check the following heavy write process that reports 0% I/O ...!
>
> 14:09:40 769 be/4 enterpri 0.00 B/s 33.65 M/s 0.00 % 0.00 %
> postgres: wal writer process
That's because the WAL writer does sequential I/O (writes), which is a
perfect match for SAS drives. OTOH the queries do a lot of random reads,
which is a terrible match for spinners.
> That however still doesn't say which processes are responsible
> for that.
> Is that background writer, backends running queries or what? The
> iotop
> should give you answer to this (or at least a hint).
>
>
> It seems most of I/O reported from backends running heavy concurrent
> select queries (See iotop attachment in previous email)

Yes, that seems to be the case.
> Also, how are these volumes defined? Do they use distinct sets
> of disks?
> How many disks are used for each volume?
>
>
> These are LUNs from SAN (we have dedicated 16 SAS 2,5'' disks in RAID-10
> topology in Storage)
I do understand these are LUNs from the SAN. I was asking whether there
are separate sets of disks for the data directory (which you mentioned
to be RAID-10) and pg_archives (which you mentioned to be RAID-5).
Although I doubt it'd be possible to use the same disk for two LUNs.
Sorry, I didn't get your question right. Yes, there are different disk sets for RAID-10 (data) and RAID-5 (WAL archives).
> > Yes we are using 15K SAS disks in RAID 10. (253-2 dev refers to sar
> > output for disks)
>
> OK, so the pg_archives is probably for xlog archive, right?
>
> NO.
> /pg_archives is the target mount_point where we copy archive_logs to
> (archive_command = 'test ! -f /pg_archives/%f && cp %p /pg_archives/%f')
... which is exactly what WAL archive is. That's why the GUC is called
archive_command.
Again, I misunderstood your question. I wrongly thought you were asking about a separate LUN for WAL (pg_xlog on a separate device, not the WAL archives).
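For completeness, the archiving-related settings can be double-checked straight from psql (these are just the standard SHOW commands, nothing EDB-specific assumed; archive_command is the same 'cp ... /pg_archives/...' command quoted above):

  -- how WAL archiving is configured on this cluster
  SHOW wal_level;
  SHOW archive_mode;
  SHOW archive_command;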
> I've checked the conf, and I think you should really consider
> increasing
> checkpoint_segments - it's set to 3 (= 64MB) but I think
> something like
> 32 (=512MB) or even more would be more appropriate.
>
> We use EDB dynatune. Actual setting can be found in file
> (Ticket.Usual.Info.27.07.13.txt) of initial e-mail --> check show all;
> section
> Current checkpoint_segments is set to 64
OK, I'm not familiar with dynatune, and I got confused by the
postgresql.conf that you sent. 64 seems fine to me.
Understood. EDB dynatune is a feature that ships with the EDB Postgres distribution and is supposed to take care of most postgresql.conf parameters automatically, adjusting them at run time (you can always override them).
The "show all" command at the psql prompt gives the actual values at any given time.
> I see you've enabled log_checkpoints - can you check your logs
> how often
> the checkpoints happen?
>
>
> This is the output of the checkpoints during peak hours (avg. every 2-5
> minutes)
>
> 2013-08-02 14:00:20 UTC [767]: [19752]: [0]LOG: checkpoint complete:
> wrote 55926 buffers (5.3%); 0 transaction log file(s) added, 0 removed,
> 41 recycled; write=220.619 s, sync=5.443 s, total=226.152 s; sync files=220, longest=1.433 s, average=0.024 s
> 2013-08-02 14:05:14 UTC [767]: [19754]: [0]LOG: checkpoint complete:
> wrote 109628 buffers (10.5%); 0 transaction log file(s) added, 0 removed,
> 31 recycled; write=209.714 s, sync=9.513 s, total=219.252 s; sync files=222, longest=3.472 s, average=0.042 s
Meh, seems OK to me. This was based on the incorrect number of
checkpoint segments ...
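As a rough sanity check on those numbers (assuming the default 8 kB block size), the second checkpoint above wrote:

  -- 109628 buffers of 8 kB flushed over a ~219 s checkpoint
  SELECT 109628 * 8192 / (1024.0 * 1024)           AS mb_written,   -- ~857 MB
         109628 * 8192 / (1024.0 * 1024) / 219.252 AS mb_per_sec;   -- ~3.9 MB/s

That is a fairly gentle, mostly sequential write rate, consistent with the checkpoints not being the problem.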
>
>
> Also, can you check pg_stat_bgwriter view? I'd bet the value in
> checkpoints_timed is very low, compared to checkpoints_req. Or even
> better, get the values from this view before / after running the
> batch jobs.
>
> Results during load:
> checkpoints_timed : 12432 , checkpoints_req = 3058

Again, seems fine.
Updated values from pg_stat_bgwriter after the batch activity (off-peak):
checkpoints_timed : 12580, checkpoints_req : 3070
I don't see any significant difference here.
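For the record, the counters above come from a plain query like the one below (a sketch; the idea is to snapshot it right before and right after the batch window and diff the two samples):

  -- snapshot of background writer / checkpoint counters
  SELECT now() AS sampled_at,
         checkpoints_timed,
         checkpoints_req,
         buffers_checkpoint,
         buffers_clean,
         buffers_backend
  FROM pg_stat_bgwriter;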
> In the afternoon it's a different story - for 253-2 it looks
> like this:
>
> DEV tps rd_sec/s wr_sec/s await %util
> 15:50:01 dev253-2 4742.91 33828.98 29156.17 84.84 105.14
> 16:00:01 dev253-2 2781.05 12737.41 18878.52 19.24 80.53
> 16:10:01 dev253-2 3661.51 20950.64 23758.96 36.86 99.03
> 16:20:01 dev253-2 5011.45 32454.33 31895.05 72.75 102.38
> 16:30:01 dev253-2 2638.08 14661.23 17853.16 25.24 75.64
> 16:40:01 dev253-2 1988.95 5764.73 14190.12 45.05 58.80
> 16:50:01 dev253-2 2185.15 88296.73 11806.38 7.46 84.37
> 17:00:01 dev253-2 2031.19 12835.56 12997.34 8.90 82.62
> 17:10:01 dev253-2 4009.24 34288.71 23974.92 38.07 103.01
> 17:20:01 dev253-2 3605.86 26107.83 22457.41 45.76 90.90
> 17:30:01 dev253-2 2550.47 7496.85 18267.07 19.10 65.87
>
>
> This is when the actual problem arises

Well, then I think it's mostly about the SELECT queries.
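If it helps, one way to pin down which SELECTs generate most of the read I/O is pg_stat_statements (this assumes the extension can be installed on your EDB build; column names are those of the 9.x version of the module):

  -- top statements by blocks actually read from disk (not served from shared_buffers)
  SELECT calls,
         total_time,
         shared_blks_read,
         query
  FROM pg_stat_statements
  ORDER BY shared_blks_read DESC
  LIMIT 10;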
> What I think you could/should do:
> * move pg_xlog to a separate device (not simply a volume on the SAN,
>   sharing disks with the other volumes - that won't give you anything)
>
> Unfortunately we cannot do so at the moment (all available SAN
> resources are assigned to the pg_data directory of the server)

OK, understood. It's difficult to predict the gain and given the iotop
output it might even cause harm.
>
> I'd expect these changes to improve the afternoon peak, as it's
> doing
> about 50% writes. However I would not expect this to improve the
> morning
> peak, because that's doing a lot of reads (not writes).
>
> Afternoon peak is what we need to troubleshoot (will check if we can
> assign pg_xlog to a different LUN - not an option currently)
>
> Will SSD improve write performance? We are thinking of moving towards
> this direction.

It'll certainly improve the random I/O in general, which is the main
issue with SELECT queries. Sequential read/write improvement probably
won't be that significant.
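Related to that, the buffer cache hit ratio is a quick way to gauge how much of the SELECT traffic actually falls through to (random) reads on the SAN; a sketch, using only the standard statistics views:

  -- fraction of block requests served from shared_buffers since stats were last reset
  SELECT sum(blks_hit)::float / NULLIF(sum(blks_hit) + sum(blks_read), 0) AS cache_hit_ratio
  FROM pg_stat_database;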
Tomas
--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance