Thread: PG performance issues related to storage I/O waits
Hello team,
We have serious performance issues with a production EDB 9.2AS installation.
The issue is mostly related to storage I/O bottlenecks during peak hours, and we are looking for tunables at any level that could reduce the I/O spikes on the SAN and improve overall DB performance.
Our storage array consists of 16 disks in a RAID-10 topology (device 253-2 in our OS configuration). We are also using RAID-5 for archive_log storage (also presented by the SAN to the same machine - device 253-3).
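(In case it helps, the 253-x numbers are device-mapper minors; they can be mapped back to the LVM volumes with something like the commands below - the volume names in the comment are made up, ours differ:)

    # device-mapper name <-> (major:minor) mapping, e.g. vg_pg-lv_pgdata (253:2)
    dmsetup ls
    lsblk -o NAME,MAJ:MIN,SIZE,TYPE,MOUNTPOINT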
We have set synchronous_commit to off, but since almost all of the application's queries use prepared statements we don't see any real benefit.
We are using VMware, VMFS and LVM, so we need your feedback on any kind of tunable that could remove load from storage during peak hours. (FYI, application peak hours are 13:00-23:00 UTC; during the night (04:00-06:00 UTC) there is some heavy reporting activity plus backups.)
Archive logs are rsync-ed to a remote backup server every 20 minutes.
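(The sync itself is a plain cron job on the DB host, roughly like the line below - the backup host name and paths are placeholders, not our real ones:)

    # crontab of the postgres OS user, runs every 20 minutes
    */20 * * * * rsync -a /pg_archives/ backuphost:/pg_archives/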
Please also advise on any postgresql.conf modification that could significantly affect storage load (WAL/checkpoint configuration etc.). (We have not tried to move pg_xlog to a separate LUN, since that is not an option - any other LUN would use the same storage pool as the rest of the /pgdata files.)
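(For reference, these are the WAL/checkpoint settings we can report right away - just a sketch of the query we would run:)

    psql -c "SELECT name, setting, unit
               FROM pg_settings
              WHERE name IN ('checkpoint_segments', 'checkpoint_timeout',
                             'checkpoint_completion_target', 'wal_buffers',
                             'synchronous_commit');"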
We had some issues in the past with the autovacuum daemon failing to work efficiently under high load, so we have already applied your instructions for a more aggressive autovacuum policy (the changes are already applied in postgresql.conf).
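(If useful, we can also pull per-table autovacuum activity with something like this - sketch only:)

    # tables with the most dead tuples, plus when autovacuum last touched them
    psql -c "SELECT relname, n_dead_tup, last_autovacuum, autovacuum_count
               FROM pg_stat_user_tables
              ORDER BY n_dead_tup DESC
              LIMIT 20;"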
Let me know if you want me to attach all the usual ticket info (OS, disks, PG conf, etc.) plus the sar output and server logs from the last 3 days (24, 25, 26 June).
Thanks,
Tasos
On 31 July 2013, 13:58, Tasos Petalas wrote:
> Hello team,
>
> We have serious performance issues with a production EDB 9.2AS
> installation.

Are you sure you've sent this to the right list? Because this is a
community mailing list for PostgreSQL, not for EDB. If you have a support
contract with EDB, it's probably better to ask them directly (e.g. they
might give you advice about custom features not available in vanilla
PostgreSQL).

> The issue is mostly related to storage I/O bottlenecks during peak hours
> and we are looking for tunables on any level that could reduce the I/O
> spikes on SAN and improve overall DB performance.

So is that a dedicated DWH machine, and is PostgreSQL responsible for most
of the I/O load? Which processes are doing that - backends handling
queries, or some background process (say, checkpoints)? Is it random or
sequential I/O, reads or writes, ...? How much I/O are we talking about?
Could it be that the SAN is overloaded by someone else (in case it's not
dedicated to the database)?

It might turn out that the most effective solution is tuning the queries
that are responsible for the I/O activity.

> Our storage array consists of 16 disks in RAID-10 topology (device 253-2
> on our OS configuration). We are also using RAID-5 for archive_log
> storage (also presented by SAN to the same machine - device 253-3).

I have no clue what device 253-3 is, but I assume you're using SAS disks.

> We have set synchronous_commit to off but since almost all of application
> queries are using prepared statements we don't get any real benefit.

Ummmm, how is this related? AFAIK those are rather orthogonal features,
i.e. prepared statements should benefit from synchronous_commit=off just
like any other queries.

> We are using VMware, VMFS and LVM so we need your feedback on any kind of
> tunable that could remove load from storage during peak hours (FYI
> application peak hours are 13:00-23:00 UTC, during night (04:00-06:00
> UTC) there is some heavy reporting activity + backups).
> Archive logs are rsync-ed to a remote backup server every 20 minutes.
>
> Also please advise on any postgresql.conf modification that could
> significantly affect storage load (WAL/checkpoint configuration etc.)
> (we have not tried to move pg_xlog to a separate LUN since this is not
> an option - any other LUN would be using the same storage pool as the
> rest of the /pgdata files).
> We had some issues in the past with the autovacuum daemon failing to
> work efficiently under high load so we have already applied your
> instructions for a more aggressive autovacuum policy (changes already
> applied in postgresql.conf).
>
> Let me know if you want me to attach all the usual ticket info (OS,
> disks, PG conf, etc.) plus the sar output and server logs from the last
> 3 days (24, 25, 26 June).

Well, we can't really help you unless you give us this, so yes - attach
this info. And please try to identify what is actually causing most I/O
activity (e.g. using "iotop").

Tomas
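PS: A sketch of the kind of iotop capture I mean (flags can differ a bit
between iotop versions, so double-check on your box):

    # -o = only processes actually doing I/O, -b = batch mode, -t = timestamps;
    # one sample every 5 seconds for ~10 minutes during the peak window
    iotop -o -b -t -d 5 -n 120 > iotop_peak.log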
Hi Tasos,

(2013/07/31 20:58), Tasos Petalas wrote:
> We have set synchronous_commit to off but since almost all of application
> queries are using prepared statements we don't get any real benefit.
> Also please advise on any postgresql.conf modification that could
> significantly affect storage load (WAL-checkpoint configuration etc.)

Please send us your postgresql.conf and details of your server (memory,
RAID cache size and OS version).

I suspect your server has a lot of memory. If so, you should set a small
dirty_background_ratio or dirty_background_bytes; with a big page cache,
PostgreSQL's checkpoints otherwise trigger very bursty disk writes.

By the way, I am working on a new checkpoint scheduler and write method at
the moment. If you could send more detailed information, such as response
times or load average, it would help me improve my patch.

Best regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
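P.S. The kernel knobs I mean are the vm.dirty_* sysctls; the values below
are only an example, please tune them to your RAM and RAID cache size:

    # current values
    sysctl vm.dirty_background_ratio vm.dirty_ratio
    # start background writeback much earlier on a big-memory box
    # (dirty_background_bytes overrides dirty_background_ratio when non-zero)
    sysctl -w vm.dirty_background_bytes=268435456    # 256MB
    sysctl -w vm.dirty_bytes=1073741824              # 1GB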
Hi,

On 1.8.2013 21:47, Tasos Petalas wrote:
> I attach postgresql.conf, sar output and server/os/disk info.
>
> Will update you with iotop outputs shortly.

So, have you found anything interesting using iotop?

> On Wed, Jul 31, 2013 at 3:40 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
>
>> Are you sure you've sent this to the right list? Because this is a
>> community mailing list for PostgreSQL, not for EDB. If you have a
>> support contract with EDB, it's probably better to ask them directly
>> (e.g. they might give you advice about custom features not available
>> in vanilla PostgreSQL).
>
> We do have a support contract with EDB, but it has not provided any
> valuable feedback so far. As a person coming from the community world
> I believe I can get answers here.

You can certainly get help here, although probably only for plain
PostgreSQL, not for the features available only in EDB.

>> So is that a dedicated DWH machine, and is PostgreSQL responsible for
>> most of the I/O load? Which processes are doing that - backends
>> handling queries or some background processes (say, checkpoints)?
>> Is it random or sequential I/O, reads or writes, ...?
>
> The system is a heavy OLTP system. Bottlenecks are mostly related to
> I/O writes (many small chunks). The machine is only hosting postgres.

That however still doesn't say which processes are responsible for that.
Is it the background writer, backends running queries, or something else?
The iotop output should give you the answer (or at least a hint).

>> How much I/O are we talking about? Could it be that the SAN is
>> overloaded by someone else (in case it's not dedicated to the
>> database)?
>
> Check the sar results (pg_data resides on dev 253-2 (RAID-10),
> pg_archives on 253-3 (RAID-5)).

What is pg_archives? Also, RAID-5 is generally a poor choice for
write-intensive workloads.

Also, how are these volumes defined? Do they use distinct sets of disks?
How many disks are used for each volume?

>> It might turn out that the most effective solution is tuning the
>> queries that are responsible for the I/O activity.
>
> The SAN is almost dedicated to the host running postgres.

And? How does that contradict my suggestion to tune the queries? Every
piece of hardware has its bottlenecks, and there is always a query that
can hit them. In those cases the best solution often is tuning the query.

>> I have no clue what device 253-3 is, but I assume you're using SAS
>> disks.
>
> Yes, we are using 15K SAS disks in RAID 10. (The 253-2 dev refers to the
> sar output for the disks.)

OK, so the pg_archives is probably for the xlog archive, right?

>> Ummmm, how is this related? AFAIK those are rather orthogonal features,
>> i.e. prepared statements should benefit from synchronous_commit=off
>> just like any other queries.
>
> It is not prepared statements. It is distributed transactions (queries
> inside a PREPARE TRANSACTION block).
>
> "Certain utility commands, for instance DROP TABLE, are forced to commit
> synchronously regardless of the setting of synchronous_commit. This is
> to ensure consistency between the server's file system and the logical
> state of the database. The commands supporting two-phase commit, such as
> PREPARE TRANSACTION, are also always synchronous."
>
> Taken from http://www.postgresql.org/docs/9.2/static/wal-async-commit.html

Well, then please use the proper term - prepared transactions and prepared
statements are two very different things. But yes, if you're using
prepared transactions (for 2PC), then synchronous_commit=off is not going
to improve anything, as those commits simply have to be synchronous.

>>> Let me know if you want me to attach all the usual ticket info (OS,
>>> disks, PG conf, etc.) plus the sar output and server logs from the
>>> last 3 days (24, 25, 26 June).
>>
>> Well, we can't really help you unless you give us this, so yes - attach
>> this info. And please try to identify what is actually causing most I/O
>> activity (e.g. using "iotop").
>
> SAR outputs, postgresql.conf, and other os/system h/w info attached.

I've checked the conf, and I think you should really consider increasing
checkpoint_segments - it's set to 3 (= 64MB) but I think something like
32 (= 512MB) or even more would be more appropriate.

I see you've enabled log_checkpoints - can you check your logs to see how
often the checkpoints happen?

Also, can you check the pg_stat_bgwriter view? I'd bet the value in
checkpoints_timed is very low compared to checkpoints_req. Or even better,
get the values from this view before / after running the batch jobs.
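A minimal way to grab those numbers before and after the batch window
(just a sketch - adjust the database/user you connect as):

    # snapshot pg_stat_bgwriter before and after the batch jobs, then compare
    psql -At -c "SELECT now(), checkpoints_timed, checkpoints_req,
                        buffers_checkpoint, buffers_backend
                   FROM pg_stat_bgwriter;" >> bgwriter_snapshots.log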
Anyway, I don't think writes are the main problem here - see this:

             DEV        tps   rd_sec/s   wr_sec/s  await  %util
04:00:01  dev253-2   176.56     134.07    1378.08   3.70   3.53
04:10:01  dev253-2   895.22   11503.99    6735.08  16.63   8.24
04:20:01  dev253-2   455.35   25523.80    1362.37   2.38  16.81
04:30:01  dev253-2   967.29   95471.88    4193.50   6.70  37.44
04:40:01  dev253-2   643.31   80754.86    2456.40   3.35  29.70
04:50:01  dev253-2   526.35   84990.05    1323.28   2.07  29.41
05:00:01  dev253-2   652.68   73192.18    1297.20   1.89  28.51
05:10:01  dev253-2  1256.31   34786.32    5840.08   9.25  53.08
05:20:01  dev253-2   549.84   14530.45    3522.85   8.12   9.89
05:30:01  dev253-2  1363.27  170743.78    5490.38   7.53  59.75
05:40:01  dev253-2   978.88  180199.97    1796.90   2.54  74.08
05:50:01  dev253-2  1690.10  166467.91    8013.10  35.45  66.32
06:00:01  dev253-2  2441.94  111316.65   15245.05  34.90  41.78

It's a slightly modified sar output for the main data directory (on
253-2). It clearly shows you're doing ~50MB/s of (random) reads compared
to less than 5MB/s of writes (assuming a sector is 512B).

There's almost no activity on the pg_archives (253-3) device:

00:00:01     DEV        tps  rd_sec/s  wr_sec/s  await  %util
04:00:01  dev253-3    20.88      0.01    167.00  14.10   0.02
04:10:01  dev253-3   211.06      0.05   1688.43  43.61   0.12
04:20:01  dev253-3    14.30      0.00    114.40   9.95   0.01
04:30:01  dev253-3   112.78      0.45    901.75  17.81   0.06
04:40:01  dev253-3    14.11      0.00    112.92  10.66   0.01
04:50:01  dev253-3     7.39     56.94     56.85  10.91   0.04
05:00:01  dev253-3    14.21      0.00    113.67  10.92   0.01
05:10:01  dev253-3     7.05      0.26     56.15  17.03   0.02
05:20:01  dev253-3    28.38     18.20    208.87   8.68   0.29
05:30:01  dev253-3    41.71      0.03    333.63  14.70   0.03
05:40:01  dev253-3     6.95      0.00     55.62  10.39   0.00
05:50:01  dev253-3   105.36    386.44    830.83   9.62   0.19
06:00:01  dev253-3    13.92      0.01    111.34  10.41   0.01

In the afternoon it's a different story - for 253-2 it looks like this:

             DEV        tps  rd_sec/s  wr_sec/s  await   %util
15:50:01  dev253-2  4742.91  33828.98  29156.17  84.84  105.14
16:00:01  dev253-2  2781.05  12737.41  18878.52  19.24   80.53
16:10:01  dev253-2  3661.51  20950.64  23758.96  36.86   99.03
16:20:01  dev253-2  5011.45  32454.33  31895.05  72.75  102.38
16:30:01  dev253-2  2638.08  14661.23  17853.16  25.24   75.64
16:40:01  dev253-2  1988.95   5764.73  14190.12  45.05   58.80
16:50:01  dev253-2  2185.15  88296.73  11806.38   7.46   84.37
17:00:01  dev253-2  2031.19  12835.56  12997.34   8.90   82.62
17:10:01  dev253-2  4009.24  34288.71  23974.92  38.07  103.01
17:20:01  dev253-2  3605.86  26107.83  22457.41  45.76   90.90
17:30:01  dev253-2  2550.47   7496.85  18267.07  19.10   65.87

So this is about 50:50 reads and writes, and it's also the time with some
measurable activity on the 253-3 device:

00:00:01     DEV        tps  rd_sec/s  wr_sec/s  await  %util
15:50:01  dev253-3  1700.97   9739.48  13249.75  22.63   8.53
16:00:01  dev253-3   807.44    512.95   6439.17  15.21   0.82
16:10:01  dev253-3  1236.72     22.92   9892.26  28.74   0.95
16:20:01  dev253-3  1709.15      0.52  13672.70  40.89   1.69
16:30:01  dev253-3   919.26   8217.60   7051.60  20.40  11.74
16:40:01  dev253-3   601.66      0.37   4812.94  18.99   0.39
16:50:01  dev253-3   476.40      0.42   3810.95  10.02   0.28
17:00:01  dev253-3   636.03      0.15   5088.08  11.01   0.35
17:10:01  dev253-3  1259.55    165.64  10069.65  15.18   1.01
17:20:01  dev253-3  1194.10      0.29   9552.49  26.11   0.94
17:30:01  dev253-3   785.40   2000.52   6201.21  33.12   3.06

Still, this is pretty low write activity, and it's sequential.

What I think you could/should do:

* increase checkpoint_segments to 64 (or something)
* move pg_xlog to a separate device (not simply a volume on the SAN,
  sharing disks with the other volumes - that won't give you anything)

I'd expect these changes to improve the afternoon peak, as it's doing
about 50% writes. However, I would not expect this to improve the morning
peak, because that's doing a lot of reads (not writes).

The trick here is that your database is ~300GB - you're doing a lot of
seeks across the whole database, and that's way more than the RAM, so the
queries have to actually hit the disks. 15k disks can do ~200 IOPS each,
and putting them into a RAID-10 array generally gives you

    (n/2) * (IOPS of a single disk)

I don't know exactly how your volumes are defined on your SAN, but
assuming you have RAID-10 on 12 drives, that gives you ~1200 IOPS. This is
the bottleneck you're hitting, and there's not much you can do about it,
except:

* getting better storage (giving you more seeks) - say SSD-based storage
* improving the application/queries/schema to minimize the seeks

And so on.

Tomas
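PS: The arithmetic, applied to the 16-disk array you mentioned (very
rough, assuming ~200 random IOPS per 15k SAS drive):

    # RAID-10 random-IOPS ceiling is roughly (number of disks / 2) * per-disk IOPS
    echo $(( 16 / 2 * 200 ))    # ~1600 IOPS for a 16-disk array
    echo $(( 12 / 2 * 200 ))    # ~1200 IOPS for the 12-disk example above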
Hi,

On 5.8.2013 17:55, Tasos Petalas wrote:
> Seems most of the I/O is caused by SELECT backend processes (READ),
> whereas (WRITE) requests of the wal writer and checkpointer processes do
> not appear as top I/O processes (correct me if I am wrong).
>
> E.g. check the following heavy write process that reports 0% I/O ...!
>
> 14:09:40   769 be/4 enterpri  0.00 B/s  33.65 M/s  0.00 %  0.00 %
>            postgres: wal writer process

That's because the WAL writer does sequential I/O (writes), which is a
perfect match for SAS drives. OTOH the queries do a lot of random reads,
which is a terrible match for spinners.

>> That however still doesn't say which processes are responsible for
>> that. Is it the background writer, backends running queries, or
>> something else? The iotop output should give you the answer (or at
>> least a hint).
>
> It seems most of the I/O is reported from backends running heavy
> concurrent select queries (see the iotop attachment in the previous
> email).

Yes, that seems to be the case.

>> Also, how are these volumes defined? Do they use distinct sets of
>> disks? How many disks are used for each volume?
>
> These are LUNs from the SAN (we have 16 dedicated 2.5'' SAS disks in a
> RAID-10 topology in the storage).

I do understand these are LUNs from the SAN. I was asking whether there
are separate sets of disks for the data directory (which you mentioned to
be RAID-10) and pg_archives (which you mentioned to be RAID-5). Although I
doubt it'd be possible to use the same disk for two LUNs.

>>> Yes, we are using 15K SAS disks in RAID 10. (The 253-2 dev refers to
>>> the sar output for the disks.)
>>
>> OK, so the pg_archives is probably for the xlog archive, right?
>
> No. /pg_archives is the target mount point where we copy archive logs to
> (archive_command = 'test ! -f /pg_archives/%f && cp %p /pg_archives/%f').

... which is exactly what a WAL archive is. That's why the GUC is called
archive_command.

>> I've checked the conf, and I think you should really consider
>> increasing checkpoint_segments - it's set to 3 (= 64MB) but I think
>> something like 32 (= 512MB) or even more would be more appropriate.
>
> We use EDB dynatune. The actual settings can be found in the file
> (Ticket.Usual.Info.27.07.13.txt) of the initial e-mail --> check the
> "show all;" section.
> Current checkpoint_segments is set to 64.

OK, I'm not familiar with dynatune, and I got confused by the
postgresql.conf that you sent. 64 seems fine to me.

>> I see you've enabled log_checkpoints - can you check your logs to see
>> how often the checkpoints happen?
>
> This is the output of the checkpoints during peak hours (avg. every 2-5
> minutes):
>
> 2013-08-02 14:00:20 UTC [767]: [19752]: [0]LOG: checkpoint complete:
> wrote 55926 buffers (5.3%); 0 transaction log file(s) added, 0 removed,
> 41 recycled; write=220.619 s, sync=5.443 s, total=226.152 s;
> sync files=220, longest=1.433 s, average=0.024 s
> 2013-08-02 14:05:14 UTC [767]: [19754]: [0]LOG: checkpoint complete:
> wrote 109628 buffers (10.5%); 0 transaction log file(s) added, 0
> removed, 31 recycled; write=209.714 s, sync=9.513 s, total=219.252 s;
> sync files=222, longest=3.472 s, average=0.042 s

Meh, seems OK to me. My earlier concern was based on the incorrect number
of checkpoint segments ...

>> Also, can you check the pg_stat_bgwriter view? I'd bet the value in
>> checkpoints_timed is very low, compared to checkpoints_req. Or even
>> better, get the values from this view before / after running the
>> batch jobs.
>
> Results during load:
> checkpoints_timed : 12432 , checkpoints_req = 3058
> checkpoints_timed : 12580 , checkpoints_req = 3070

Again, seems fine.
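FWIW, a quick way to keep an eye on checkpoint frequency over a whole day
is to count the "checkpoint complete" lines per hour - just a sketch, use
your actual log directory and file name:

    grep "checkpoint complete" /pgdata/pg_log/postgresql-2013-08-02.log \
        | cut -c1-13 | sort | uniq -c    # completions per hour of the day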
>> In the afternoon it's a different story - for 253-2 it looks like this:
>>
>>              DEV        tps  rd_sec/s  wr_sec/s  await   %util
>> 15:50:01  dev253-2  4742.91  33828.98  29156.17  84.84  105.14
>> 16:00:01  dev253-2  2781.05  12737.41  18878.52  19.24   80.53
>> 16:10:01  dev253-2  3661.51  20950.64  23758.96  36.86   99.03
>> 16:20:01  dev253-2  5011.45  32454.33  31895.05  72.75  102.38
>> 16:30:01  dev253-2  2638.08  14661.23  17853.16  25.24   75.64
>> 16:40:01  dev253-2  1988.95   5764.73  14190.12  45.05   58.80
>> 16:50:01  dev253-2  2185.15  88296.73  11806.38   7.46   84.37
>> 17:00:01  dev253-2  2031.19  12835.56  12997.34   8.90   82.62
>> 17:10:01  dev253-2  4009.24  34288.71  23974.92  38.07  103.01
>> 17:20:01  dev253-2  3605.86  26107.83  22457.41  45.76   90.90
>> 17:30:01  dev253-2  2550.47   7496.85  18267.07  19.10   65.87
>
> This is when the actual problem arises.

Well, then I think it's mostly about the SELECT queries.

>> What I think you could/should do:
>>
>> * move pg_xlog to a separate device (not simply a volume on the SAN,
>>   sharing disks with the other volumes - that won't give you anything)
>
> Unfortunately we cannot do so at the moment (all available SAN resources
> are assigned to the pg_data directory of the server).
>
>> I'd expect these changes to improve the afternoon peak, as it's doing
>> about 50% writes. However I would not expect this to improve the
>> morning peak, because that's doing a lot of reads (not writes).
>
> The afternoon peak is what we need to troubleshoot (will check if we can
> assign pg_xlog to a different LUN - not an option currently).

OK, understood. It's difficult to predict the gain, and given the iotop
output it might even cause harm.

> Will SSD improve write performance? We are thinking of moving towards
> this direction.

It'll certainly improve the random I/O in general, which is the main issue
with the SELECT queries. The sequential read/write improvement probably
won't be that significant.

Tomas