Re: upgrade from 9.2.x to 9.3 causes significant performance degradation - Mailing list pgsql-general

From Lonni J Friedman
Subject Re: upgrade from 9.2.x to 9.3 causes significant performance degradation
Msg-id CAP=oouH+aBQ=DSNr9NLf6VP05AjRb+GNomDY5nQHSaGy5q1qWA@mail.gmail.com
In response to Re: upgrade from 9.2.x to 9.3 causes significant performance degradation  (Jeff Janes <jeff.janes@gmail.com>)
Responses Re: upgrade from 9.2.x to 9.3 causes significant performance degradation  (Kevin Grittner <kgrittn@ymail.com>)
List pgsql-general
Thanks for your reply.  Comments/answers inline below....

On Tue, Sep 17, 2013 at 11:28 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Tue, Sep 17, 2013 at 11:22 AM, Lonni J Friedman <netllama@gmail.com>
> wrote:
>>
>>
>> > c) What does logs say?
>>
>> The postgres server logs look perfectly normal, minus a non-trivial
>> slower run time for most queries.  There's nothing unusual in any of
>> the OS level logs (/var/log/messages, etc) or dmesg.
>
>
> Are you generally CPU limited or IO limited?

top shows over 90% of the load is in sys space.  vmstat output seems
to suggest that it's CPU bound (or bouncing back & forth):
########
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
  1  0  17308 852016 141104 127074192    0    0    10    18    0    0  6  4 90  0  0
  0  0  17308 872316 141104 127074200    0    0     0   988  940  564  1  0 99  0  0
  0  0  17308 884288 141104 127074208    0    0     0  1921 1202 2132  1  0 99  0  0
  0  0  17308 898728 141104 127074208    0    0     0     0 1064  577  1  0 99  0  0
  2  0  17308 914920 141104 127074224    0    0     0    44  820  427  1  0 99  0  0
  0  0  17308 926524 141104 127074272    0    0     0    48 1173  585  1  0 99  0  0
108  1  17308 753648 141104 127074224    0    0     0   236  9825 3901 12  5 83  0  0
 50  0  17308 723156 141104 127074400    0    0     0   144 43481 9105 20 79  1  0  0
 45  0  17308 722860 141104 127074416    0    0     0     8 32969 1998  1 97  2  0  0
 47  0  17308 738996 141104 127074416    0    0     0     0 34099 1739  1 99  0  0  0
101  0  17308 770220 141104 127074480    0    0     0    32 38550 5998  7 93  0  0  0
101  0  17308 775732 141104 127074512    0    0     0   156 33889 5809  4 96  0  0  0
 99  0  17308 791232 141104 127074544    0    0     0     0 32385 4981  0 100 0  0  0
 96  0  17308 803156 141104 127074544    0    0     0    24 32413 4824  0 100 0  0  0
 87  0  17308 811624 141104 127074544    0    0     0     0 32438 4470  0 100 0  0  0
 83  0  17308 815500 141104 127074544    0    0     0     0 32489 4159  0 100 0  0  0
 80  0  17308 826572 141104 127074552    0    0     0    33 32582 3948  0 100 0  0  0
 73  0  17308 853264 141108 127074552    0    0     0    52 32833 3840  0 100 0  0  0
 73  0  17308 882240 141108 127074560    0    0     0     4 32820 3594  0 100 0  0  0
 72  0  17308 892256 141108 127074560    0    0     0     0 32368 3516  0 100 0  0  0
#######

iostat consistently shows %util under 1.00, which also suggests that
disk IO is not the bottleneck:
#########
iostat -dx /dev/sdb 5
Linux 2.6.32-358.6.2.el6.x86_64 (cuda-db7)     09/17/2013     _x86_64_    (32 CPU)

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.02     0.21    6.91   31.33   651.60  1121.85    46.38     0.09    2.25   0.08   0.31

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.00     0.00    0.00    8.00     0.00    93.00    11.62     0.00    0.28   0.20   0.16

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.00     0.00    0.00   11.00     0.00   125.40    11.40     0.00    0.16   0.16   0.18

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.00     0.00    0.00  105.00     0.00  3380.40    32.19     0.29    2.76   0.03   0.34

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.00     0.00    0.00   14.80     0.00  2430.60   164.23     0.00    0.12   0.09   0.14

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.00     1.20    0.00   41.60     0.00  1819.40    43.74     0.02    0.45   0.05   0.20

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.00     0.00    0.00    2.80     0.00    32.00    11.43     0.00    0.00   0.00   0.00
#########

mpstat also shows virtually zero iowait, with a ton of sys (CPU) time:
#########
mpstat 2 10
Linux 2.6.32-358.6.2.el6.x86_64 (cuda-db7)     09/17/2013     _x86_64_    (32 CPU)

12:53:19 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
12:53:21 PM  all    7.36    0.00   92.58    0.00    0.00    0.03    0.00    0.00    0.03
12:53:23 PM  all    6.35    0.00   90.43    0.00    0.00    0.03    0.00    0.00    3.19
12:53:25 PM  all    3.13    0.00   68.20    0.00    0.00    0.02    0.00    0.00   28.66
12:53:27 PM  all    6.07    0.00   68.46    0.00    0.00    0.03    0.00    0.00   25.44
12:53:29 PM  all    5.83    0.00   94.14    0.00    0.00    0.03    0.00    0.00    0.00
12:53:31 PM  all    5.75    0.00   94.14    0.00    0.00    0.11    0.00    0.00    0.00
12:53:33 PM  all    7.65    0.00   40.32    0.00    0.00    0.03    0.00    0.00   52.00
12:53:35 PM  all    3.78    0.00    0.94    0.00    0.00    0.03    0.00    0.00   95.25
12:53:37 PM  all   16.09    0.00   14.33    0.02    0.00    0.05    0.00    0.00   69.51
12:53:39 PM  all    5.60    0.00   10.38    0.00    0.00    0.05    0.00    0.00   83.97
Average:     all    6.76    0.00   57.41    0.00    0.00    0.04    0.00    0.00   35.79
#########

So unless I've misunderstood the output above, all of this suggests
that the CPU is the bottleneck.  If you'd like some other data, let me
know and I'll be happy to capture it.
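
One other thing I could grab, if it would help, is a profile of where
all of that sys time is actually going while the slowdown is happening.
Something along these lines is what I had in mind (assuming perf is
installed on that box, which I'd have to double check):
#########
# sample kernel + user stacks system-wide for 30 seconds while the load is high
perf record -a -g -- sleep 30

# then summarize which functions the time is being spent in
perf report --sort comm,dso,symbol > perf-report.txt
#########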


> Grab one of those slower queries and do an "explain (analyze, buffers)" of
> it.  Preferably with track_io_timing on.

Just to be clear, just about every query is slow, not just certain
queries.  The entire server is slow, and the only thing that has
changed is the version of postgres (granted, not a trivial change).
Anyway, I picked a random query, connected with psql and ran the
following:
########
nightly=# SET track_io_timing=on ;
SET

nightly=# EXPLAIN (analyze,buffers) SELECT count(*) FROM
pg_catalog.pg_class AS c WHERE c.oid =
pgpool_regclass('dblcufft_testlist') AND c.relpersistence = 'u';
                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=2779.71..2779.72 rows=1 width=0) (actual time=83.136..83.136 rows=1 loops=1)
   Buffers: shared hit=1633
   ->  Seq Scan on pg_class c  (cost=0.00..2779.70 rows=1 width=0) (actual time=83.131..83.131 rows=0 loops=1)
         Filter: ((relpersistence = 'u'::"char") AND (oid = pgpool_regclass('dblcufft_testlist'::cstring)))
         Rows Removed by Filter: 70393
         Buffers: shared hit=1633
 Total runtime: 83.181 ms
(7 rows)
########

I'm not quite sure that I understand how to interpret this output.  I
didn't have track_io_timing enabled by default.  Does it require a
restart to enable?  The official docs seem to suggest that there should
be a number of additional fields, which I'm not seeing.
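
For what it's worth, this is what I did to enable it for the session
above (it was accepted without a restart), plus what I was planning to
try for the whole cluster based on my reading of the docs.  I may well
be misreading them; my guess is that the extra I/O timing fields only
show up when the plan actually reads from disk, which this one didn't:
########
-- session-level (run as a superuser); this is what I did before the EXPLAIN above
SET track_io_timing = on;
SHOW track_io_timing;

-- cluster-wide: my (possibly wrong) reading of the docs is that adding
--     track_io_timing = on
-- to postgresql.conf and then reloading should be enough, no restart:
SELECT pg_reload_conf();
########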

>
> Do you still have test/dev/QA/etc copy running on 9.2 for comparison?

Not live.  I'd have to restore from a backup, and my test environment
is not the same class of hardware as production.  I could get that
process underway if you wanted.  What did you have in mind?
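
For reference, the rough plan on my end would be something like this.
The binary paths are the stock PGDG RPM locations, but the data
directory and dump filename below are just placeholders until I check
what the nightly backup job actually writes:
#########
# stand up a scratch 9.2 cluster on the test box (placeholder data directory)
/usr/pgsql-9.2/bin/initdb -D /data/pgtest-9.2
/usr/pgsql-9.2/bin/pg_ctl -D /data/pgtest-9.2 -l /tmp/pgtest-9.2.log start

# load the most recent dump (assuming it's a pg_dump custom-format archive;
# the filename is a placeholder)
/usr/pgsql-9.2/bin/createdb nightly
/usr/pgsql-9.2/bin/pg_restore -j 8 -d nightly /backups/nightly-latest.dump
#########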

