Thread: upgrade from 9.2.x to 9.3 causes significant performance degradation
Greetings,
I'm running a PostgreSQL 9.3.0 cluster (1 master with two streaming replication hot standby slaves) on RHEL6-x86_64. Yesterday I upgraded from 9.2.4 to 9.3.0, and since the upgrade I'm seeing a significant performance degradation. PostgreSQL simply feels slower. Nothing other than the version of PostgreSQL changed yesterday. I used pg_upgrade to perform the upgrade, and ran the generated analyze_new_cluster.sh immediately afterwards, which completed successfully.

Prior to the upgrade, I'd generally expect a load average of less than 2.00 on the master, and less than 1.00 on each of the slaves. Since the upgrade, the load average on the master has been in double digits (hitting 100.00 for a few minutes), and the slaves are consistently above 5.00.

There are a few things that jump out at me as behaving differently since the upgrade. vmstat's count of processes waiting for runtime has increased dramatically: prior to the upgrade it was consistently less than 10, but since upgrading it hovers between 40 & 60 at all times. The "Local timer interrupts" count in /proc/interrupts has increased dramatically as well; it used to hover around 6000 and is now over 20k much of the time. However, I'm starting to suspect that both are symptoms of the problem rather than the cause.

At this point, I'm looking for guidance on how to debug this problem more effectively.

thanks
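A rough sketch of how to watch the two counters described above (the interval and repeat counts are arbitrary choices, not values from this post):

########
# Run-queue depth: the first ("r") column of vmstat is processes
# waiting for runtime.
vmstat 5 5

# Local timer interrupts, summed across all CPUs, sampled twice to get
# a per-5-second delta.
loc() { awk '/LOC:/ {s=0; for (i=2; i<=NF && $i ~ /^[0-9]+$/; i++) s+=$i; print s}' /proc/interrupts; }
a=$(loc); sleep 5; b=$(loc)
echo "local timer interrupts over 5s: $((b - a))"
########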
On Tue, 17 Sep 2013 09:19:29 -0700, Lonni J Friedman <netllama@gmail.com> wrote:

> Greetings,
> I'm running a PostgreSQL 9.3.0 cluster (1 master with two streaming
> replication hot standby slaves) on RHEL6-x86_64. Yesterday I upgraded
> from 9.2.4 to 9.3.0, and since the upgrade I'm seeing a significant
> performance degradation. [...]
>
> At this point, I'm looking for guidance on how to debug this problem
> more effectively.

Don't know what happens, but:

a) Does analyze_new_cluster.sh include a reindex? If not, indexes are useless because the analyze statistics say so.

b) Did you configure postgresql.conf on 9.3.0 for your server/load? Perhaps it has default install values.

c) What do the logs say?

--
Eduardo Morras <emorrasg@yahoo.es>
Re: upgrade from 9.2.x to 9.3 causes significant performance degradation
From: Lonni J Friedman
On Tue, Sep 17, 2013 at 9:54 AM, Eduardo Morras <emorrasg@yahoo.es> wrote:
> a) Does analyze_new_cluster.sh include a reindex? If not, indexes are
> useless because the analyze statistics say so.

No, it doesn't include a reindex. It merely invokes "vacuumdb --all --analyze-only" several times, with increasing values of default_statistics_target and with vacuum_cost_delay=0. According to the documentation for pg_upgrade, post-upgrade scripts to rebuild tables and indexes will be generated automatically when they're needed. Nothing was generated for this purpose, at least not in any obvious place; analyze_new_cluster.sh is the only script that was automatically generated as far as I can tell.

> b) Did you configure postgresql.conf on 9.3.0 for your server/load?
> Perhaps it has default install values.

Yes, I'm using the same postgresql.conf as I was using when running 9.2.4. It's definitely not running with default install values.

> c) What do the logs say?

The postgres server logs look perfectly normal, minus non-trivially slower run times for most queries. There's nothing unusual in any of the OS-level logs (/var/log/messages, etc.) or dmesg.

Do you have any other suggestions?
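For reference, the generated script is just a staged ANALYZE; a paraphrased sketch of what the 9.3-era analyze_new_cluster.sh runs (not the verbatim script) looks like this:

########
# Pass 1: minimal statistics quickly, so the planner has something to work with.
export PGOPTIONS="-c default_statistics_target=1 -c vacuum_cost_delay=0"
vacuumdb --all --analyze-only

# Pass 2: somewhat better statistics.
export PGOPTIONS="-c default_statistics_target=10 -c vacuum_cost_delay=0"
vacuumdb --all --analyze-only

# Pass 3: full statistics using each column's configured target.
unset PGOPTIONS
vacuumdb --all --analyze-only
########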
On Tue, Sep 17, 2013 at 11:22 AM, Lonni J Friedman <netllama@gmail.com> wrote:
> > c) What do the logs say?
>
> The postgres server logs look perfectly normal, minus non-trivially
> slower run times for most queries. There's nothing unusual in any of
> the OS-level logs (/var/log/messages, etc.) or dmesg.
Are you generally CPU limited or IO limited?
Grab one of those slower queries and do an "explain (analyze, buffers)" of it. Preferably with track_io_timing on.
Do you still have test/dev/QA/etc copy running on 9.2 for comparison?
Cheers,
Jeff
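Concretely, Jeff's suggestion amounts to something like the following; the query is a placeholder for one of the statements that got slower, and the connection details are assumptions:

########
psql -U postgres nightly <<'SQL'
-- track_io_timing is settable per-session by a superuser; no restart needed.
SET track_io_timing = on;
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*) FROM pg_catalog.pg_class;  -- placeholder: use a real slow query
SQL
########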
Re: upgrade from 9.2.x to 9.3 causes significant performance degradation
From: Lonni J Friedman
Thanks for your reply. Comments/answers inline below....

On Tue, Sep 17, 2013 at 11:28 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> Are you generally CPU limited or IO limited?

top shows over 90% of the load is in sys space. vmstat output seems to suggest that it's CPU bound (or bouncing back & forth):

########
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
  r  b   swpd   free   buff     cache    si   so    bi    bo    in    cs us  sy id wa st
  1  0  17308 852016 141104 127074192     0    0    10    18     0     0  6   4 90  0  0
  0  0  17308 872316 141104 127074200     0    0     0   988   940   564  1   0 99  0  0
  0  0  17308 884288 141104 127074208     0    0     0  1921  1202  2132  1   0 99  0  0
  0  0  17308 898728 141104 127074208     0    0     0     0  1064   577  1   0 99  0  0
  2  0  17308 914920 141104 127074224     0    0     0    44   820   427  1   0 99  0  0
  0  0  17308 926524 141104 127074272     0    0     0    48  1173   585  1   0 99  0  0
108  1  17308 753648 141104 127074224     0    0     0   236  9825  3901 12   5 83  0  0
 50  0  17308 723156 141104 127074400     0    0     0   144 43481  9105 20  79  1  0  0
 45  0  17308 722860 141104 127074416     0    0     0     8 32969  1998  1  97  2  0  0
 47  0  17308 738996 141104 127074416     0    0     0     0 34099  1739  1  99  0  0  0
101  0  17308 770220 141104 127074480     0    0     0    32 38550  5998  7  93  0  0  0
101  0  17308 775732 141104 127074512     0    0     0   156 33889  5809  4  96  0  0  0
 99  0  17308 791232 141104 127074544     0    0     0     0 32385  4981  0 100  0  0  0
 96  0  17308 803156 141104 127074544     0    0     0    24 32413  4824  0 100  0  0  0
 87  0  17308 811624 141104 127074544     0    0     0     0 32438  4470  0 100  0  0  0
 83  0  17308 815500 141104 127074544     0    0     0     0 32489  4159  0 100  0  0  0
 80  0  17308 826572 141104 127074552     0    0     0    33 32582  3948  0 100  0  0  0
 73  0  17308 853264 141108 127074552     0    0     0    52 32833  3840  0 100  0  0  0
 73  0  17308 882240 141108 127074560     0    0     0     4 32820  3594  0 100  0  0  0
 72  0  17308 892256 141108 127074560     0    0     0     0 32368  3516  0 100  0  0  0
########

iostat consistently shows %util under 1.00, which also suggests that disk IO is not the bottleneck:

########
iostat -dx /dev/sdb 5
Linux 2.6.32-358.6.2.el6.x86_64 (cuda-db7)      09/17/2013      _x86_64_        (32 CPU)

Device: rrqm/s wrqm/s   r/s    w/s  rsec/s  wsec/s avgrq-sz avgqu-sz await svctm %util
sdb       0.02   0.21  6.91  31.33  651.60 1121.85    46.38     0.09  2.25  0.08  0.31

Device: rrqm/s wrqm/s   r/s    w/s  rsec/s  wsec/s avgrq-sz avgqu-sz await svctm %util
sdb       0.00   0.00  0.00   8.00    0.00   93.00    11.62     0.00  0.28  0.20  0.16

Device: rrqm/s wrqm/s   r/s    w/s  rsec/s  wsec/s avgrq-sz avgqu-sz await svctm %util
sdb       0.00   0.00  0.00  11.00    0.00  125.40    11.40     0.00  0.16  0.16  0.18

Device: rrqm/s wrqm/s   r/s    w/s  rsec/s  wsec/s avgrq-sz avgqu-sz await svctm %util
sdb       0.00   0.00  0.00 105.00    0.00 3380.40    32.19     0.29  2.76  0.03  0.34

Device: rrqm/s wrqm/s   r/s    w/s  rsec/s  wsec/s avgrq-sz avgqu-sz await svctm %util
sdb       0.00   0.00  0.00  14.80    0.00 2430.60   164.23     0.00  0.12  0.09  0.14

Device: rrqm/s wrqm/s   r/s    w/s  rsec/s  wsec/s avgrq-sz avgqu-sz await svctm %util
sdb       0.00   1.20  0.00  41.60    0.00 1819.40    43.74     0.02  0.45  0.05  0.20

Device: rrqm/s wrqm/s   r/s    w/s  rsec/s  wsec/s avgrq-sz avgqu-sz await svctm %util
sdb       0.00   0.00  0.00   2.80    0.00   32.00    11.43     0.00  0.00  0.00  0.00
########

mpstat also shows virtually zero iowait, with a ton of sys (CPU) time:

########
mpstat 2 10
Linux 2.6.32-358.6.2.el6.x86_64 (cuda-db7)      09/17/2013      _x86_64_        (32 CPU)

12:53:19 PM  CPU   %usr %nice   %sys %iowait  %irq  %soft %steal %guest  %idle
12:53:21 PM  all   7.36  0.00  92.58    0.00  0.00   0.03   0.00   0.00   0.03
12:53:23 PM  all   6.35  0.00  90.43    0.00  0.00   0.03   0.00   0.00   3.19
12:53:25 PM  all   3.13  0.00  68.20    0.00  0.00   0.02   0.00   0.00  28.66
12:53:27 PM  all   6.07  0.00  68.46    0.00  0.00   0.03   0.00   0.00  25.44
12:53:29 PM  all   5.83  0.00  94.14    0.00  0.00   0.03   0.00   0.00   0.00
12:53:31 PM  all   5.75  0.00  94.14    0.00  0.00   0.11   0.00   0.00   0.00
12:53:33 PM  all   7.65  0.00  40.32    0.00  0.00   0.03   0.00   0.00  52.00
12:53:35 PM  all   3.78  0.00   0.94    0.00  0.00   0.03   0.00   0.00  95.25
12:53:37 PM  all  16.09  0.00  14.33    0.02  0.00   0.05   0.00   0.00  69.51
12:53:39 PM  all   5.60  0.00  10.38    0.00  0.00   0.05   0.00   0.00  83.97
Average:     all   6.76  0.00  57.41    0.00  0.00   0.04   0.00   0.00  35.79
########

So unless I've misunderstood the output above, all of this suggests that the CPU is the bottleneck. If you'd like some other data, let me know and I'll be happy to capture it.

> Grab one of those slower queries and do an "explain (analyze, buffers)" of
> it. Preferably with track_io_timing on.

Just to be clear, just about every query is slow, not just certain queries. The entire server is slow, and the only thing that has changed is the version of postgres (granted, not a trivial change). Anyway, I picked a random query, connected with psql, and ran the following:

########
nightly=# SET track_io_timing = on;
SET
nightly=# EXPLAIN (analyze, buffers) SELECT count(*) FROM pg_catalog.pg_class AS c WHERE c.oid = pgpool_regclass('dblcufft_testlist') AND c.relpersistence = 'u';
                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=2779.71..2779.72 rows=1 width=0) (actual time=83.136..83.136 rows=1 loops=1)
   Buffers: shared hit=1633
   ->  Seq Scan on pg_class c  (cost=0.00..2779.70 rows=1 width=0) (actual time=83.131..83.131 rows=0 loops=1)
         Filter: ((relpersistence = 'u'::"char") AND (oid = pgpool_regclass('dblcufft_testlist'::cstring)))
         Rows Removed by Filter: 70393
         Buffers: shared hit=1633
 Total runtime: 83.181 ms
(7 rows)
########

I'm not quite sure that I understand how to interpret this output. I didn't have track_io_timing enabled by default. Does it require a restart to enable? The official docs seem to suggest that there should be a number of additional fields, which I'm not seeing.

> Do you still have test/dev/QA/etc copy running on 9.2 for comparison?

Not live. I'd have to restore from a backup, and my test environment is not the same class of hardware as production. I could get that process underway if you wanted. What did you have in mind?
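On the restart question: no; track_io_timing can be turned on per session by a superuser. The missing "additional fields" never appear when there is no actual I/O, and the plan above is all shared-buffer hits, so there is no I/O time to report. A quick way to verify the setting is changeable at run time (connection details are assumptions):

########
# "context" should read "superuser", i.e. settable without a restart.
psql -U postgres -c \
  "SELECT name, setting, context FROM pg_settings WHERE name = 'track_io_timing';"
########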
Hi,

On 2013-09-17 09:19:29 -0700, Lonni J Friedman wrote:
> I'm running a PostgreSQL 9.3.0 cluster (1 master with two streaming
> replication hot standby slaves) on RHEL6-x86_64. Yesterday I upgraded
> from 9.2.4 to 9.3.0, and since the upgrade I'm seeing a significant
> performance degradation. [...]

Where did you get 9.3.0 from? Did you compile it yourself? Any chance you compiled with --enable-cassert or some such?

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: upgrade from 9.2.x to 9.3 causes significant performance degradation
From: Lonni J Friedman
On Tue, Sep 17, 2013 at 3:47 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> Where did you get 9.3.0 from? Did you compile it yourself? Any chance
> you compiled with --enable-cassert or some such?

Directly from http://yum.postgresql.org. So unless the RPMs on there are built weird/wrong, I don't think that's the problem.
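Either way, an assert-enabled build is quick to rule out directly; for example (a sketch, assuming pg_config and psql from the 9.3 install are in $PATH):

########
# A cassert build lists the flag among its configure options ...
pg_config --configure | grep -q 'enable-cassert' && echo "built with --enable-cassert"

# ... and reports it at run time.
psql -U postgres -c "SHOW debug_assertions;"
########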
Lonni J Friedman <netllama@gmail.com> wrote:
> top shows over 90% of the load is in sys space. vmstat output
> seems to suggest that it's CPU bound (or bouncing back & forth):

Can you run `perf top` during an episode and see what kernel functions are using all that CPU?

This looks similar to cases I've seen of THP defrag going wild. Did the OS version or configuration change? Did the PostgreSQL memory settings (like shared_buffers) change?

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
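For anyone following along, the state Kevin is asking about can be captured like this (the sysfs path is the one used later in this thread; some RHEL6 kernels expose /sys/kernel/mm/redhat_transparent_hugepage instead):

########
# Live view of which kernel functions are burning CPU (quit with 'q').
perf top

# Current THP policy; the value in brackets is the active one.
cat /sys/kernel/mm/transparent_hugepage/enabled
cat /sys/kernel/mm/transparent_hugepage/defrag
########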
Re: upgrade from 9.2.x to 9.3 causes significant performance degradation
From: Lonni J Friedman
On Wed, Sep 18, 2013 at 2:02 AM, Kevin Grittner <kgrittn@ymail.com> wrote:
> Can you run `perf top` during an episode and see what kernel
> functions are using all that CPU?

Oddly, the problem went away on its own yesterday just after 4PM, and performance has remained 'normal' since that time. I changed absolutely nothing. If/when it returns, I'll certainly capture that output.

> This looks similar to cases I've seen of THP defrag going wild.
> Did the OS version or configuration change? Did the PostgreSQL
> memory settings (like shared_buffers) change?

Nothing changed other than the version of postgres. I re-used the same postgresql.conf that was in place when running 9.2.x. Anyway, here are the current THP-related settings on the server:

########
[root@cuda-db7 ~]# grep AnonHugePages /proc/meminfo
AnonHugePages:    548864 kB
[root@cuda-db7 ~]# egrep 'trans|thp' /proc/vmstat
nr_anon_transparent_hugepages 272
thp_fault_alloc 129173889
thp_fault_fallback 17462551
thp_collapse_alloc 148437
thp_collapse_alloc_failed 15143
thp_split 242
########
Re: upgrade from 9.2.x to 9.3 causes significant performance degradation
From: Lonni J Friedman
On Wed, Sep 18, 2013 at 2:02 AM, Kevin Grittner <kgrittn@ymail.com> wrote:
> Can you run `perf top` during an episode and see what kernel
> functions are using all that CPU?

I take back what I said earlier. While the master is currently back to normal performance, the two hot standby slaves are still churning something awful. If I run 'perf top' on either slave, after a few seconds these are consistently the top three in the list:

########
 84.57%  [kernel]    [k] _spin_lock_irqsave
  6.21%  [unknown]   [.] 0x0000000000659f60
  4.69%  [kernel]    [k] compaction_alloc
########

> This looks similar to cases I've seen of THP defrag going wild.
> Did the OS version or configuration change? Did the PostgreSQL
> memory settings (like shared_buffers) change?

I think you're onto something here with respect to THP defrag going wild. I set /sys/kernel/mm/transparent_hugepage/defrag to 'never' and immediately the load dropped on both slaves from over 5.00 to under 1.00.

So this raises the question: is this a kernel bug, or is there some other solution to the problem? Also, it seems weird that the problem didn't happen until I switched from 9.2 to 9.3. Is it possible this is somehow related to the change from using SysV shared memory to using Posix shared memory and mmap for memory management?
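For reference, the workaround Lonni applied, plus one common way to make it survive a reboot on RHEL6 (the rc.local step is a convention, not something from this thread):

########
# Stop the kernel from compacting/defragmenting pages on behalf of THP.
echo never > /sys/kernel/mm/transparent_hugepage/defrag

# sysfs settings don't persist, so re-apply the change at boot.
cat >> /etc/rc.local <<'EOF'
echo never > /sys/kernel/mm/transparent_hugepage/defrag
EOF
########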
On Wed, Sep 18, 2013 at 10:30 AM, Lonni J Friedman <netllama@gmail.com> wrote:
> > This looks similar to cases I've seen of THP defrag going wild.
> > Did the OS version or configuration change? Did the PostgreSQL
> > memory settings (like shared_buffers) change?
>
> I think you're onto something here with respect to THP defrag going
> wild. I set /sys/kernel/mm/transparent_hugepage/defrag to 'never' and
> immediately the load dropped on both slaves from over 5.00 to under
> 1.00.
>
> So this raises the question: is this a kernel bug, or is there some
> other solution to the problem?
>
> Also, it seems weird that the problem didn't happen until I switched from
> 9.2 to 9.3. Is it possible this is somehow related to the change from
> using SysV shared memory to using Posix shared memory and mmap for
> memory management?
I would guess that it is probably a kernel bug which gets efficiently exercised by the SysV to Posix change.
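Background on that change: 9.3 keeps only a small SysV segment and allocates the main shared memory region with anonymous mmap, so shared_buffers becomes ordinary anonymous memory that THP's compaction machinery will try to defragment. A rough way to see the difference on a running cluster (the data directory path assumes the PGDG RHEL6 layout; adjust for your install):

########
# Under 9.3, 'ipcs -m' shows only a small SysV segment per cluster,
# not one sized like shared_buffers as under 9.2.
ipcs -m

# The big region now shows up as a large anonymous mapping owned by the
# postmaster, whose PID is the first line of postmaster.pid.
PGPID=$(head -1 /var/lib/pgsql/9.3/data/postmaster.pid)
pmap "$PGPID" | sort -k2 -h | tail -3
########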
Cheers,
Jeff