Thread: [HACKERS] Performance issue after upgrading from 9.4 to 9.6
Hi,
We are having some performance issues after we upgraded to the newest
version of PostgreSQL; before that, everything was fast and smooth.
The upgrade was done with pg_upgrade, from 9.4 directly to 9.6.1. We
have since upgraded to 9.6.2 with no improvement.
Some information about our setup: FreeBSD, Solaris (SmartOS), simple
master-slave using streaming replication.
Problem:
Very high system CPU when the master is streaming replication data;
CPU goes up to 77%. Only one process is generating this load: the
PostgreSQL startup process. When I attached truss to this process I
saw a lot of read calls with almost the same number of errors (EAGAIN).
root@d8:~ # truss -c -p 38091
^C
syscall            seconds   calls  errors
semop          0.001611782     198       0
write          0.000074404       2       0
read           2.281535100   17266   12375
openat         0.000683532      48       0
lseek          0.177612479   20443       0
close          0.000395549      48       0
             ------------- ------- -------
               2.461912846   38005   12375
read(6,0x7fffffffa0c7,1) ERR#35 'Resource temporarily unavailable'
lseek(444,0x0,SEEK_END) = 32571392 (0x1f10000)
read(6,0x7fffffffa0c7,1) ERR#35 'Resource temporarily unavailable'
lseek(444,0x0,SEEK_END) = 32571392 (0x1f10000)
read(6,0x7fffffffa0c7,1) ERR#35 'Resource temporarily unavailable'
lseek(7,0x0,SEEK_END) = 164487168 (0x9cde000)
lseek(778,0x0,SEEK_END) = 57344 (0xe000)
read(6,0x7fffffffa0c7,1) ERR#35 'Resource temporarily unavailable'
lseek(894,0x0,SEEK_END) = 139296768 (0x84d8000)
lseek(894,0x4b7e000,SEEK_SET) = 79159296 (0x4b7e000)
read(894," ~\0\08\a\M--m\0\0\^A\0\M^T\0000"...,8192) = 8192 (0x2000)
lseek(3,0xfa6000,SEEK_SET) = 16408576 (0xfa6000)
read(3,"\M^S\M-P\^E\0\^A\0\0\0\0`\M-z"...,8192) = 8192 (0x2000)
read(6,0x7fffffffa0c7,1) ERR#35 'Resource temporarily unavailable'
lseek(894,0x0,SEEK_END) = 139296768 (0x84d8000)
lseek(894,0x0,SEEK_END) = 139296768 (0x84d8000)
lseek(894,0x449c000,SEEK_SET) = 71942144 (0x449c000)
read(894,"\^_~\0\0\M-H\M-H\M-B\M-b\0\0\^E"...,8192) = 8192 (0x2000)
lseek(818,0x0,SEEK_END) = 57344 (0xe000)
read(6,0x7fffffffa0c7,1) ERR#35 'Resource temporarily unavailable'
lseek(442,0x0,SEEK_END) = 10174464 (0x9b4000)
lseek(442,0x4c4000,SEEK_SET) = 4997120 (0x4c4000)
read(442,"\^_~\0\0\M-P\M-+\M-1\M-b\0\0\0\0"...,8192) = 8192 (0x2000)
read(6,0x7fffffffa0c7,1) ERR#35 'Resource temporarily unavailable'
Descriptor 6 is a pipe
The read call tries to read one byte over and over. I looked at the
source code and I think src/backend/storage/ipc/latch.c is responsible
for this behavior. There was no such file in 9.4.
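To illustrate the pattern I mean, here is a minimal standalone sketch
(my own simplification of the self-pipe idea I believe latch.c uses,
not the actual source):

/*
 * selfpipe_drain.c -- simplified sketch of the self-pipe latch pattern.
 * NOT the actual latch.c source, just the same idea.
 */
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
    int           fds[2];
    struct pollfd pfd;
    char          c;

    if (pipe(fds) < 0)
    {
        perror("pipe");
        return 1;
    }
    fcntl(fds[0], F_SETFL, O_NONBLOCK);

    /* SetLatch() side: wake the waiter by writing a single byte. */
    (void) write(fds[1], "x", 1);

    /* Waiting side: block until the read end becomes readable... */
    pfd.fd = fds[0];
    pfd.events = POLLIN;
    (void) poll(&pfd, 1, -1);

    /*
     * ...then drain it with non-blocking one-byte reads.  The final
     * read always fails with EAGAIN ("Resource temporarily
     * unavailable", ERR#35 in truss).  One failed read per wakeup is
     * expected; thousands per second suggest the latch is being set
     * at a very high rate.
     */
    while (read(fds[0], &c, 1) > 0)
        ;
    if (errno == EAGAIN)
        printf("pipe drained; final read returned EAGAIN as expected\n");

    return 0;
}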
On Thu, Mar 9, 2017 at 7:47 AM, Naytro Naytro <naytro@googlemail.com> wrote:
> We are having some performance issues after we upgraded to the newest
> version of PostgreSQL; before that, everything was fast and smooth.
>
> The upgrade was done with pg_upgrade, from 9.4 directly to 9.6.1. We
> have since upgraded to 9.6.2 with no improvement.
>
> Some information about our setup: FreeBSD, Solaris (SmartOS), simple
> master-slave using streaming replication.
>
> Problem:
> Very high system CPU when the master is streaming replication data;
> CPU goes up to 77%. Only one process is generating this load: the
> PostgreSQL startup process. When I attached truss to this process I
> saw a lot of read calls with almost the same number of errors (EAGAIN).
> read(6,0x7fffffffa0c7,1) ERR#35 'Resource temporarily unavailable'
>
> Descriptor 6 is a pipe
>
> The read call tries to read one byte over and over. I looked at the
> source code and I think src/backend/storage/ipc/latch.c is responsible
> for this behavior. There was no such file in 9.4.

Our latch implementation did get overhauled pretty thoroughly in 9.6;
see primarily commit 98a64d0bd713cb89e61bef6432befc4b7b5da59e. But I
can't guess what is going wrong here based on this information. It
might help if you can pull some stack backtraces from the startup
process.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

On 2017-03-09 13:47:35 +0100, Naytro Naytro wrote:
> We are having some performance issues after we upgraded to the newest
> version of PostgreSQL; before that, everything was fast and smooth.
>
> The upgrade was done with pg_upgrade, from 9.4 directly to 9.6.1. We
> have since upgraded to 9.6.2 with no improvement.
>
> Some information about our setup: FreeBSD, Solaris (SmartOS), simple
> master-slave using streaming replication.

Which node is on which of those, and where is the high load?

> Problem:
> Very high system CPU when the master is streaming replication data;
> CPU goes up to 77%. Only one process is generating this load: the
> PostgreSQL startup process. When I attached truss to this process I
> saw a lot of read calls with almost the same number of errors (EAGAIN).

Hm. Just to clarify: The load is on the *receiving* side, in the startup
process? Because the load doesn't quite look that way...

> read(6,0x7fffffffa0c7,1) ERR#35 'Resource temporarily unavailable'
>
> Descriptor 6 is a pipe

That's presumably a latch's internal pipe. Could you redo that
truss/strace with timestamps attached? Does truss show signals
received? The above profile would e.g. make a lot more sense if not. Is
the wal receiver sending signals?

> The read call tries to read one byte over and over. I looked at the
> source code and I think src/backend/storage/ipc/latch.c is responsible
> for this behavior. There was no such file in 9.4.

It was "just" moved (and expanded); it used to be at
src/backend/port/unix_latch.c.

There normally shouldn't be that much "latch traffic" in the startup
process; we'd expect it to block from within
WaitForWALToBecomeAvailable().

Hm. Any chance you've configured a recovery_min_apply_delay? Although
I'd expect more timestamp calls in that case.

Greetings,

Andres Freund
2017-03-09 20:19 GMT+01:00 Andres Freund <andres@anarazel.de>:
> Hi,
>
> On 2017-03-09 13:47:35 +0100, Naytro Naytro wrote:
>> We are having some performance issues after we upgraded to the newest
>> version of PostgreSQL; before that, everything was fast and smooth.
>>
>> The upgrade was done with pg_upgrade, from 9.4 directly to 9.6.1. We
>> have since upgraded to 9.6.2 with no improvement.
>>
>> Some information about our setup: FreeBSD, Solaris (SmartOS), simple
>> master-slave using streaming replication.
>
> Which node is on which of those, and where is the high load?

The high load is only on the slaves: FreeBSD (master+slave) and Solaris
(slaves only).

>> Problem:
>> Very high system CPU when the master is streaming replication data;
>> CPU goes up to 77%. Only one process is generating this load: the
>> PostgreSQL startup process. When I attached truss to this process I
>> saw a lot of read calls with almost the same number of errors (EAGAIN).
>
> Hm. Just to clarify: The load is on the *receiving* side, in the startup
> process? Because the load doesn't quite look that way...

Yes.

>> read(6,0x7fffffffa0c7,1) ERR#35 'Resource temporarily unavailable'
>>
>> Descriptor 6 is a pipe
>
> That's presumably a latch's internal pipe. Could you redo that
> truss/strace with timestamps attached? Does truss show signals
> received? The above profile would e.g. make a lot more sense if not. Is
> the wal receiver sending signals?

Truss from Solaris: http://pastebin.com/WajedZ8Y and from FreeBSD:
http://pastebin.com/DB5iT8na

FreeBSD truss should show signals by default.

DTrace from Solaris: http://pastebin.com/u03uVKbr

>> The read call tries to read one byte over and over. I looked at the
>> source code and I think src/backend/storage/ipc/latch.c is responsible
>> for this behavior. There was no such file in 9.4.
>
> It was "just" moved (and expanded); it used to be at
> src/backend/port/unix_latch.c.
>
> There normally shouldn't be that much "latch traffic" in the startup
> process; we'd expect it to block from within
> WaitForWALToBecomeAvailable().
>
> Hm. Any chance you've configured a recovery_min_apply_delay? Although
> I'd expect more timestamp calls in that case.

No, we don't have this option configured.
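To make the "latch traffic" theory concrete, here is a toy program (my
own sketch, not PostgreSQL code) that reproduces the shape of the truss
output: a child process writes single bytes to a pipe, the way frequent
SetLatch() calls would, while the parent wakes up via poll() and drains
the pipe with non-blocking one-byte reads until EAGAIN:

/*
 * latch_traffic.c -- toy reproduction of frequent latch wakeups.
 * NOT PostgreSQL source; just the same self-pipe pattern.
 */
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

int
main(void)
{
    int     fds[2];
    pid_t   pid;
    long    reads = 0;
    long    eagains = 0;
    int     done = 0;

    if (pipe(fds) < 0)
    {
        perror("pipe");
        return 1;
    }

    pid = fork();
    if (pid < 0)
    {
        perror("fork");
        return 1;
    }
    if (pid == 0)
    {
        /* Child: plays the walreceiver, "setting the latch" often. */
        int     i;

        close(fds[0]);
        for (i = 0; i < 10000; i++)
            (void) write(fds[1], "x", 1);
        close(fds[1]);
        _exit(0);
    }

    /* Parent: plays the startup process waiting on its latch. */
    close(fds[1]);
    fcntl(fds[0], F_SETFL, O_NONBLOCK);

    while (!done)
    {
        struct pollfd pfd;

        pfd.fd = fds[0];
        pfd.events = POLLIN;
        if (poll(&pfd, 1, -1) < 0)
        {
            perror("poll");
            return 1;
        }

        /* Drain the pipe one byte at a time until it reports empty. */
        for (;;)
        {
            char    c;
            ssize_t n;

            reads++;
            n = read(fds[0], &c, 1);
            if (n < 0 && errno == EAGAIN)
            {
                eagains++;      /* shows up as ERR#35 in truss */
                break;
            }
            if (n <= 0)
            {
                done = 1;       /* EOF: writer exited, or a hard error */
                break;
            }
        }
    }

    printf("%ld reads, %ld EAGAIN\n", reads, eagains);
    return 0;
}

If the walreceiver really does set the startup process's latch for
every small chunk of WAL it applies, a read/EAGAIN profile like the one
in my first mail is what you would expect to see.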