[HACKERS] Performance issue after upgrading from 9.4 to 9.6

From: Naytro Naytro
Date: 2017-03-09 13:47:35 +0100
Hi,

We are having some performance issues after we upgraded to the newest
version of PostgreSQL; before that, everything was fast and smooth.

The upgrade was done with pg_upgrade, from 9.4 directly to 9.6.1. We have
since upgraded to 9.6.2 with no improvement.

Some information about our setup: FreeBSD and Solaris (SmartOS), a simple
master-slave setup using streaming replication.

Problem:
Very high system CPU when the master is streaming replication data; CPU
goes up to 77%. Only one process is generating this load: the PostgreSQL
startup process. When I attached truss to this process, I saw a lot of
read calls with almost the same number of errors (EAGAIN).

root(at)d8:~ # truss -c -p 38091
^Csyscall                     seconds   calls  errors
semop                   0.001611782     198       0
write                   0.000074404       2       0
read                    2.281535100   17266   12375
openat                  0.000683532      48       0
lseek                   0.177612479   20443       0
close                   0.000395549      48       0
                      ------------- ------- -------
                        2.461912846   38005   12375

read(6,0x7fffffffa0c7,1) ERR#35 'Resource temporarily unavailable'
lseek(444,0x0,SEEK_END) = 32571392 (0x1f10000)
read(6,0x7fffffffa0c7,1) ERR#35 'Resource temporarily unavailable'
lseek(444,0x0,SEEK_END) = 32571392 (0x1f10000)
read(6,0x7fffffffa0c7,1) ERR#35 'Resource temporarily unavailable'
lseek(7,0x0,SEEK_END) = 164487168 (0x9cde000)
lseek(778,0x0,SEEK_END) = 57344 (0xe000)
read(6,0x7fffffffa0c7,1) ERR#35 'Resource temporarily unavailable'
lseek(894,0x0,SEEK_END) = 139296768 (0x84d8000)
lseek(894,0x4b7e000,SEEK_SET) = 79159296 (0x4b7e000)
read(894," ~\0\08\a\M--m\0\0\^A\0\M^T\0000"...,8192) = 8192 (0x2000)
lseek(3,0xfa6000,SEEK_SET) = 16408576 (0xfa6000)
read(3,"\M^S\M-P\^E\0\^A\0\0\0\0`\M-z"...,8192) = 8192 (0x2000)
read(6,0x7fffffffa0c7,1) ERR#35 'Resource temporarily unavailable'
lseek(894,0x0,SEEK_END) = 139296768 (0x84d8000)
lseek(894,0x0,SEEK_END) = 139296768 (0x84d8000)
lseek(894,0x449c000,SEEK_SET) = 71942144 (0x449c000)
read(894,"\^_~\0\0\M-H\M-H\M-B\M-b\0\0\^E"...,8192) = 8192 (0x2000)
lseek(818,0x0,SEEK_END) = 57344 (0xe000)
read(6,0x7fffffffa0c7,1) ERR#35 'Resource temporarily unavailable'
lseek(442,0x0,SEEK_END) = 10174464 (0x9b4000)
lseek(442,0x4c4000,SEEK_SET) = 4997120 (0x4c4000)
read(442,"\^_~\0\0\M-P\M-+\M-1\M-b\0\0\0\0"...,8192) = 8192 (0x2000)
read(6,0x7fffffffa0c7,1) ERR#35 'Resource temporarily unavailable'

Descriptor 6 is a pipe

The read call tries to read one byte over and over. I looked at the
source code and I think this file is responsible for this behavior:
src/backend/storage/ipc/latch.c. There was no such file in 9.4.
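
For reference, the pattern in question is the latch self-pipe drain. This
is a paraphrased sketch of what that loop does, not the exact 9.6 source:
the read end of the pipe is O_NONBLOCK, so once the pipe is empty every
further read() fails immediately with EAGAIN, which is exactly the ERR#35
pattern in the truss output above.

#include <errno.h>
#include <unistd.h>

/*
 * Paraphrase of the self-pipe drain loop in latch.c -- not the exact
 * 9.6 source.  SetLatch() writes a single byte into the pipe to wake
 * the waiter; the waiter drains it one byte at a time until read()
 * fails with EAGAIN.
 */
static void
drain_self_pipe(int selfpipe_readfd)
{
    char        buf;

    for (;;)
    {
        int         rc = read(selfpipe_readfd, &buf, 1);

        if (rc == 1)
            continue;           /* consumed a wakeup byte, keep draining */
        if (rc < 0 && errno == EINTR)
            continue;           /* interrupted by a signal, retry */
        break;                  /* EAGAIN (pipe drained), EOF, or error */
    }
}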

Re: [HACKERS] Performance issue after upgrading from 9.4 to 9.6

From: Robert Haas
Date: 2017-03-09 18:28 GMT+01:00
On Thu, Mar 9, 2017 at 7:47 AM, Naytro Naytro <naytro@googlemail.com> wrote:
> We are having some performance issues after we upgraded to the newest
> version of PostgreSQL; before that, everything was fast and smooth.
>
> The upgrade was done with pg_upgrade, from 9.4 directly to 9.6.1. We have
> since upgraded to 9.6.2 with no improvement.
>
> Some information about our setup: FreeBSD and Solaris (SmartOS), a simple
> master-slave setup using streaming replication.
>
> Problem:
> Very high system CPU when the master is streaming replication data; CPU
> goes up to 77%. Only one process is generating this load: the PostgreSQL
> startup process. When I attached truss to this process, I saw a lot of
> read calls with almost the same number of errors (EAGAIN).
> read(6,0x7fffffffa0c7,1) ERR#35 'Resource temporarily unavailable'
>
> Descriptor 6 is a pipe
>
> The read call tries to read one byte over and over. I looked at the
> source code and I think this file is responsible for this behavior:
> src/backend/storage/ipc/latch.c. There was no such file in 9.4.

Our latch implementation did get overhauled pretty thoroughly in 9.6;
see primarily commit 98a64d0bd713cb89e61bef6432befc4b7b5da59e.  But I
can't guess what is going wrong here based on this information.  It
might help if you can pull some stack backtraces from the startup
process.
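
For example, something along these lines (using the PID from the truss
run above; adjust as needed), repeated a few times while it's spinning:

gdb -p 38091 -batch -ex 'bt'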

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Performance issue after upgrading from 9.4 to 9.6

From: Andres Freund
Date: 2017-03-09 20:19 GMT+01:00
Hi,

On 2017-03-09 13:47:35 +0100, Naytro Naytro wrote:
> We are having some performance issues after we upgraded to the newest
> version of PostgreSQL; before that, everything was fast and smooth.
>
> The upgrade was done with pg_upgrade, from 9.4 directly to 9.6.1. We have
> since upgraded to 9.6.2 with no improvement.
>
> Some information about our setup: FreeBSD and Solaris (SmartOS), a simple
> master-slave setup using streaming replication.

Which node is on which of those, and where is the high load?


> Problem:
> Very high system CPU when the master is streaming replication data; CPU
> goes up to 77%. Only one process is generating this load: the PostgreSQL
> startup process. When I attached truss to this process, I saw a lot of
> read calls with almost the same number of errors (EAGAIN).

Hm. Just to clarify: The load is on the *receiving* side, in the startup
process?  Because the load doesn't quite look that way...


> read(6,0x7fffffffa0c7,1) ERR#35 'Resource temporarily unavailable'
> 
> Descriptor 6 is a pipe

That's presumably the latch's internal self-pipe.  Could you redo that
truss/strace with timestamps attached?  Does truss show signals
received? The above profile would e.g. make a lot more sense if not.  Is
the WAL receiver sending signals?
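
(E.g. something like "truss -d -p <pid>" on FreeBSD, or
"strace -tt -p <pid>" where strace is available; the -d / -tt flags add
per-call timestamps.)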


> The read call tries to read one byte over and over. I looked at the
> source code and I think this file is responsible for this behavior:
> src/backend/storage/ipc/latch.c. There was no such file in 9.4.

It was "just" moved (and expanded), used to be at
src/backend/port/unix_latch.c.

There normally shouldn't be that much "latch traffic" in the startup
process; we'd expect it to block from within WaitForWALToBecomeAvailable().
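
Simplified, the wait we'd expect to see there is the canonical latch
loop, something like this (paraphrased, not the exact 9.6 source; the
timeout value is illustrative):

/*
 * Sketch of the expected wait in WaitForWALToBecomeAvailable(): the
 * startup process should mostly sleep in WaitLatch() until the WAL
 * receiver sets recoveryWakeupLatch or the timeout expires, not spin
 * on the latch pipe.
 */
for (;;)
{
    /* ... if new WAL is available, return and apply it ... */

    (void) WaitLatch(&XLogCtl->recoveryWakeupLatch,
                     WL_LATCH_SET | WL_TIMEOUT,
                     5000L);    /* timeout value is illustrative */
    ResetLatch(&XLogCtl->recoveryWakeupLatch);
}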

Hm.  Any chance you've configured a recovery_min_apply_delay?  Although
I'd expect more timestamp calls in that case.
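
(That would be a setting in the standby's recovery.conf, e.g.
recovery_min_apply_delay = '5min'.)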


Greetings,

Andres Freund



Re: [HACKERS] Performance issue after upgrading from 9.4 to 9.6

From: Naytro Naytro
Date:


2017-03-09 20:19 GMT+01:00 Andres Freund <andres@anarazel.de>:
> Hi,
>
> On 2017-03-09 13:47:35 +0100, Naytro Naytro wrote:
>> We are having some performance issues after we upgraded to the newest
>> version of PostgreSQL; before that, everything was fast and smooth.
>>
>> The upgrade was done with pg_upgrade, from 9.4 directly to 9.6.1. We have
>> since upgraded to 9.6.2 with no improvement.
>>
>> Some information about our setup: FreeBSD and Solaris (SmartOS), a simple
>> master-slave setup using streaming replication.
>
> Which node is on which of those, and where is the high load?

The high load is only on the slaves: FreeBSD (master+slave) and Solaris
(slaves only).

>> Problem:
>> Very high system CPU when the master is streaming replication data; CPU
>> goes up to 77%. Only one process is generating this load: the PostgreSQL
>> startup process. When I attached truss to this process, I saw a lot of
>> read calls with almost the same number of errors (EAGAIN).
>
> Hm. Just to clarify: The load is on the *receiving* side, in the startup
> process?  Because the load doesn't quite look that way...

Yes.

>> read(6,0x7fffffffa0c7,1) ERR#35 'Resource temporarily unavailable'
>>
>> Descriptor 6 is a pipe
>
> That's presumably the latch's internal self-pipe.  Could you redo that
> truss/strace with timestamps attached?  Does truss show signals
> received? The above profile would e.g. make a lot more sense if not.  Is
> the WAL receiver sending signals?

FreeBSD truss should show signals by default.

DTrace from Solaris: http://pastebin.com/u03uVKbr
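
If anyone wants to reproduce the syscall counts, a one-liner along these
lines should give a similar picture (illustrative; not necessarily the
exact script used for the paste):

dtrace -n 'syscall:::entry /pid == $target/ { @[probefunc] = count(); }' -p <pid>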
 

>> The read call tries to read one byte over and over. I looked at the
>> source code and I think this file is responsible for this behavior:
>> src/backend/storage/ipc/latch.c. There was no such file in 9.4.
>
> It was "just" moved (and expanded), used to be at
> src/backend/port/unix_latch.c.
>
> There normally shouldn't be that much "latch traffic" in the startup
> process; we'd expect it to block from within WaitForWALToBecomeAvailable().
>
> Hm.  Any chance you've configured a recovery_min_apply_delay?  Although
> I'd expect more timestamp calls in that case.

No, we don't have this option configured.
 

> Greetings,
>
> Andres Freund

Re: [HACKERS] Performance issue after upgrading from 9.4 to 9.6

From: Naytro Naytro
Date:


2017-03-09 18:28 GMT+01:00 Robert Haas <robertmhaas@gmail.com>:
> On Thu, Mar 9, 2017 at 7:47 AM, Naytro Naytro <naytro@googlemail.com> wrote:
>> We are having some performance issues after we upgraded to the newest
>> version of PostgreSQL; before that, everything was fast and smooth.
>>
>> The upgrade was done with pg_upgrade, from 9.4 directly to 9.6.1. We have
>> since upgraded to 9.6.2 with no improvement.
>>
>> Some information about our setup: FreeBSD and Solaris (SmartOS), a simple
>> master-slave setup using streaming replication.
>>
>> Problem:
>> Very high system CPU when the master is streaming replication data; CPU
>> goes up to 77%. Only one process is generating this load: the PostgreSQL
>> startup process. When I attached truss to this process, I saw a lot of
>> read calls with almost the same number of errors (EAGAIN).
>> read(6,0x7fffffffa0c7,1) ERR#35 'Resource temporarily unavailable'
>>
>> Descriptor 6 is a pipe
>>
>> The read call tries to read one byte over and over. I looked at the
>> source code and I think this file is responsible for this behavior:
>> src/backend/storage/ipc/latch.c. There was no such file in 9.4.
>
> Our latch implementation did get overhauled pretty thoroughly in 9.6;
> see primarily commit 98a64d0bd713cb89e61bef6432befc4b7b5da59e.  But I
> can't guess what is going wrong here based on this information.  It
> might help if you can pull some stack backtraces from the startup
> process.

DTrace from Solaris: http://pastebin.com/u03uVKbr