Thread: [GENERAL] Causeless CPU load waves in backend, on windows, 9.5.5 (EDB binary).

[GENERAL] Causeless CPU load waves in backend, on windows, 9.5.5 (EDB binary).

From

Nikolai Zhubr

Date:

30 January 2017, 12:16:25

Hello all,

(Hopefully this is right place to post on the subject, otherwise please
let me know.)

I'm observing some strange inexplicable effect in 9.5.5 server running
on x86 windows (32-bit windows xp sp3). That is, CPU usage in backend
process for the session in question starts to grow, going from approx
0-1 to 8-15 percent and more, stays that high for several seconds, then
goes back to 0-1. All the effect takes about 15-30 seconds, and repeats
stably every 10-20 minutes (as long as the respective client continues
to run the same queries). Apparently it is essential to consider the
pattern of requests going to the server: in this case there is a
continuous stream of very small cheap queries, but quite a lot of them
per second. Trying to understand the reason, I've managed to craft a
pure artifical test triggering very similar CPU load waves without the
need for any specific database at all, easy to repeat:

1. "select localtimestamp" 40 times (As separate requests, one by one,
but no delay inserted in between)
2. wait 1/2 second.
3. goto 1

That's it. Just let it run for > 20 minutes in one session. (These
queries are so cheap that normally they consume approx zero resources)

Screenshots: https://yadi.sk/i/J_yj_0t43BgdGw
(I can also send as file if this link does not work)

Other notes:
- the server instance in question is EDB 9.5.5-1 win32 binaries.
- the production machine is Core 2 duo 2600MHz, 2Gb ram, typical CPU
load is rather low, like 0% to 3%, therefore the effect is easily noticable.
- no antivirus or other weird or abusive software involved.
- communication goes through libpq (tcp/ip only).
- turning SSL on/off does not matter.
- my test application issueing "select localtimestamp" in the endless
loop is written in pascal, although this probably does not matter.
- query execution time as seen by the client is not affected (stays low).
- pausing the test in the client causes backend to drop CPU usage
immediately to 0, resuming causes it to go back to where it was (unless
pausing for too long), so excessive CPU load is tied to some normal
activity of backend (i.e. no activity == no load).
- the effect is NOT observed (yet?) when running test on a server
machine directly (pointing it to 127.0.0.1)
- the effect looks more substantial in SMP case (2 Cores), compared to
UMP case (when testing in a VM, see below).
- nothing appears in the log.

To me it looks like some sort of wait/check/synchronize issue for a
socket/lock/signal or similar. Probably some rare corner case, probably
windows-specific. However, looking through backend/port/win32/socket.c
and backend/port/win32_latch.c I cannot immediately see anything wrong
yet (but WaitForMultipleObjects is a tricky thing IIRC).

Luckily I've managed to construct a configuration very similar to
production machine in a development VM (VirtualBox) and reproduced the
effect there, so now I'm able to safely and comfortably test whatever
ideas might appear (Although I'm somewhat reluctant to rebuild and
hand-debug the server here myself, because preparing proper build
environment on windows is quite a lot of work...)

Any help/thoughts/instructions greatly appreciated.

Should I also report this to bugtracker at postgresql.org?


Thank you,
Nikolai

Re: [GENERAL] Causeless CPU load waves in backend, on windows, 9.5.5(EDB binary).

From

Michael Paquier

Date:

31 January 2017, 06:43:07

On Mon, Jan 30, 2017 at 6:16 PM, Nikolai Zhubr <n-a-zhubr@yandex.ru> wrote:
> I'm observing some strange inexplicable effect in 9.5.5 server running on
> x86 windows (32-bit windows xp sp3).

Oh, well... You are aware that this is out of support by Microsoft, right?

> That is, CPU usage in backend process
> for the session in question starts to grow, going from approx 0-1 to 8-15
> percent and more, stays that high for several seconds, then goes back to
> 0-1. All the effect takes about 15-30 seconds, and repeats stably every
> 10-20 minutes (as long as the respective client continues to run the same
> queries). Apparently it is essential to consider the pattern of requests
> going to the server: in this case there is a continuous stream of very small
> cheap queries, but quite a lot of them per second. Trying to understand the
> reason, I've managed to craft a pure artifical test triggering very similar
> CPU load waves without the need for any specific database at all, easy to
> repeat:
>
> 1. "select localtimestamp" 40 times (As separate requests, one by one, but
> no delay inserted in between)
> 2. wait 1/2 second.
> 3. goto 1

Craig, could this be a side-effect of 519b0757? That's new in 9.5, and
that's directly related to the code paths discussed here.
--
Michael

Re: [GENERAL] Causeless CPU load waves in backend, on windows, 9.5.5(EDB binary).

From

Nikolai Zhubr

Date:

31 January 2017, 10:37:05

31.01.2017 6:43, Michael Paquier wrote:
> On Mon, Jan 30, 2017 at 6:16 PM, Nikolai Zhubr<n-a-zhubr@yandex.ru>  wrote:
>> I'm observing some strange inexplicable effect in 9.5.5 server running on
>> x86 windows (32-bit windows xp sp3).
>
> Oh, well... You are aware that this is out of support by Microsoft, right?

Sure. The box is quite old.
However, at the moment I'm not confident the issue is XP-specific. Not
even really sure it is windows-specific.
And, I can't see any indication that the behaviour of e.g.
WaitForMultipleObjects() changed much from NT 3.51 to Windows 10 anyway.

[...]
>> 1. "select localtimestamp" 40 times (As separate requests, one by one, but
>> no delay inserted in between)
>> 2. wait 1/2 second.
>> 3. goto 1
>
> Craig, could this be a side-effect of 519b0757? That's new in 9.5, and
> that's directly related to the code paths discussed here.

Meanwhile I'll redo my tests using 9.4 instead of 9.5 and report back.

Thank you.

Nikolai

Re: [GENERAL] Causeless CPU load waves in backend, on windows, 9.5.5(EDB binary).

From

Nikolai Zhubr

Date:

31 January 2017, 11:45:33

31.01.2017 10:37, I wrote:
> [...]
>>> 1. "select localtimestamp" 40 times (As separate requests, one by
>>> one, but
>>> no delay inserted in between)
>>> 2. wait 1/2 second.
>>> 3. goto 1
>>
>> Craig, could this be a side-effect of 519b0757? That's new in 9.5, and
>> that's directly related to the code paths discussed here.
>
> Meanwhile I'll redo my tests using 9.4 instead of 9.5 and report back.

Exactly same trouble with 9.4 (Specifically EDB 9.4.10-1 win32)

Would it make sense to check some even older ones?


Thank you.

Nikolai

>
> Thank you.
>
> Nikolai
>
>

Re: [GENERAL] Causeless CPU load waves in backend, on windows, 9.5.5(EDB binary).

From

Andres Freund

Date:

31 January 2017, 19:51:22

On 2017-01-31 11:45:33 +0300, Nikolai Zhubr wrote:
> 31.01.2017 10:37, I wrote:
> > [...]
> > > > 1. "select localtimestamp" 40 times (As separate requests, one by
> > > > one, but
> > > > no delay inserted in between)
> > > > 2. wait 1/2 second.
> > > > 3. goto 1
> > >
> > > Craig, could this be a side-effect of 519b0757? That's new in 9.5, and
> > > that's directly related to the code paths discussed here.
> >
> > Meanwhile I'll redo my tests using 9.4 instead of 9.5 and report back.
>
> Exactly same trouble with 9.4 (Specifically EDB 9.4.10-1 win32)

That's good to know, less because of 519b0757, more because of the latch
changes - but they also went in in 9.5...

> Would it make sense to check some even older ones?

Could you use process monitor or such to see what the process is doing
while using a lot of CPU?

Regards,

Andres

Re: [GENERAL] Causeless CPU load waves in backend, on windows, 9.5.5(EDB binary).

From

Nikolai Zhubr

Date:

01 February 2017, 01:02:11

31.01.2017 19:51, Andres Freund:
[...]
>> Exactly same trouble with 9.4 (Specifically EDB 9.4.10-1 win32)
>
> That's good to know, less because of 519b0757, more because of the latch
> changes - but they also went in in 9.5...
>
>> Would it make sense to check some even older ones?
>
> Could you use process monitor or such to see what the process is doing
> while using a lot of CPU?

I'm not sure how to do this, especially considering that the process in
question is running as a service?

Now, some more input:

* 9.5.2 server running on linux x86_64 - unaffected! (What a relief! We
are moving to Centos soon anyway!)

* 9.4.4 server running on win7 32-bit - affected, same thing as on XP.


Thank you.

Nikolai

>
> Regards,
>
> Andres
>

Re: [GENERAL] Causeless CPU load waves in backend, on windows, 9.5.5(EDB binary).

From

Nikolai Zhubr

Date:

02 February 2017, 02:14:47

01.02.2017 1:02, I wrote:
[...]
>> Could you use process monitor or such to see what the process is doing
>> while using a lot of CPU?
>
> I'm not sure how to do this, especially considering that the process in
> question is running as a service?
>
> Now, some more input:
>
> * 9.5.2 server running on linux x86_64 - unaffected! (What a relief! We
> are moving to Centos soon anyway!)
>
> * 9.4.4 server running on win7 32-bit - affected, same thing as on XP.

I've managed to create a "fix" (see diff below).
It looks like the wait logic is somehow broken on windows currently,
though I can not find the problem myself yet.
It would be great if someone more familiar with the (windows-specific)
code came up with ideas.
I have a build environment ready so I could do more tests then.

--- be-secure.c.orig    2017-02-01 22:37:37.228032608 +0300
+++ be-secure.c    2017-02-01 22:51:17.655751292 +0300
@@ -159,6 +159,7 @@
               * socket to become ready again.
               */
          }
+                Sleep(15); /* n.zhubr */
          goto retry;
      }

@@ -238,6 +239,7 @@
               * socket to become ready again.
               */
          }
+                Sleep(15); /* n.zhubr */
          goto retry;
      }


Thank you.

Nikolai

>
>
> Thank you.
>
> Nikolai
>
>>
>> Regards,
>>
>> Andres
>>
>
>
>

Re: [GENERAL] Causeless CPU load waves in backend, on windows, 9.5.5(EDB binary).

From

Nikolai Zhubr

Date:

03 February 2017, 13:52:26

02.02.2017 2:14, I wrote:
> 01.02.2017 1:02, I wrote:
> [...]
>>> Could you use process monitor or such to see what the process is doing
>>> while using a lot of CPU?
>>
>> I'm not sure how to do this, especially considering that the process in
>> question is running as a service?
>>
>> Now, some more input:
>>
>> * 9.5.2 server running on linux x86_64 - unaffected! (What a relief! We
>> are moving to Centos soon anyway!)
>>
>> * 9.4.4 server running on win7 32-bit - affected, same thing as on XP.
>
> I've managed to create a "fix" (see diff below).
> It looks like the wait logic is somehow broken on windows currently,
> though I can not find the problem myself yet.
> It would be great if someone more familiar with the (windows-specific)
> code came up with ideas.
> I have a build environment ready so I could do more tests then.

Some update.

Adding this "Sleep(15)" before "goto retry" into secure_read() has
apparently eliminated the effect at our production server too. That is,
my load-bug-detector has been quiet for > 24hr or more.

Now by adding more debigging stuff into secure_read() and secure_write()
I've found that:

* secure_write() is likely irrelevant, as "goto retry" there was never
actually hit yet;

* in secure_read(), during the intervals of excessive cpu load,
WaitLatchOrSocket() was never observed to indicate latch event, and was
never observed to (erroneously) indicate socket readiness more than once
(with socket read attempt in between), which I was suspecting happening,
so I can not blame secure_read() itself and this all makes me wonder
even more...

Note: I'm testing with SSL off now.

As always, and hints greatly appreciated!

Thank you.
Nikolai

>
> --- be-secure.c.orig 2017-02-01 22:37:37.228032608 +0300
> +++ be-secure.c 2017-02-01 22:51:17.655751292 +0300
> @@ -159,6 +159,7 @@
> * socket to become ready again.
> */
> }
> + Sleep(15); /* n.zhubr */
> goto retry;
> }
>
> @@ -238,6 +239,7 @@
> * socket to become ready again.
> */
> }
> + Sleep(15); /* n.zhubr */
> goto retry;
> }
>
>
> Thank you.
>
> Nikolai
>
>>
>>
>> Thank you.
>>
>> Nikolai
>>
>>>
>>> Regards,
>>>
>>> Andres
>>>
>>
>>
>>
>
>
>

Re: [GENERAL] Causeless CPU load waves in backend, on windows, 9.5.5(EDB binary).

From

Nikolai Zhubr

Date:

03 February 2017, 16:30:12

03.02.2017 13:52, I wrote:
[...]
> Adding this "Sleep(15)" before "goto retry" into secure_read() has
> apparently eliminated the effect at our production server too. That is,
> my load-bug-detector has been quiet for > 24hr or more.
>
> Now by adding more debigging stuff into secure_read() and secure_write()
> I've found that:
>
> * secure_write() is likely irrelevant, as "goto retry" there was never
> actually hit yet;
>
> * in secure_read(), during the intervals of excessive cpu load,
> WaitLatchOrSocket() was never observed to indicate latch event, and was
> never observed to (erroneously) indicate socket readiness more than once
> (with socket read attempt in between), which I was suspecting happening,
> so I can not blame secure_read() itself and this all makes me wonder
> even more...

Ok, secure_read() is likely irrelevant too.

I think what happened after I inserted "Sleep(15)" into secure_read() is
that this "Sleep(15)" was essentially added into the main "for(;;)" loop
of PostgresMain (through ReadCommand), introducing an artifical
additional CPU relaxation step along with every incoming query and
therefore just masking a real CPU eater.

So probably I'll have to somehow profile this "for(;;)" in PostgresMain.


Thank you.

Nikolai

>
> Note: I'm testing with SSL off now.
>
> As always, and hints greatly appreciated!
>
>
> Thank you.
> Nikolai
>
>>
>> --- be-secure.c.orig 2017-02-01 22:37:37.228032608 +0300
>> +++ be-secure.c 2017-02-01 22:51:17.655751292 +0300
>> @@ -159,6 +159,7 @@
>> * socket to become ready again.
>> */
>> }
>> + Sleep(15); /* n.zhubr */
>> goto retry;
>> }
>>
>> @@ -238,6 +239,7 @@
>> * socket to become ready again.
>> */
>> }
>> + Sleep(15); /* n.zhubr */
>> goto retry;
>> }
>>
>>
>> Thank you.
>>
>> Nikolai
>>
>>>
>>>
>>> Thank you.
>>>
>>> Nikolai
>>>
>>>>
>>>> Regards,
>>>>
>>>> Andres
>>>>
>>>
>>>
>>>
>>
>>
>>
>
>
>

Re: [GENERAL] Causeless CPU load waves in backend, on windows, 9.5.5(EDB binary).

From

Andres Freund

Date:

03 February 2017, 20:29:45

On 2017-02-01 01:02:11 +0300, Nikolai Zhubr wrote:
> 31.01.2017 19:51, Andres Freund:
> [...]
> > > Exactly same trouble with 9.4 (Specifically EDB 9.4.10-1 win32)
> >
> > That's good to know, less because of 519b0757, more because of the latch
> > changes - but they also went in in 9.5...
> >
> > > Would it make sense to check some even older ones?
> >
> > Could you use process monitor or such to see what the process is doing
> > while using a lot of CPU?
>
> I'm not sure how to do this, especially considering that the process in
> question is running as a service?

I don't think that stops you, if you have admin privileges.

Re: [GENERAL] Causeless CPU load waves in backend, on windows, 9.5.5(EDB binary).

From

Andres Freund

Date:

03 February 2017, 22:16:40

On 2017-02-03 22:17:55 +0300, Nikolai Zhubr wrote:
> 03.02.2017 20:29, Andres Freund:
> [...]
> > > > Could you use process monitor or such to see what the process is doing
> > > > while using a lot of CPU?
> > >
> > > I'm not sure how to do this, especially considering that the process in
> > > question is running as a service?
> >
> > I don't think that stops you, if you have admin privileges.
>
> Well, profiling postgres.exe is still beyond my capability at this time
> anyway. Instead, I'll try to prepare a most simple client application
> example for testing the behaviour in question so that anyone could run it
> easily. (And while working on such an example, maybe I'll get some more
> understanding of what is actually going on here.)

I'm missing something. All you need to do is to start
processexplorer.exe and filter out other applications?

Andres

Re: [GENERAL] Causeless CPU load waves in backend, on windows, 9.5.5(EDB binary).

From

Nikolai Zhubr

Date:

03 February 2017, 22:17:55

03.02.2017 20:29, Andres Freund:
[...]
>>> Could you use process monitor or such to see what the process is doing
>>> while using a lot of CPU?
>>
>> I'm not sure how to do this, especially considering that the process in
>> question is running as a service?
>
> I don't think that stops you, if you have admin privileges.

Well, profiling postgres.exe is still beyond my capability at this time
anyway. Instead, I'll try to prepare a most simple client application
example for testing the behaviour in question so that anyone could run
it easily. (And while working on such an example, maybe I'll get some
more understanding of what is actually going on here.)


Thank you.

Nikolai


>

Re: [GENERAL] Causeless CPU load waves in backend, on windows, 9.5.5(EDB binary).

From

Nikolai Zhubr

Date:

05 February 2017, 17:10:47

03.02.2017 22:16, Andres Freund:
[...]
>> Well, profiling postgres.exe is still beyond my capability at this time
>> anyway. Instead, I'll try to prepare a most simple client application
>> example for testing the behaviour in question so that anyone could run it
>> easily. (And while working on such an example, maybe I'll get some more
>> understanding of what is actually going on here.)
>
> I'm missing something. All you need to do is to start
> processexplorer.exe and filter out other applications?

Ah, ok. I first thought about some kind of code-level debugging.

And yes, running Process Explorer gave some new and unexpected input.
During the period of this strange high load it claims 40% CPU is used by
interrupts (normally 0.01%) and 3% used by backend postgres.exe
(normally approx 0%). I'd guess this means some problem happening in the
OS (which hosts this postgres.exe), probably related to network
communication? (Because nothing else seems likely related to interrupts
in such scenario?)

The screenshot: https://yadi.sk/i/hC8FMZCE3CyzPs


Thank you.

Nikolai


>
> Andres
>

Re: [GENERAL] Causeless CPU load waves in backend, on windows, 9.5.5(EDB binary).

From

Nikolai Zhubr

Date:

05 February 2017, 22:05:27

05.02.2017 17:10, I wrote:
[...]
> And yes, running Process Explorer gave some new and unexpected input.
> During the period of this strange high load it claims 40% CPU is used by
> interrupts (normally 0.01%) and 3% used by backend postgres.exe
> (normally approx 0%). I'd guess this means some problem happening in the
> OS (which hosts this postgres.exe), probably related to network
> communication? (Because nothing else seems likely related to interrupts
> in such scenario?)

Additionally, I've now got a kernrate viewer reports, one for high load,
and one for normal load period, attached below. Here, tcpip shows some
more kernel activity during the problematic period, but the difference
is not so huge.

Time   36348 hits, 25000 events per hit --------
  Module                            Hits   msec  %Total  Events/Sec
intelppm                          35048      45640    96 %    19198071
hal                                1030      45640     2 %      564198
ntkrnlpa                            204      45640     0 %      111744
tcpip                                20      45640     0 %       10955
win32k                               18      45640     0 %        9859
afd                                   6      45640     0 %        3286
ipnat                                 6      45640     0 %        3286
NDIS                                  4      45640     0 %        2191
......

Time   37227 hits, 25000 events per hit --------
  Module                            Hits   msec  %Total  Events/Sec
intelppm                          35856      46828    96 %    19142393
hal                                1089      46828     2 %      581382
ntkrnlpa                            229      46828     0 %      122255
win32k                               29      46828     0 %       15482
tcpip                                 9      46828     0 %        4804
NDIS                                  4      46828     0 %        2135
afd                                   3      46828     0 %        1601
psched                                3      46828     0 %        1601
ipnat                                 2      46828     0 %        1067


Thank you.

Nikolai

>
> The screenshot: https://yadi.sk/i/hC8FMZCE3CyzPs
>
>
> Thank you.
>
> Nikolai
>
>
>>
>> Andres
>>
>

Re: [GENERAL] Causeless CPU load waves in backend, on windows, 9.5.5(EDB binary).

From

Nikolai Zhubr

Date:

08 February 2017, 15:50:48

05.02.2017 22:05, I wrote:
[...]
>> And yes, running Process Explorer gave some new and unexpected input.
>> During the period of this strange high load it claims 40% CPU is used by
>> interrupts (normally 0.01%) and 3% used by backend postgres.exe
>> (normally approx 0%). I'd guess this means some problem happening in the
>> OS (which hosts this postgres.exe), probably related to network
>> communication? (Because nothing else seems likely related to interrupts
>> in such scenario?)

Ok, I've got a working example as simple as the following:

do {
     for (i=0; i<40; i++) {
    Sleep(15);
    res = PQexec(conn, "SELECT localtimestamp ");
    if (PQresultStatus(res) != PGRES_TUPLES_OK) return 1;
    PQclear(res);
     }
     fprintf(stdout, "*");
     Sleep(350);
} while(1);

Here, Sleep(15) and Sleep(350) are very important, removing or changing
them can cause the effect to disappear.
Looks like some timing-sensitive issue in windows TCP/IP implementation?
But then, how can it be that no-one have noticed it yet? Puzzling.


Thank you.

Regards,
Nikolai

>
> Additionally, I've now got a kernrate viewer reports, one for high load,
> and one for normal load period, attached below. Here, tcpip shows some
> more kernel activity during the problematic period, but the difference
> is not so huge.
>
> Time 36348 hits, 25000 events per hit --------
> Module Hits msec %Total Events/Sec
> intelppm 35048 45640 96 % 19198071
> hal 1030 45640 2 % 564198
> ntkrnlpa 204 45640 0 % 111744
> tcpip 20 45640 0 % 10955
> win32k 18 45640 0 % 9859
> afd 6 45640 0 % 3286
> ipnat 6 45640 0 % 3286
> NDIS 4 45640 0 % 2191
> ......
>
> Time 37227 hits, 25000 events per hit --------
> Module Hits msec %Total Events/Sec
> intelppm 35856 46828 96 % 19142393
> hal 1089 46828 2 % 581382
> ntkrnlpa 229 46828 0 % 122255
> win32k 29 46828 0 % 15482
> tcpip 9 46828 0 % 4804
> NDIS 4 46828 0 % 2135
> afd 3 46828 0 % 1601
> psched 3 46828 0 % 1601
> ipnat 2 46828 0 % 1067
>
>
> Thank you.
>
> Nikolai
>
>>
>> The screenshot: https://yadi.sk/i/hC8FMZCE3CyzPs
>>
>>
>> Thank you.
>>
>> Nikolai
>>
>>
>>>
>>> Andres
>>>
>>
>
>
>

Re: [GENERAL] Causeless CPU load waves in backend, on windows, 9.5.5(EDB binary).

From

Nikolai Zhubr

Date:

12 February 2017, 20:23:25

Hello all,

In order to locate the problem more precisely, I'd like to prepare a
test, involving some ping-like communication between the server and a
test client. That is, I'd like to repeatedly send something valid to the
server and get some valid replies from it, but without any kind of real
activity happening on the server. I've looked through the main loop in
PostgresMain() but could not find any suitable candidates.

Any thoughts?

Thank you.

Nikolai

03.02.2017 16:30, I wrote:
[...]
> Ok, secure_read() is likely irrelevant too.
>
> I think what happened after I inserted "Sleep(15)" into secure_read() is
> that this "Sleep(15)" was essentially added into the main "for(;;)" loop
> of PostgresMain (through ReadCommand), introducing an artifical
> additional CPU relaxation step along with every incoming query and
> therefore just masking a real CPU eater.
>
> So probably I'll have to somehow profile this "for(;;)" in PostgresMain.
>
>
> Thank you.
>
> Nikolai
>

Re: [GENERAL] Causeless CPU load waves in backend, on windows, 9.5.5(EDB binary).

From

"Rader, David"

Date:

13 February 2017, 23:58:54

How about using pg_isready?

https://www.postgresql.org/docs/current/static/app-pg-isready.html

David Rader

davidr@openscg.com

On Sun, Feb 12, 2017 at 12:23 PM, Nikolai Zhubr <n-a-zhubr@yandex.ru> wrote:

Hello all,

In order to locate the problem more precisely, I'd like to prepare a test, involving some ping-like communication between the server and a test client. That is, I'd like to repeatedly send something valid to the server and get some valid replies from it, but without any kind of real activity happening on the server. I've looked through the main loop in PostgresMain() but could not find any suitable candidates.

Any thoughts?

Thank you.

Nikolai

03.02.2017 16:30, I wrote:
[...]
Ok, secure_read() is likely irrelevant too.

I think what happened after I inserted "Sleep(15)" into secure_read() is
that this "Sleep(15)" was essentially added into the main "for(;;)" loop
of PostgresMain (through ReadCommand), introducing an artifical
additional CPU relaxation step along with every incoming query and
therefore just masking a real CPU eater.

So probably I'll have to somehow profile this "for(;;)" in PostgresMain.

Thank you.

Nikolai

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] Causeless CPU load waves in backend, on windows, 9.5.5(EDB binary).

From

Nikolai Zhubr

Date:

14 February 2017, 01:03:41

13.02.2017 23:58, Rader, David:
> How about using pg_isready?
> https://www.postgresql.org/docs/current/static/app-pg-isready.html

But it doesn't actually communicate with the server AFAIK, just checks
if a connection could be established?

Maybe I should have been more specific.
What I need is debugging/profiling pure communication side of server
operation, implying huge lots of requests and replies going over the
wire to and from the server within some continued (valid) session, but
so that the server is not actually doing anything above that (no sql, no
locking, no synchronizing, zero usefull activity, just pumping network I/O)


Regards,
Nikolai

>
> --
> David Rader
> davidr@openscg.com <mailto:davidr@openscg.com>
>
> On Sun, Feb 12, 2017 at 12:23 PM, Nikolai Zhubr <n-a-zhubr@yandex.ru
> <mailto:n-a-zhubr@yandex.ru>> wrote:
>
>     Hello all,
>
>     In order to locate the problem more precisely, I'd like to prepare a
>     test, involving some ping-like communication between the server and
>     a test client. That is, I'd like to repeatedly send something valid
>     to the server and get some valid replies from it, but without any
>     kind of real activity happening on the server. I've looked through
>     the main loop in PostgresMain() but could not find any suitable
>     candidates.
>
>     Any thoughts?
>
>     Thank you.
>
>     Nikolai
>
>
>     03.02.2017 16:30, I wrote:
>     [...]
>
>         Ok, secure_read() is likely irrelevant too.
>
>         I think what happened after I inserted "Sleep(15)" into
>         secure_read() is
>         that this "Sleep(15)" was essentially added into the main
>         "for(;;)" loop
>         of PostgresMain (through ReadCommand), introducing an artifical
>         additional CPU relaxation step along with every incoming query and
>         therefore just masking a real CPU eater.
>
>         So probably I'll have to somehow profile this "for(;;)" in
>         PostgresMain.
>
>
>         Thank you.
>
>         Nikolai

[GENERAL] Re: Causeless CPU load waves in backend, on windows, 9.5.5 (EDBbinary).

From

Thomas Kellerer

Date:

14 February 2017, 01:10:09

Nikolai Zhubr schrieb am 13.02.2017 um 23:03:
> Maybe I should have been more specific.
> What I need is debugging/profiling pure communication side of server
> operation, implying huge lots of requests and replies going over the
> wire to and from the server within some continued (valid) session,
> but so that the server is not actually doing anything above that (no
> sql, no locking, no synchronizing, zero usefull activity, just
> pumping network I/O)
>

If you are willing to drop the "no sql" requirement you could use something like

     select rpad('*', 100000000, '*');

this will send a lot of data over the wire, the SQL overhead should be fairly small.

You can send more data if you combine that with e.g. generate_series()

Re: [GENERAL] Re: Causeless CPU load waves in backend, on windows,9.5.5 (EDB binary).

From

Scott Mead

Date:

14 February 2017, 03:00:55

On Mon, Feb 13, 2017 at 5:10 PM, Thomas Kellerer <spam_eater@gmx.net> wrote:

Nikolai Zhubr schrieb am 13.02.2017 um 23:03:
Maybe I should have been more specific.
What I need is debugging/profiling pure communication side of server
operation, implying huge lots of requests and replies going over the
wire to and from the server within some continued (valid) session,
but so that the server is not actually doing anything above that (no
sql, no locking, no synchronizing, zero usefull activity, just
pumping network I/O)

My personal favorite is 'select 1'

I know you're not looking for SQL, but, you gotta start somewhere...

If you are willing to drop the "no sql" requirement you could use something like

select rpad('*', 100000000, '*');

this will send a lot of data over the wire, the SQL overhead should be fairly small.

You can send more data if you combine that with e.g. generate_series()

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

--
Scott Mead
Sr. Architect
OpenSCG

http://openscg.com

Re: [GENERAL] Causeless CPU load waves in backend, on windows, 9.5.5(EDB binary).

From

Michael Paquier

Date:

14 February 2017, 05:26:31

On Tue, Feb 14, 2017 at 7:03 AM, Nikolai Zhubr <n-a-zhubr@yandex.ru> wrote:
> 13.02.2017 23:58, Rader, David:
>>
>> How about using pg_isready?
>> https://www.postgresql.org/docs/current/static/app-pg-isready.html
>
> But it doesn't actually communicate with the server AFAIK, just checks if a
> connection could be established?

Not exactly, it checks also if the server is able to accept
connections or refusing them, after establishing the connection. The
real deal is in that. For example it is possible to use pg_isready as
well on a hot standby refusing connections because it did not reach
yet a consistent point.
--
Michael

Re: [GENERAL] Re: Causeless CPU load waves in backend, on windows,9.5.5 (EDB binary).

From

Nikolai Zhubr

Date:

14 February 2017, 10:03:01

14.02.2017 1:10, Thomas Kellerer:
> Nikolai Zhubr schrieb am 13.02.2017 um 23:03:
>> Maybe I should have been more specific.
>> What I need is debugging/profiling pure communication side of server
>> operation, implying huge lots of requests and replies going over the
>> wire to and from the server within some continued (valid) session,
>> but so that the server is not actually doing anything above that (no
>> sql, no locking, no synchronizing, zero usefull activity, just
>> pumping network I/O)
>>
>
> If you are willing to drop the "no sql" requirement you could use
> something like
>
> select rpad('*', 100000000, '*');
>
> this will send a lot of data over the wire, the SQL overhead should be
> fairly small.

Well yes, but I've been there already.
Now I'd like to locate a CPU eater more precisely - supposedly there is
some issue with communication, that is why I don't want to mix in
anything else.

Anyway, I've now got pretty sure the standard protocol out of the box
does not provide such short-circuite capability so I'll have to hack it
a bit.


Regards,
Nikolai

>
> You can send more data if you combine that with e.g. generate_series()
>

Re: [GENERAL] Re: Causeless CPU load waves in backend, on windows,9.5.5 (EDB binary).

From

John R Pierce

Date:

14 February 2017, 12:47:35

On 2/13/2017 11:03 PM, Nikolai Zhubr wrote:
> Now I'd like to locate a CPU eater more precisely - supposedly there
> is some issue with communication, that is why I don't want to mix in
> anything else.

use iperf to test the network transport layer, without any postgres in
the loop?


--
john r pierce, recycling bits in santa cruz

Re: [GENERAL] Re: Causeless CPU load waves in backend, on windows,9.5.5 (EDB binary).

From

Nikolai Zhubr

Date:

14 February 2017, 13:28:44

14.02.2017 12:47, John R Pierce:
> On 2/13/2017 11:03 PM, Nikolai Zhubr wrote:
>> Now I'd like to locate a CPU eater more precisely - supposedly there
>> is some issue with communication, that is why I don't want to mix in
>> anything else.
>
> use iperf to test the network transport layer, without any postgres in
> the loop?

No, the network performance in general is fine. However, some specific
continuous communication pattern causes something to go very wrong on
windows periodically. (And there is no such problem on linux btw) While
the issue now seems rather related to windows itself and not postgres,
but the pattern typical for some of my postgres uses is apparently a
reliable trigger. Therefore, I'm thinking to first isolate the problem
inside postgres, and if it still triggers then, second step, create a
test "server" consisting of communication code of postgres to run
without any other parts of postgres.

Regards,
Nikolai

>
>

Re: [GENERAL] Re: Causeless CPU load waves in backend, on windows,9.5.5 (EDB binary).

From

"Rader, David"

Date:

14 February 2017, 18:15:05

David Rader

davidr@openscg.com

On Tue, Feb 14, 2017 at 5:28 AM, Nikolai Zhubr <n-a-zhubr@yandex.ru> wrote:

14.02.2017 12:47, John R Pierce:
On 2/13/2017 11:03 PM, Nikolai Zhubr wrote:
Now I'd like to locate a CPU eater more precisely - supposedly there
is some issue with communication, that is why I don't want to mix in
anything else.

use iperf to test the network transport layer, without any postgres in
the loop?

No, the network performance in general is fine. However, some specific continuous communication pattern causes something to go very wrong on windows periodically. (And there is no such problem on linux btw) While the issue now seems rather related to windows itself and not postgres, but the pattern typical for some of my postgres uses is apparently a reliable trigger. Therefore, I'm thinking to first isolate the problem inside postgres, and if it still triggers then, second step, create a test "server" consisting of communication code of postgres to run without any other parts of postgres.

Regards,
Nikolai

Try the libpq call from pg_isready. It does actually make a round trip to the postgres server and asks the server if it is ready to accept connections. So you are running the socket communication code of postgres and a small bit of "status" check but not any sql.

from mirror: https://github.com/postgres/postgres/blob/master/src/bin/scripts/pg_isready.c

relevant call is PQping:

 rv = PQpingParams(keywords, values, 1);

Re: [GENERAL] Re: Causeless CPU load waves in backend, on windows,9.5.5 (EDB binary).

From

Nikolai Zhubr

Date:

15 February 2017, 00:06:11

14.02.2017 18:15, Rader, David:
[...]
> Try the libpq call from pg_isready. It does actually make a round trip
> to the postgres server and asks the server if it is ready to accept
> connections. So you are running the socket communication code of
> postgres and a small bit of "status" check but not any sql.

Indeed, such function is available. But essentially, this function is a
(kind of) combined login+logout, therefore it would not work for my
purpose. (Despite its name, it can not be used to perform some
communication "ping" within an established session, such thing simply
does not exist in the protocol, at least as of 9.5)


Regards,
Nikolai

>
> from mirror:
> https://github.com/postgres/postgres/blob/master/src/bin/scripts/pg_isready.c
>
> relevant call is PQping:
>
> rv = PQpingParams(keywords, values, 1);
>
>
>

Re: [GENERAL] Re: Causeless CPU load waves in backend, on windows,9.5.5 (EDB binary).

From

Nikolai Zhubr

Date:

27 February 2017, 10:08:34

15.02.2017 0:06, I wrote:
[...]
> Indeed, such function is available. But essentially, this function is a
> (kind of) combined login+logout, therefore it would not work for my
> purpose. (Despite its name, it can not be used to perform some
> communication "ping" within an established session, such thing simply
> does not exist in the protocol, at least as of 9.5)

Ok, I've made some additions to the protocol to be able to do pings to a
real server, some standalone tests mimicing postgresql communication and
employing RDTSC, and run rediculously extensive experiments. And my
findings to the moment is that the reason for CPU load issue is a wierd
behaviour of windows' core functions, specifically
WaitForMultipleObjects (and apparently the whole family of Wait
functions) which is used in postgresql backend for waiting.

So, what I've observed is that Wait* functions _usually_ go to sleep
nicely when the state is not signalled, but _sometimes_, depending on
unknown criteria, it can choose to instead do a busy-loop wait or
something CPU-expensive. Maybe it tries to optimize the delay, or maybe
it is a bug. The effect somewhat varies depending on windows version,
CPU cores, selected system timer frequency, and Wait* call pattern
(frequency).

I can not currently see how it can be fixed in a generic and reliable
way in postgres without sacrificing some performance. On the other hand,
leaving it as-is is exposing the server to a substantial CPU abuse (even
unintentional, as it initially happened in my case).

Regards,
Nikolai

>
>
> Regards,
> Nikolai
>
>>
>> from mirror:
>> https://github.com/postgres/postgres/blob/master/src/bin/scripts/pg_isready.c
>>
>>
>> relevant call is PQping:
>>
>> rv = PQpingParams(keywords, values, 1);
>>
>>
>>
>
>
>

Re: [GENERAL] Re: Causeless CPU load waves in backend, on windows,9.5.5 (EDB binary).

From

Nikolai Zhubr

Date:

02 March 2017, 00:43:30

27.02.2017 10:08, I wrote:
[...]
> So, what I've observed is that Wait* functions _usually_ go to sleep
> nicely when the state is not signalled, but _sometimes_, depending on
> unknown criteria, it can choose to instead do a busy-loop wait or
> something CPU-expensive. Maybe it tries to optimize the delay, or maybe
> it is a bug. The effect somewhat varies depending on windows version,
> CPU cores, selected system timer frequency, and Wait* call pattern
> (frequency).
>
> I can not currently see how it can be fixed in a generic and reliable
> way in postgres without sacrificing some performance. On the other hand,
> leaving it as-is is exposing the server to a substantial CPU abuse (even
> unintentional, as it initially happened in my case).

Ok, I've finally found what happened. False alarm, in short. The server
is perfectly fine.

My mistake was that I was using most basic ways of monitoring CPU load,
i.e. provided by GetProcessTimes() and taskmgr.exe and such. And because
they only operate at the scheduler granularity (usually 16ms approx),
some rounding has to happen in cpu consumption calculation. With this
approach, this rounding can apparently distort the measurement
dramatically (in some corner cases).

Now, xperf employs a more fair approach to cpu consumption measurement,
based on all the exact times of process switching-in and -out, which
apparently gives a much better result.

So, in my case the results of taskmgr and xperf are very different, and
considering the above, taskmgr's report should be dismissed as bogus.
And xperf indicates that my stress-tests actually do not cause postgres
backend process to consume any noticable cpu percentage al all. That's good.

Regards,
Nikolai

>
>
> Regards,
> Nikolai