Thread: Performance degradation in commit ac1d794

Performance degradation in commit ac1d794

From
Васильев Дмитрий
Date:
Hello hackers!

I suddenly found commit ac1d794 gives up to 3 times performance degradation.

I tried to run pgbench -s 1000 -j 48 -c 48 -S -M prepared on 70 CPU-core machine:
commit ac1d794 gives me 363,474 tps
and previous commit a05dc4d gives me 956,146
and master( 3d0c50f ) with revert ac1d794 gives me 969,265

it's shocking

---
Dmitry Vasilyev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Re: Performance degradation in commit ac1d794

From
Andres Freund
Date:
On December 25, 2015 6:08:15 PM GMT+01:00, "Васильев Дмитрий" <d.vasilyev@postgrespro.ru> wrote:
>Hello hackers!
>
>I suddenly found commit ac1d794 gives up to 3 times performance
>degradation.
>
>I tried to run pgbench -s 1000 -j 48 -c 48 -S -M prepared on 70
>CPU-core
>machine:
>commit ac1d794 gives me 363,474 tps
>and previous commit a05dc4d gives me 956,146
>and master( 3d0c50f ) with revert ac1d794 gives me 969,265
>
>it's shocking



Hi,

You're talking about
http://git.postgresql.org/gitweb/?p=postgresql.git;a=blobdiff;f=src/backend/libpq/be-secure.c;h=2ddcf428f89fd12c230d6f417c2f707fbd97bf39;hp=26d8faaf773a818b388b899b8d83d617bdf7af9b;hb=ac1d794;hpb=a05dc4d7fd57d4ae084c1f0801973e5c1a1aa26e

If so, could you provide a hierarchical before/after profile?

Andres

---
Please excuse brevity and formatting - I am writing this on my mobile phone.



Re: Performance degradation in commit ac1d794

From
Васильев Дмитрий
Date:

2015-12-25 20:18 GMT+03:00 Andres Freund <andres@anarazel.de>:
> [...]

Yes, about this.

> If so, could you provide a hierarchical before/after profile?

Performance | Git hash commit                          | Date
~ 360k tps  | c3e7c24a1d60dc6ad56e2a0723399f1570c54224 | Thu Nov 12 09:12:18 2015 -0500
~ 360k tps  | ac1d7945f866b1928c2554c0f80fd52d7f977772 | Thu Nov 12 09:00:33 2015 -0500
~ 960k tps  | a05dc4d7fd57d4ae084c1f0801973e5c1a1aa26e | Thu Nov 12 07:40:31 2015 -0500


---
Dmitry Vasilyev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Re: Performance degradation in commit ac1d794

From
Васильев Дмитрий
Date:
2015-12-25 20:27 GMT+03:00 Васильев Дмитрий <d.vasilyev@postgrespro.ru>:

> [...]
>
> Performance | Git hash commit                          | Date
> ~ 360k tps  | c3e7c24a1d60dc6ad56e2a0723399f1570c54224 | Thu Nov 12 09:12:18 2015 -0500
> ~ 360k tps  | ac1d7945f866b1928c2554c0f80fd52d7f977772 | Thu Nov 12 09:00:33 2015 -0500
> ~ 960k tps  | a05dc4d7fd57d4ae084c1f0801973e5c1a1aa26e | Thu Nov 12 07:40:31 2015 -0500


I came across it by studying the results:

vanilla 9.5 = 30020c3fc3b6de5592978313df929d370f5770ce
vanilla 9.6 = c4a8812cf64b142685e39a69694c5276601f40e4

         info          | clients |   tps
-----------------------+---------+---------
 vanilla 9.5           |       1 |   30321
 vanilla 9.5           |       8 |  216542
 vanilla 9.5           |      16 |  412526
 vanilla 9.5           |      32 |  751331
 vanilla 9.5           |      48 |  956146   <- this point
 vanilla 9.5           |      56 |  990122
 vanilla 9.5           |      64 |  842436
 vanilla 9.5           |      72 |  913272
 vanilla 9.5           |      82 |  659332
 vanilla 9.5           |      92 |  630111
 vanilla 9.5           |      96 |  616863
 vanilla 9.5           |     110 |  592080
 vanilla 9.5           |     120 |  575831
 vanilla 9.5           |     130 |  557521
 vanilla 9.5           |     140 |  537951
 vanilla 9.5           |     150 |  517019
 vanilla 9.5           |     160 |  502312
 vanilla 9.5           |     170 |  489162
 vanilla 9.5           |     180 |  477178
 vanilla 9.5           |     190 |  464620
 vanilla 9.6           |       1 |   31738
 vanilla 9.6           |       8 |  219692
 vanilla 9.6           |      16 |  422933
 vanilla 9.6           |      32 |  375546
 vanilla 9.6           |      48 |  363474   <- this point
 vanilla 9.6           |      56 |  352943
 vanilla 9.6           |      64 |  334498
 vanilla 9.6           |      72 |  369802
 vanilla 9.6           |      82 |  604867
 vanilla 9.6           |      92 |  871048
 vanilla 9.6           |      96 |  969265
 vanilla 9.6           |     105 |  996794
 vanilla 9.6           |     110 |  932853
 vanilla 9.6           |     115 |  758485
 vanilla 9.6           |     120 |  721365
 vanilla 9.6           |     125 |  632265
 vanilla 9.6           |     130 |  624666
 vanilla 9.6           |     135 |  582120
 vanilla 9.6           |     140 |  583080
 vanilla 9.6           |     150 |  555608
 vanilla 9.6           |     160 |  533340
 vanilla 9.6           |     170 |  520308
 vanilla 9.6           |     180 |  504536
 vanilla 9.6           |     190 |  496967

---
Dmitry Vasilyev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


Re: Performance degradation in commit ac1d794

From
Andres Freund
Date:
On December 25, 2015 6:27:06 PM GMT+01:00, "Васильев Дмитрий" <d.vasilyev@postgrespro.ru> wrote:
>> If so, could you provide a hierarchical before/after profile?
>
> Performance | Git hash commit                          | Date
> ~ 360k tps  | c3e7c24a1d60dc6ad56e2a0723399f1570c54224 | Thu Nov 12 09:12:18 2015 -0500
> ~ 360k tps  | ac1d7945f866b1928c2554c0f80fd52d7f977772 | Thu Nov 12 09:00:33 2015 -0500
> ~ 960k tps  | a05dc4d7fd57d4ae084c1f0801973e5c1a1aa26e | Thu Nov 12 07:40:31 2015 -0500

Profile as in perf, oprofile or something.

--- 
Please excuse brevity and formatting - I am writing this on my mobile phone.



Re: Performance degradation in commit ac1d794

From
Васильев Дмитрий
Date:

2015-12-25 20:44 GMT+03:00 Andres Freund <andres@anarazel.de>:
> [...]
>
> Profile as in perf, oprofile or something.

ac1d794:

Samples: 1M of event 'cycles', Event count (approx.): 816922259995, UID: pgpro

Overhead  Shared Object       Symbol
  69,72%  [kernel]            [k] _raw_spin_lock_irqsave
   1,43%  postgres            [.] _bt_compare
   1,19%  postgres            [.] LWLockAcquire
   0,99%  postgres            [.] hash_search_with_hash_value
   0,61%  postgres            [.] PinBuffer
   0,46%  postgres            [.] GetSnapshotData

a05dc4d:

Samples: 1M of event 'cycles', Event count (approx.): 508150718694, UID: pgpro

Overhead  Shared Object       Symbol
   4,77%  postgres            [.] GetSnapshotData
   4,30%  postgres            [.] _bt_compare
   3,13%  postgres            [.] hash_search_with_hash_value
   3,08%  postgres            [.] LWLockAcquire
   2,09%  postgres            [.] LWLockRelease
   2,03%  postgres            [.] PinBuffer


Perf record generates a lot of data:

time perf record -u pgpro -g --call-graph=dwarf
^C[ perf record: Woken up 0 times to write data ]
Warning:
Processed 1078453 events and lost 18257 chunks!
Check IO/CPU overload!
[ perf record: Captured and wrote 8507.985 MB perf.data (1055663 samples) ]
real    0m8.791s
user    0m0.678s
sys     0m8.120s



If you want, I can give you ssh access.

Re: Performance degradation in commit ac1d794

From
Tom Lane
Date:
Васильев Дмитрий <d.vasilyev@postgrespro.ru> writes:
> Samples: 1M of event 'cycles', Event count (approx.): 816922259995, UID: pgpro
>   Overhead  Shared Object       Symbol

>   69,72%  [kernel]            [k] _raw_spin_lock_irqsave
>    1,43%  postgres            [.] _bt_compare
>    1,19%  postgres            [.] LWLockAcquire
>    0,99%  postgres            [.] hash_search_with_hash_value
>    0,61%  postgres            [.] PinBuffer

Seems like what you've got here is a kernel bug.
        regards, tom lane



Re: Performance degradation in commit ac1d794

From
Andres Freund
Date:
On December 25, 2015 7:10:23 PM GMT+01:00, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>Васильев Дмитрий <d.vasilyev@postgrespro.ru> writes:
>> Samples: 1M of event 'cycles', Event count (approx.): 816922259995, UID: pgpro
>>   Overhead  Shared Object       Symbol
>
>>   69,72%  [kernel]            [k] _raw_spin_lock_irqsave
>>    1,43%  postgres            [.] _bt_compare
>>    1,19%  postgres            [.] LWLockAcquire
>>    0,99%  postgres            [.] hash_search_with_hash_value
>>    0,61%  postgres            [.] PinBuffer
>
>Seems like what you've got here is a kernel bug.

I wouldn't go as far as calling it a kernel bug. We're still doing 300k tps. And we're triggering the performance
degradation by adding another socket (IIRC) to the poll(2) call.

It would certainly be interesting to see the expanded tree below the spinlock. I wonder if this is related to directed
wakeups.

Andres

--- 
Please excuse brevity and formatting - I am writing this on my mobile phone.



Re: Performance degradation in commit ac1d794

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On December 25, 2015 7:10:23 PM GMT+01:00, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Seems like what you've got here is a kernel bug.

> I wouldn't go as far as calling it a kernel bug. We're still doing 300k tps. And we're triggering the performance
> degradation by adding another socket (IIRC) to the poll(2) call.

Hmm.  And all those FDs point to the same pipe.  I wonder if we're looking
at contention for some pipe-related data structure inside the kernel.
        regards, tom lane



Re: Performance degradation in commit ac1d794

From
Васильев Дмитрий
Date:


2015-12-25 21:28 GMT+03:00 Tom Lane <tgl@sss.pgh.pa.us>:
> Hmm.  And all those FDs point to the same pipe.  I wonder if we're looking
> at contention for some pipe-related data structure inside the kernel.

I ran bt on the backends and found them in the following state:

#0  0x00007f77b0e5bb60 in __poll_nocancel () from /lib64/libc.so.6
#1  0x00000000006a7cd0 in WaitLatchOrSocket (latch=0x7f779e2e96c4, wakeEvents=wakeEvents@entry=19, sock=9, timeout=timeout@entry=0) at pg_latch.c:333
#2  0x0000000000612c7d in secure_read (port=0x17e6af0, ptr=0xcc94a0 <PqRecvBuffer>, len=8192) at be-secure.c:147
#3  0x000000000061be36 in pq_recvbuf () at pqcomm.c:915
#4  pq_getbyte () at pqcomm.c:958
#5  0x0000000000728ad5 in SocketBackend (inBuf=0x7ffd8b6b1460) at postgres.c:345
Perf shows _raw_spin_lock_irqsave call remove_wait_queue add_wait_queue


---
Dmitry Vasilyev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Re: Performance degradation in commit ac1d794

From
Васильев Дмитрий
Date:

2015-12-25 22:42 GMT+03:00 Васильев Дмитрий <d.vasilyev@postgrespro.ru>:

> I ran bt on the backends and found them in the following state:
> [...]
> Perf shows _raw_spin_lock_irqsave call remove_wait_queue add_wait_queue


I'm sorry, I meant that the remove_wait_queue and add_wait_queue functions call _raw_spin_lock_irqsave, and that is what takes most of the CPU time.

uname -a: 3.10.0-229.20.1.el7.x86_64 #1 SMP Tue Nov 3 19:10:07 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux


---
Dmitry Vasilyev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Re: Performance degradation in commit ac1d794

From
Andres Freund
Date:
On 2015-12-25 13:28:55 -0500, Tom Lane wrote:
> Hmm.  And all those FDs point to the same pipe.  I wonder if we're looking
> at contention for some pipe-related data structure inside the kernel.

Sounds fairly likely - and not too surprising. In this scenario we've a
couple 100k registrations/unregistrations to a pretty fundamentally
shared resource (the wait queue for changes to the pipe). Not that
surprising that it becomes a problem.

There's a couple solutions I can think of to that problem:

1) Use epoll()/kqueue, or other similar interfaces that don't require
   re-registering fds at every invocation. My guess is that that'd be
   desirable for performance anyway.

2) Create a pair of fds between postmaster/backend for each backend.
   While obviously increasing the number of FDs noticeably, it's
   interesting for other features as well: If we ever want to do FD
   passing from postmaster to existing backends, we're going to need
   that anyway.

3) Replace the postmaster_alive_fds socketpair by some other signalling
   mechanism. E.g. sending a procsignal to each backend, which sets the
   latch and a special flag in the latch structure.

Andres



Re: Performance degradation in commit ac1d794

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> There's a couple solutions I can think of to that problem:
> 1) Use epoll()/kqueue, or other similar interfaces that don't require
>    re-registering fds at every invocation. My guess is that that'd be
>    desirable for performance anyway.

Portability, on the other hand, would be problematic.

> 2) Create a pair of fds between postmaster/backend for each
>    backend. While obviously increasing the number of FDs noticeably,
>    it's interesting for other features as well: If we ever want to do FD
>    passing from postmaster to existing backends, we're going to need
>    that anyway.

Maybe; it'd provide another limit on how many backends we could run.

> 3) Replace the postmaster_alive_fds socketpair by some other signalling
>    mechanism. E.g. sending a procsignal to each backend, which sets the
>    latch and a special flag in the latch structure.

And what would send the signal?  The entire point here is to notice the
situation where the postmaster has crashed.  It can *not* depend on the
postmaster taking some action.
        regards, tom lane



Re: Performance degradation in commit ac1d794

From
Andres Freund
Date:
On 2015-12-25 16:29:53 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > There's a couple solutions I can think of to that problem:
> > 1) Use epoll()/kqueue, or other similar interfaces that don't require
> >    re-registering fds at every invocation. My guess is that that'd be
> >    desirable for performance anyway.
> 
> Portability, on the other hand, would be problematic.

Indeed. But we might be able to get away with it because there's
realistically just one platform on which people run four-socket
servers. Obviously we'd leave poll and select support in place.  It'd be
a genuine improvement for less extreme loads on linux, too.

> > 3) Replace the postmaster_alive_fds socketpair by some other signalling
> >    mechanism. E.g. sending a procsignal to each backend, which sets the
> >    latch and a special flag in the latch structure.
> 
> And what would send the signal?  The entire point here is to notice the
> situation where the postmaster has crashed.  It can *not* depend on the
> postmaster taking some action.

Ahem. Um. Look, over there --->

I blame it on all the food.


Andres



Re: Performance degradation in commit ac1d794

From
Andres Freund
Date:
On 2015-12-26 12:22:48 +0100, Andres Freund wrote:
> > > 3) Replace the postmaster_alive_fds socketpair by some other signalling
> > >    mechanism. E.g. sending a procsignal to each backend, which sets the
> > >    latch and a special flag in the latch structure.
> > 
> > And what would send the signal?  The entire point here is to notice the
> > situation where the postmaster has crashed.  It can *not* depend on the
> > postmaster taking some action.
> 
> Ahem. Um. Look, over there --->
> 
> I blame it on all the food.

An unportable and easy version of this, actually making sense this time,
would be to use prctl(PR_SET_PDEATHSIG, SIGQUIT). That'd send SIGQUIT to
backends whenever postmaster dies.  Obviously that's not portable
either - doing this for linux only wouldn't be all that kludgey tho.
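
For illustration, a minimal sketch of the prctl() idea (assumed shape, not an actual patch; error handling omitted, and the getppid() guard assumes init is the reaper):

#include <signal.h>
#include <sys/prctl.h>
#include <unistd.h>

int
main(void)
{
    pid_t       child = fork();

    if (child == 0)
    {
        /* in the child (think: backend forked from postmaster) */
        prctl(PR_SET_PDEATHSIG, SIGQUIT);

        /*
         * Close the race where the parent died between fork() and
         * prctl(): if we were already reparented, quit now.
         */
        if (getppid() == 1)
            raise(SIGQUIT);

        pause();            /* SIGQUIT arrives here when the parent dies */
    }
    return 0;
}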



Re: Performance degradation in commit ac1d794

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> An unportable and easy version of this, actually making sense this time,
> would be to use prctl(PR_SET_PDEATHSIG, SIGQUIT). That'd send SIGQUIT to
> backends whenever postmaster dies.  Obviously that's not portable
> either - doing this for linux only wouldn't be all that kludgey tho.

Hmm.  That would have semantics rather substantially different from
the way that the WL_POSTMASTER_DEATH code behaves.  But I don't know
how much we care about that, since the whole scenario is something
that should not happen under normal circumstances.  Maybe cross-platform
variation is OK as long as it doesn't make the code too hairy.
        regards, tom lane



Re: Performance degradation in commit ac1d794

From
Amit Kapila
Date:
On Sat, Dec 26, 2015 at 5:41 PM, Andres Freund <andres@anarazel.de> wrote:
> An unportable and easy version of this, actually making sense this time,
> would be to use prctl(PR_SET_PDEATHSIG, SIGQUIT). That'd send SIGQUIT to
> backends whenever postmaster dies.  Obviously that's not portable
> either - doing this for linux only wouldn't be all that kludgey tho.


There is a way to make backends exit on Windows as well, by using
Job Objects: set limitFlags to JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE
in JOBOBJECT_BASIC_LIMIT_INFORMATION.
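
For illustration, a rough sketch of that Windows idea (assumed shape, untested here; error handling omitted). The kill-on-close flag lives in the basic limit structure, which is set through the extended information class:

#include <windows.h>

/*
 * Sketch: processes assigned to the job get killed when the last
 * handle to the job is closed (e.g. because the postmaster exited).
 */
static HANDLE
create_kill_on_close_job(void)
{
    HANDLE      job = CreateJobObject(NULL, NULL);
    JOBOBJECT_EXTENDED_LIMIT_INFORMATION info;

    ZeroMemory(&info, sizeof(info));
    info.BasicLimitInformation.LimitFlags = JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE;
    SetInformationJobObject(job, JobObjectExtendedLimitInformation,
                            &info, sizeof(info));

    /* later: AssignProcessToJobObject(job, child_process_handle) */
    return job;
}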




With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Performance degradation in commit ac1d794

From
Andres Freund
Date:
On 2015-12-26 12:22:48 +0100, Andres Freund wrote:
> On 2015-12-25 16:29:53 -0500, Tom Lane wrote:
> > Andres Freund <andres@anarazel.de> writes:
> > > There's a couple solutions I can think of to that problem:
> > > 1) Use epoll()/kqueue, or other similar interfaces that don't require
> > >    re-registering fds at every invocation. My guess is that that'd be
> > >    desirable for performance anyway.
> >
> > Portability, on the other hand, would be problematic.
>
> Indeed. But we might be able to get away with it because there's
> realistically just one platform on which people run four-socket
> servers. Obviously we'd leave poll and select support in place.  It'd be
> a genuine improvement for less extreme loads on linux, too.

I finally got back to working on this. Attached is a WIP patch series
implementing:
0001: Allow to easily choose between the readiness primitives in unix_latch.c
      Pretty helpful for testing, not useful for anything else.
0002: Error out if waiting on socket readiness without a specified socket.
0003: Only clear unix_latch.c's self-pipe if it actually contains data.
      ~2% on high qps workloads
0004: Support using epoll as the polling primitive in unix_latch.c.
      ~3% on high qps workloads, massive scalability improvements (x3)
      on very large machines.

With 0004 obviously being the relevant bit for this thread. I verified
that using epoll addresses the performance problem, using the hardware
the OP noticed the performance problem on.

The reason I went with using epoll over the PR_SET_PDEATHSIG approach is
that it provides semantics that are more similar to the other platforms,
while being just as platform dependent as PR_SET_PDEATHSIG. It also is
actually measurably faster, at least here.

0004 currently contains one debatable optimization, which I'd like to
discuss: Currently the 'sock' passed to WaitLatchOrSocket is not
removed/added to the epoll fd if it's numerically the same as in the
last call. That's good for performance, but would be wrong if the socket
were closed and a new one with the same value were waited on.  I
think a big warning sign somewhere is sufficient to deal with that
problem - it's not something we're likely to start doing. And even if
it's done at some point, we can just offer an API to reset the last used
socket fd.
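
For readers unfamiliar with the difference, a bare-bones sketch of the pattern epoll enables (illustrative only, not the patch itself): the fd is registered once and then waited on repeatedly, instead of being handed to poll() anew on every call.

#include <sys/epoll.h>

static void
wait_loop(int sock)
{
    int         epfd = epoll_create1(0);
    struct epoll_event ev;
    struct epoll_event fired[4];

    ev.events = EPOLLIN;
    ev.data.fd = sock;
    epoll_ctl(epfd, EPOLL_CTL_ADD, sock, &ev); /* register once, not per wait */

    for (;;)
    {
        int         nready = epoll_wait(epfd, fired, 4, -1);
        int         i;

        for (i = 0; i < nready; i++)
        {
            if (fired[i].data.fd == sock && (fired[i].events & EPOLLIN))
            {
                /* socket is readable; handle it, no re-registration needed */
            }
        }
    }
}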


Unless somebody comes up with a platform independent way of addressing
this, I'm inclined to press forward using epoll(). Opinions?

Andres

Attachment

Re: Performance degradation in commit ac1d794

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> 0004 currently contains one debatable optimization, which I'd like to
> discuss: Currently the 'sock' passed to WaitLatchOrSocket is not
> removed/added to the epoll fd if it's numerically the same as in the
> last call. That's good for performance, but would be wrong if the socket
> were closed and a new one with the same value were waited on.  I
> think a big warning sign somewhere is sufficient to deal with that
> problem - it's not something we're likely to start doing. And even if
> it's done at some point, we can just offer an API to reset the last used
> socket fd.

Perhaps a cleaner API solution would be to remove the socket argument per
se from the function altogether, instead providing a separate
SetSocketToWaitOn() call.

(Also, if there is a need for it, we could provide a function that still
takes a socket argument, with the understanding that it's to be used for
short-lived sockets where you don't want to change the process's main
epoll state.)
        regards, tom lane



Re: Performance degradation in commit ac1d794

From
Robert Haas
Date:
On Thu, Jan 14, 2016 at 9:39 AM, Andres Freund <andres@anarazel.de> wrote:
> I finally got back to working on this. Attached is a WIP patch series
> implementing:
> 0001: Allow to easily choose between the readiness primitives in unix_latch.c
>       Pretty helpful for testing, not useful for anything else.

Looks good.

> 0002: Error out if waiting on socket readiness without a specified socket.

Looks good.

> 0003: Only clear unix_latch.c's self-pipe if it actually contains data.
>       ~2% on high qps workloads

everytime -> every time

+        if ((wakeEvents & WL_LATCH_SET) && latch->is_set)
+        {
+            result |= WL_LATCH_SET;
+        }

Excess braces.

Doesn't this code make it possible for the self-pipe to fill up,
self-deadlocking the process?  Suppose we repeatedly enter
WaitLatchOrSocket().  Each time we do, just after waiting = true is
set, somebody sets the latch.  We handle the signal and put a byte
into the pipe.  Returning from the signal handler, we then notice that
is_set is true and return at once, without draining the pipe.  Repeat
until something bad happens.

> 0004: Support using epoll as the polling primitive in unix_latch.c.
>       ~3% on high qps workloads, massive scalability improvements (x3)
>       on very large machines.
>
> With 0004 obviously being the relevant bit for this thread. I verified
> that using epoll addresses the performance problem, using the hardware
> the OP noticed the performance problem on.

+                /*
+                 * Unnecessarily pass data for delete due to bug errorneously
+                 * requiring it in the past.
+                 */

This is pretty vague.  And it has a spelling mistake.

Further down, nodified -> notified.

+        if (wakeEvents != latch->lastmask || latch->lastwatchfd != sock)

I don't like this very much.  I think it's a bad idea to test
latch->lastwatchfd != sock.  That has an excellent chance of letting
people write code that appears to work but then doesn't.  I think it
would be better, if we're going to change the API contract, to make it
a hard break, as I see Tom has also suggested while I've been writing
this.

Incidentally, if we're going to whack around the latch API, it would
be nice to pick a design which wouldn't be too hard to extend to
waiting on multiple sockets.  The application I have in mind is to
send off queries to several foreign servers at once and then wait until
bytes come back from any of them.  It's mostly pie in the sky at this
point, but it seems highly likely to me that we'd want to do such a
thing by waiting for bytes from any of the sockets involved OR a latch
event.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Performance degradation in commit ac1d794

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> Incidentally, if we're going to whack around the latch API, it would
> be nice to pick a design which wouldn't be too hard to extend to
> waiting on multiple sockets.  The application I have in mind is to
> send off queries to several foreign servers at once and then wait until
> bytes come back from any of them.  It's mostly pie in the sky at this
> point, but it seems highly likely to me that we'd want to do such a
> thing by waiting for bytes from any of the sockets involved OR a latch
> event.

Instead of SetSocketToWaitOn, maybe AddSocketToWaitSet and
RemoveSocketFromWaitSet?  And you'd need some way of identifying
which socket came ready after a wait call...
        regards, tom lane



Re: Performance degradation in commit ac1d794

From
Andres Freund
Date:
On 2016-01-14 10:39:55 -0500, Robert Haas wrote:
> Doesn't this code make it possible for the self-pipe to fill up,
> self-deadlocking the process?  Suppose we repeatedly enter
> WaitLatchOrSocket().  Each time we do, just after waiting = true is
> set, somebody sets the latch.  We handle the signal and put a byte
> into the pipe.  Returning from the signal handler, we then notice that
> is_set is true and return at once, without draining the pipe.  Repeat
> until something bad happens.

Should be fine, because the self-pipe is marked as non-blocking:

    if (fcntl(pipefd[1], F_SETFL, O_NONBLOCK) < 0)
        elog(FATAL, "fcntl() failed on write-end of self-pipe: %m");

and sendSelfPipeByte accepts the blocking case as success:

    /*
     * If the pipe is full, we don't need to retry, the data that's there
     * already is enough to wake up WaitLatch.
     */
    if (errno == EAGAIN || errno == EWOULDBLOCK)
        return;
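
For context, a condensed sketch of the general self-pipe pattern (simplified, not the actual unix_latch.c code): both pipe ends are non-blocking, the signal handler writes one byte, and the waiter drains whatever is there once poll() reports readability.

#include <fcntl.h>
#include <poll.h>
#include <signal.h>
#include <unistd.h>

static int  selfpipe[2];

static void
wakeup_handler(int signo)
{
    char        dummy = 0;

    /* non-blocking write: if the pipe is full, a wakeup is already pending */
    (void) write(selfpipe[1], &dummy, 1);
}

static void
wait_for_wakeup(void)
{
    struct pollfd pfd;
    char        buf[64];

    pfd.fd = selfpipe[0];
    pfd.events = POLLIN;

    if (poll(&pfd, 1, -1) > 0 && (pfd.revents & POLLIN))
    {
        /* drain everything: many wakeups collapse into a single one */
        while (read(selfpipe[0], buf, sizeof(buf)) > 0)
            ;
    }
}

/*
 * Setup (error checks omitted):
 *   pipe(selfpipe);
 *   fcntl(selfpipe[0], F_SETFL, O_NONBLOCK);
 *   fcntl(selfpipe[1], F_SETFL, O_NONBLOCK);
 *   signal(SIGUSR1, wakeup_handler);
 */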


> > 0004: Support using epoll as the polling primitive in unix_latch.c.
> >       ~3% on high qps workloads, massive scalability improvements (x3)
> >       on very large machines.
> >
> > With 0004 obviously being the relevant bit for this thread. I verified
> > that using epoll addresses the performance problem, using the hardware
> > the OP noticed the performance problem on.
> 
> +                /*
> +                 * Unnecessarily pass data for delete due to bug errorneously
> +                 * requiring it in the past.
> +                 */
> 
> This is pretty vague.  And it has a spelling mistake.

Will add a reference to the manpage (where that requirement is coming
from).

> Further down, nodified -> notified.
> 
> +        if (wakeEvents != latch->lastmask || latch->lastwatchfd != sock)
> 
> I don't like this very much.

Yea, me neither, which is why I called it out... I think it's not too
likely to cause problems in practice though. But I think changing the
API makes sense, so the likelihood shouldn't be a relevant issue.

> Incidentally, if we're going to whack around the latch API, it would
> be nice to pick a design which wouldn't be too hard to extend to
> waiting on multiple sockets.

Hm. That seems likely to make usage harder for users of the API. So it
seems like it'd make sense to provide a simpler version anyway, for the
majority of users.

So, I'm wondering how we'd exactly use a hypothetical
SetSocketToWaitOn, or SetSocketsToWaitOn (or whatever). I mean it can
make a fair bit of sense to sometimes wait on MyLatch/port->sock and
sometimes on MyLatch/fdw connections. The simple proposed code would
change the epoll set whenever switching between both, but with
SetSocketsToWaitOn you'd probably end up switching this much more often?

One way to address that would be to create a 'latch wait' datastructure,
that'd then contain the epoll fd/win32 wait events/... That way you
could have one 'LatchWait' for latch + client socket and one for latch +
fdw sockets.

Greetings,

Andres Freund



Re: Performance degradation in commit ac1d794

From
Robert Haas
Date:
On Thu, Jan 14, 2016 at 10:46 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> Incidentally, if we're going to whack around the latch API, it would
>> be nice to pick a design which wouldn't be too hard to extend to
>> waiting on multiple sockets.  The application I have in mind is to
>> send off queries to several foreign servers at once and then wait until
>> bytes come back from any of them.  It's mostly pie in the sky at this
>> point, but it seems highly likely to me that we'd want to do such a
>> thing by waiting for bytes from any of the sockets involved OR a latch
>> event.
>
> Instead of SetSocketToWaitOn, maybe AddSocketToWaitSet and
> RemoveSocketFromWaitSet?  And you'd need some way of identifying
> which socket came ready after a wait call...

Yeah.  Although I think for now it would be fine to just error out if
somebody tries to add a socket and there already is one.  Then we
could lift that limitation in a later commit.  Of course if Andres
wants to do the whole thing now I'm not going to get in the way, but
since that will require Windows tinkering and so on it may be more
than he wants to dive into.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Performance degradation in commit ac1d794

From
Andres Freund
Date:
On January 14, 2016 5:16:59 PM GMT+01:00, Robert Haas <robertmhaas@gmail.com> wrote:
>Yeah.  Although I think for now it would be fine to just error out if
>somebody tries to add a socket and there already is one.  Then we
>could lift that limitation in a later commit.  Of course if Andres
>wants to do the whole thing now I'm not going to get in the way, but
>since that will require Windows tinkering and so on it may be more
>than he wants to dive into.

Yea, I don't want to do anything really large at the moment. My primary interest is fixing the major performance
regression.

Andres


--- 
Please excuse brevity and formatting - I am writing this on my mobile phone.



Re: Performance degradation in commit ac1d794

From
Robert Haas
Date:
On Thu, Jan 14, 2016 at 10:56 AM, Andres Freund <andres@anarazel.de> wrote:
> So, I'm wondering how we'd exactly use a hypothetical
> SetSocketToWaitOn, or SetSocketsToWaitOn (or whatever). I mean it can
> make a fair bit of sense to sometimes wait on MyLatch/port->sock and
> sometimes on MyLatch/fdw connections. The simple proposed code would
> change the epoll set whenever switching between both, but with
> SetSocketsToWaitOn you'd probably end up switching this much more often?
>
> One way to address that would be to create a 'latch wait' datastructure,
> that'd then contain the epoll fd/win32 wait events/... That way you
> could have one 'LatchWait' for latch + client socket and one for latch +
> fdw sockets.

I see your point.  As far as I can see, it's currently true that,
right now, the only places where we wait for a socket are places where
the socket will live for the lifetime of the backend, but I think we
should regard it as likely that, in the future, we'll want to use it
anywhere we want to wait for a socket to become ready.  There are
getting to be a lot of places where we need to unstick some loop
whenever the process latch gets set, and it seems likely to me that
needs will only continue to grow.  So the API should probably
contemplate that sort of need.

I think your idea of a data structure that encapsulates a set of events
for which to wait is probably a good one.  WaitLatch doesn't seem like
a great name.  Maybe WaitEventSet, and then we can have
WaitLatch(&latch) and WaitEvents(&eventset).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Performance degradation in commit ac1d794

From
Andres Freund
Date:
On 2016-01-14 11:31:03 -0500, Robert Haas wrote:
> On Thu, Jan 14, 2016 at 10:56 AM, Andres Freund <andres@anarazel.de> wrote:
> I think your idea of a data structure that encapsulates a set of events
> for which to wait is probably a good one.  WaitLatch doesn't seem like
> a great name.  Maybe WaitEventSet, and then we can have
> WaitLatch(&latch) and WaitEvents(&eventset).

Hm, I'd like to have latch in the name. It seems far from improbable to
have another wait data structure. LatchEventSet maybe? The wait would be
implied by WaitLatch.

So effectively we'd create a LatchEventSet feLatchSet; somewhere global
(and update it from a backend local to the proc latch in
SwitchToSharedLatch/SwitchBackToLocalLatch()). Then change all WaitLatch
calls to refer to those.

Do we want to provide a backward compatible API for all this? I'm fine
either way.

Andres



Re: Performance degradation in commit ac1d794

From
Robert Haas
Date:
On Thu, Jan 14, 2016 at 12:06 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-01-14 11:31:03 -0500, Robert Haas wrote:
>> On Thu, Jan 14, 2016 at 10:56 AM, Andres Freund <andres@anarazel.de> wrote:
>> I think your idea of a data structure that encapsulates a set of events
>> for which to wait is probably a good one.  WaitLatch doesn't seem like
>> a great name.  Maybe WaitEventSet, and then we can have
>> WaitLatch(&latch) and WaitEvents(&eventset).
>
> Hm, I'd like to have latch in the name. It seems far from improbable to
> have another wait data structure. LatchEventSet maybe? The wait would be
> implied by WaitLatch.

I can live with that.

> So effectively we'd create a LatchEventSet feLatchSet; somewhere global
> (and update it from a backend local to the proc latch in
> SwitchToSharedLatch/SwitchBackToLocalLatch()). Then change all WaitLatch
> calls to refer to those.

Sure.

> Do we want to provide a backward compatible API for all this? I'm fine
> either way.

How would that work?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Performance degradation in commit ac1d794

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Jan 14, 2016 at 12:06 PM, Andres Freund <andres@anarazel.de> wrote:
>> Do we want to provide a backward compatible API for all this? I'm fine
>> either way.

> How would that work?

I see no great need to be backwards-compatible on this, especially if it
would complicate matters at all.  I doubt there's a lot of third-party
code using WaitLatch right now.  Just make sure there's an obvious
compile failure for anyone who is.
        regards, tom lane



Re: Performance degradation in commit ac1d794

From
Andres Freund
Date:
On 2016-01-14 12:07:23 -0500, Robert Haas wrote:
> > Do we want to provide a backward compatible API for all this? I'm fine
> > either way.
>
> How would that work?

I'm thinking of something like;

int WaitOnLatchSet(LatchEventSet *set, int wakeEvents, long timeout);

int
WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock, long timeout)
{
        LatchEventSet set;

        LatchEventSetInit(&set, latch);

        if (sock != PGINVALID_SOCKET)
           LatchEventSetAddSock(&set, sock);

        return WaitOnLatchSet(set, wakeEvents, timeout);
}

I think we'll need to continue having wakeEvents and timeout parameters
for WaitOnLatchSet, we quite frequently want to wait socket
readability/writability, not wait on the socket, or have/not have
timeouts.




Re: Performance degradation in commit ac1d794

From
Andres Freund
Date:
On 2016-01-14 18:14:21 +0100, Andres Freund wrote:
> I'm thinking of something like;
> 
> int WaitOnLatchSet(LatchEventSet *set, int wakeEvents, long timeout);
> 
> int
> WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,long timeout)
> {
>         LatchEventSet set;
> 
>         LatchEventSetInit(&set, latch);
> 
>         if (sock != PGINVALID_SOCKET)
>            LatchEventSetAddSock(&set, sock);
> 
>         return WaitOnLatchSet(set, wakeEvents, timeout);
> }
> 
> I think we'll need to continue having wakeEvents and timeout parameters
> for WaitOnLatchSet, we quite frequently want to wait socket
> readability/writability, not wait on the socket, or have/not have
> timeouts.

This brings me to something related: I'm wondering if we shouldn't merge
unix/win32_latch.c. If we go this route it seems like the amount of
shared infrastructure will further increase. The difference between
win32 and, say, the select code isn't much bigger than the difference
between select/poll. epoll/win32 are probably more similar than that
actually.

Andres



Re: Performance degradation in commit ac1d794

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> This brings me to something related: I'm wondering if we shouldn't merge
> unix/win32_latch.c.

Well, it's duplicated code on the one hand versus maze-of-ifdefs on the
other.  Feel free to try it and see, but I'm unsure it'd be an improvement.
        regards, tom lane



Re: Performance degradation in commit ac1d794

From
Andres Freund
Date:
On 2016-01-14 18:14:21 +0100, Andres Freund wrote:
> On 2016-01-14 12:07:23 -0500, Robert Haas wrote:
> > > Do we want to provide a backward compatible API for all this? I'm fine
> > > either way.
> >
> > How would that work?
> 
> I'm thinking of something like;
> 
> int WaitOnLatchSet(LatchEventSet *set, int wakeEvents, long timeout);
> 
> int
> WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,long timeout)
> {
>         LatchEventSet set;
> 
>         LatchEventSetInit(&set, latch);
> 
>         if (sock != PGINVALID_SOCKET)
>            LatchEventSetAddSock(&set, sock);
> 
>         return WaitOnLatchSet(set, wakeEvents, timeout);
> }
> 
> I think we'll need to continue having wakeEvents and timeout parameters
> for WaitOnLatchSet, we quite frequently want to wait socket
> readability/writability, not wait on the socket, or have/not have
> timeouts.

Hm. If we really want to support multiple sockets at some point the
above WaitOnLatchSet signature isn't going to fly, because it won't
support figuring out which fd the event triggered on.

So it seems we'd need to return something like

struct LatchEvent
{
        enum LatchEventType {WL_LATCH_EVENT; WL_TIMEOUT; WL_SOCKET_EVENT; WL_POSTMASTER_EVENT; ...} event_type;
        int mask;
        pgsocket event_sock;
};

that'd also allow us to extend this to return multiple events if we want
that at some point. Alternatively we could add a pgsocket* argument, but
that doesn't really seem much better.

Not super happy about the above proposal.

Andres



Re: Performance degradation in commit ac1d794

From
Robert Haas
Date:
On Thu, Jan 14, 2016 at 12:14 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-01-14 12:07:23 -0500, Robert Haas wrote:
>> > Do we want to provide a backward compatible API for all this? I'm fine
>> > either way.
>>
>> How would that work?
>
> I'm thinking of something like;
>
> int WaitOnLatchSet(LatchEventSet *set, int wakeEvents, long timeout);
>
> int
> WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,long timeout)
> {
>         LatchEventSet set;
>
>         LatchEventSetInit(&set, latch);
>
>         if (sock != PGINVALID_SOCKET)
>            LatchEventSetAddSock(&set, sock);
>
>         return WaitOnLatchSet(set, wakeEvents, timeout);
> }
>
> I think we'll need to continue having wakeEvents and timeout parameters
> for WaitOnLatchSet, we quite frequently want to wait socket
> readability/writability, not wait on the socket, or have/not have
> timeouts.

Well, if we ever wanted to support multiple FDs, we'd need the
readability/writeability thing to be per-fd, not per-set.

Overall, if this is what you have in mind for backward compatibility,
I rate it M for Meh.  Let's just break compatibility and people will
have to update their code.  That shouldn't be hard, and if we don't
make people do it when we make the change, then we'll be stuck with
the backward-compatibility interface for a decade.  I doubt it's worth
it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Performance degradation in commit ac1d794

From
Kyotaro HORIGUCHI
Date:
Hello, I am one of those who want to wait on many sockets at once.

At Thu, 14 Jan 2016 18:55:51 +0100, Andres Freund <andres@anarazel.de> wrote in
<20160114175551.GM10941@awork2.anarazel.de>
> On 2016-01-14 18:14:21 +0100, Andres Freund wrote:
> > On 2016-01-14 12:07:23 -0500, Robert Haas wrote:
> > > > Do we want to provide a backward compatible API for all this? I'm fine
> > > > either way.
> > >
> > > How would that work?
> > 
> > I'm thinking of something like;
> > 
> > int WaitOnLatchSet(LatchEventSet *set, int wakeEvents, long timeout);
> > 
> > int
> > WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,long timeout)
> > {
> >         LatchEventSet set;
> > 
> >         LatchEventSetInit(&set, latch);
> > 
> >         if (sock != PGINVALID_SOCKET)
> >            LatchEventSetAddSock(&set, sock);
> > 
> >         return WaitOnLatchSet(set, wakeEvents, timeout);
> > }
> > 
> > I think we'll need to continue having wakeEvents and timeout parameters
> > for WaitOnLatchSet, we quite frequently want to wait socket
> > readability/writability, not wait on the socket, or have/not have
> > timeouts.
> 
> Hm. If we really want to support multiple sockets at some point the
> above WaitOnLatchSet signature isn't going to fly, because it won't
> support figuring out which fd the event triggered on.
> 
> So it seems we'd need to return something like
> struct LatchEvent
> {
>         enum LatchEventType {WL_LATCH_EVENT;WL_TIMEOUT;WL_SOCKET_EVENT; WL_POSTMASTER_EVENT;...} event_type;
>         int mask;
>         pgsocket event_sock;
> };
> 
> that'd also allow to extend this to return multiple events if we want
> that at some point. Alternatively we could add a pgsocket* argument, but
> that doesn't really seem much better.
> 
> Not super happy about the above proposal.

How about allowing registration of a callback for every waiting
socket? The signature of the callback function would be like:

enum LATCH_CALLBACK_STATE
LatchWaitCallback(pgsocket event_sock, enum LatchEventType, int mask?, void *bogus);

It can return, for instance, LCB_CONTINUE, LCB_BREAK or
LCB_IMMEDBREAK, and if any one of them returns LCB_BREAK, the wait will
break after, perhaps, calling all callbacks for fired events.

We could have predefined callbacks for every event which do nothing
but set a corresponding flag and return LCB_BREAK.

/* Waiting set has been constructed so far */
if (!WaitOnLatchSet(&set(?)))
  error();

if (is_sock_readable[sockid]) {}  /* is_... would be global */
if (is_sock_writable[sockid]) {}  /* is_... would be global */
/* Any other types of trigger would be processed elsewhere */


Although it might be slow if we have an enormous number of
sockets fired at once, I suppose it returns only for a few
sockets in most cases.

I don't see the big picture of the whole latch/signalling
mechanism with this, but callbacks might be usable for signalling
on many sockets.


regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: Performance degradation in commit ac1d794

From
Robert Haas
Date:
On Thu, Jan 14, 2016 at 12:28 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andres Freund <andres@anarazel.de> writes:
>> This brings me to something related: I'm wondering if we shouldn't merge
>> unix/win32_latch.c.
>
> Well, it's duplicated code on the one hand versus maze-of-ifdefs on the
> other.  Feel free to try it and see, but I'm unsure it'd be an improvement.

I think we should either get this fixed RSN or revert the problematic
commit until we get it fixed.  I'd be rather disappointed about the
latter because I think this was a very good thing on the merits, but
probably not good enough to justify taking the performance hit over
the long term.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Performance degradation in commit ac1d794

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> I think we should either get this fixed RSN or revert the problematic
> commit until we get it fixed.  I'd be rather disappointed about the
> latter because I think this was a very good thing on the merits, but
> probably not good enough to justify taking the performance hit over
> the long term.

Since it's only in HEAD, I'm not seeing the urgency of reverting it.
However, it'd be a good idea to put this on the 9.6 open items list
(have we got such a page yet?) to make sure it gets addressed before
beta.
        regards, tom lane



Re: Performance degradation in commit ac1d794

From
Robert Haas
Date:
On Thu, Feb 11, 2016 at 12:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> I think we should either get this fixed RSN or revert the problematic
>> commit until we get it fixed.  I'd be rather disappointed about the
>> latter because I think this was a very good thing on the merits, but
>> probably not good enough to justify taking the performance hit over
>> the long term.
>
> Since it's only in HEAD, I'm not seeing the urgency of reverting it.
> However, it'd be a good idea to put this on the 9.6 open items list
> (have we got such a page yet?) to make sure it gets addressed before
> beta.

One problem is that it makes for misleading results if you try to
benchmark 9.5 against 9.6.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Performance degradation in commit ac1d794

From
Andres Freund
Date:
On 2016-02-11 12:50:58 -0500, Robert Haas wrote:
> On Thu, Feb 11, 2016 at 12:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Robert Haas <robertmhaas@gmail.com> writes:
> >> I think we should either get this fixed RSN or revert the problematic
> >> commit until we get it fixed.  I'd be rather disappointed about the
> >> latter because I think this was a very good thing on the merits, but
> >> probably not good enough to justify taking the performance hit over
> >> the long term.
> >
> > Since it's only in HEAD, I'm not seeing the urgency of reverting it.
> > However, it'd be a good idea to put this on the 9.6 open items list
> > (have we got such a page yet?) to make sure it gets addressed before
> > beta.
> 
> One problem is that it makes for misleading results if you try to
> benchmark 9.5 against 9.6.

You need a really beefy box to show the problem. On a large/new 2 socket
machine the performance regression in in the 1-3% range for a pgbench of
SELECT 1. So it's not like it's immediately showing up for everyone.

Putting it on the open items list sounds good to me.

Regards,

Andres



Re: Performance degradation in commit ac1d794

From
Robert Haas
Date:
On Thu, Feb 11, 2016 at 12:53 PM, Andres Freund <andres@anarazel.de> wrote:
>> One problem is that it makes for misleading results if you try to
>> benchmark 9.5 against 9.6.
>
> You need a really beefy box to show the problem. On a large/new 2-socket
> machine the performance regression is in the 1-3% range for a pgbench of
> SELECT 1. So it's not like it's immediately showing up for everyone.
>
> Putting it on the open items list sounds good to me.

Well, OK, I've done that then.  I don't really agree that it's not a
problem; the OP said he saw a 3x regression, and some of my colleagues
doing benchmarking are complaining about this commit, too.  It doesn't
seem like much of a stretch to think that it might be affecting other
people as well.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Performance degradation in commit ac1d794

From
Andres Freund
Date:
On 2016-02-11 13:09:27 -0500, Robert Haas wrote:
> On Thu, Feb 11, 2016 at 12:53 PM, Andres Freund <andres@anarazel.de> wrote:
> >> One problem is that it makes for misleading results if you try to
> >> benchmark 9.5 against 9.6.
> >
> > You need a really beefy box to show the problem. On a large/new 2-socket
> > machine the performance regression is in the 1-3% range for a pgbench of
> > SELECT 1. So it's not like it's immediately showing up for everyone.
> >
> > Putting it on the open items list sounds good to me.
> 
> Well, OK, I've done that then.  I don't really agree that it's not a
> problem; the OP said he saw a 3x regression, and some of my colleagues
> doing benchmarking are complaining about this commit, too.  It doesn't
> seem like much of a stretch to think that it might be affecting other
> people as well.

Well, I can't do anything about that right now. I won't have the time to
whip up the new/more complex API we discussed upthread in the next few
days.  So either we go with a simpler API (e.g. pretty much a cleaned up
version of my earlier patch), revert the postmaster death check, or
somebody else has to take lead in renovating, or we wait...

Andres



Re: Performance degradation in commit ac1d794

From
Robert Haas
Date:
On Thu, Feb 11, 2016 at 1:19 PM, Andres Freund <andres@anarazel.de> wrote:
>> > Putting it on the open items list sounds good to me.
>>
>> Well, OK, I've done that then.  I don't really agree that it's not a
>> problem; the OP said he saw a 3x regression, and some of my colleagues
>> doing benchmarking are complaining about this commit, too.  It doesn't
>> seem like much of a stretch to think that it might be affecting other
>> people as well.
>
> Well, I can't do anything about that right now. I won't have the time to
> whip up the new/more complex API we discussed upthread in the next few
> days.  So either we go with a simpler API (e.g. pretty much a cleaned up
> version of my earlier patch), revert the postmaster death check, or
> somebody else has to take lead in renovating, or we wait...

Well, I thought we could just revert the patch until you had time to
deal with it, and then put it back in.  That seemed like a simple and
practical option from here, and I don't think I quite understand why
you and Tom don't like it.  I don't have a problem with deferring to
the majority will here, but I would sort of like to understand the
reason for the majority will.

BTW, if need be, I can look for an EnterpriseDB resource to work on
this.  It won't likely be me, though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Performance degradation in commit ac1d794

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Feb 11, 2016 at 1:19 PM, Andres Freund <andres@anarazel.de> wrote:
>> Well, I can't do anything about that right now. I won't have the time to
>> whip up the new/more complex API we discussed upthread in the next few
>> days.  So either we go with a simpler API (e.g. pretty much a cleaned up
>> version of my earlier patch), revert the postmaster death check, or
>> somebody else has to take lead in renovating, or we wait...

> Well, I thought we could just revert the patch until you had time to
> deal with it, and then put it back in.  That seemed like a simple and
> practical option from here, and I don't think I quite understand why
> you and Tom don't like it.

Don't particularly want the git history churn, if we expect that the
patch will ship as-committed in 9.6.  If it becomes clear that the
performance fix is unlikely to happen, we can revert then.

If the performance change were an issue for a lot of testing, I'd agree
with a temporary revert, but I concur with Andres that it's not blocking
much.  Anybody who does have an issue there can revert locally, no?
        regards, tom lane



Re: Performance degradation in commit ac1d794

From
Robert Haas
Date:
On Thu, Feb 11, 2016 at 1:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Thu, Feb 11, 2016 at 1:19 PM, Andres Freund <andres@anarazel.de> wrote:
>>> Well, I can't do anything about that right now. I won't have the time to
>>> whip up the new/more complex API we discussed upthread in the next few
>>> days.  So either we go with a simpler API (e.g. pretty much a cleaned up
>>> version of my earlier patch), revert the postmaster death check, or
>>> somebody else has to take lead in renovating, or we wait...
>
>> Well, I thought we could just revert the patch until you had time to
>> deal with it, and then put it back in.  That seemed like a simple and
>> practical option from here, and I don't think I quite understand why
>> you and Tom don't like it.
>
> Don't particularly want the git history churn, if we expect that the
> patch will ship as-committed in 9.6.  If it becomes clear that the
> performance fix is unlikely to happen, we can revert then.
>
> If the performance change were an issue for a lot of testing, I'd agree
> with a temporary revert, but I concur with Andres that it's not blocking
> much.  Anybody who does have an issue there can revert locally, no?

True.  Maybe we'll just have to start doing that for EnterpriseDB
benchmarking as standard practice.  Not sure everybody who is
benchmarking will realize the issue though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



[PoC] WaitLatchOrSocketMulti (Re: Performance degradation in commit ac1d794)

From
Kyotaro HORIGUCHI
Date:
Hello, I don't see how ac1d794 will be dealt with, but I tried an
example implementation of a multi-socket version of WaitLatchOrSocket
using callbacks, on top of the current master where ac1d794 has
not been reverted yet.

At Thu, 14 Jan 2016 13:46:44 -0500, Robert Haas <robertmhaas@gmail.com> wrote in
<CA+TgmoYBa8TJRGS07JCSLKpqGkrRd5hLpirvwp36s=83ChmQDA@mail.gmail.com>
> On Thu, Jan 14, 2016 at 12:14 PM, Andres Freund <andres@anarazel.de> wrote:
> > On 2016-01-14 12:07:23 -0500, Robert Haas wrote:
> >> > Do we want to provide a backward compatible API for all this? I'm fine
> >> > either way.
> >>
> >> How would that work?
> >
> > I'm thinking of something like;
> >
> > int WaitOnLatchSet(LatchEventSet *set, int wakeEvents, long timeout);
> >
> > int
> > WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,long timeout)
> > {
> >         LatchEventSet set;
> >
> >         LatchEventSetInit(&set, latch);
> >
> >         if (sock != PGINVALID_SOCKET)
> >            LatchEventSetAddSock(&set, sock);
> >
> >         return WaitOnLatchSet(set, wakeEvents, timeout);
> > }
> >
> > I think we'll need to continue having wakeEvents and timeout parameters
> > for WaitOnLatchSet; we quite frequently want to wait for socket
> > readability/writability, not wait on the socket, or have/not have
> > timeouts.
> 
> Well, if we ever wanted to support multiple FDs, we'd need the
> readability/writeability thing to be per-fd, not per-set.
> 
> Overall, if this is what you have in mind for backward compatibility,
> I rate it M for Meh.  Let's just break compatibility and people will
> have to update their code.  That shouldn't be hard, and if we don't
> make people do it when we make the change, then we'll be stuck with
> the backward-compatibility interface for a decade.  I doubt it's worth
> it.

The API is similar to what Robert suggested, but differs because his
version would be a bit too complicated for most cases. So this example
implementation takes an intermediate style between the current API and
Robert's suggestion, using callbacks as I proposed.

int WaitLatchOrSocketMulti(pgwaitobject *wobjs, int nobjs, long timeout);

This is implemented only for poll, not for select.

A sample usage is seen in secure_read().

>     pgwaitobject objs[3];
...
>     InitWaitLatch(objs[0], MyLatch);
>     InitWaitPostmasterDeath(objs[1]);
>     InitWaitSocket(objs[2], port->sock, waitfor);
> 
>     w = WaitLatchOrSocketMulti(objs, 3, 0);
> //    w = WaitLatchOrSocket(MyLatch,
> //                WL_LATCH_SET | WL_POSTMASTER_DEATH | waitfor,
> //                port->sock, 0);


The core of the function looks like the following. It runs
callbacks for every fired event.

>    rc = poll(pfds, nfds, (int) cur_timeout);
...
>    if (rc < 0)
...
>    else
>    {
>      for (i = 0 ; i < nfds ; i++)
>      {
>        wobjs[i].retEvents = 0;
>        if (pfds[i].revents && wobjs[i].cb)
>          result |= wobjs[i].cb(&wobjs[i], pfds[i].revents);
>
>        if (result & WL_IMMEDIATELY_BREAK)
>          break;
>      }
>    }

In the above part, poll()'s events are passed to the callbacks, so
the callbacks may need a different implementation for select().

Sockets can have a callback as well; the initializer could look like
the following.

> InitWaitSocketCB(wait_obj, sock, event, your_callback);


If we want to keep the waiting-object array independent of specific
functions, to achieve asynchronous handling of socket events, that
could be realised by providing a set of wrapper functions, exactly as
Robert said above.
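
For instance, a compatibility wrapper over WaitLatchOrSocketMulti()
could look like this (a sketch reusing the names from the example
implementation above; this wrapper is not part of the posted patch):

    static int
    WaitLatchOrSocketCompat(volatile Latch *latch, int wakeEvents,
                            pgsocket sock, long timeout)
    {
        pgwaitobject objs[3];
        int          nobjs = 0;

        if (wakeEvents & WL_LATCH_SET)
        {
            InitWaitLatch(objs[nobjs], latch);
            nobjs++;
        }
        if (wakeEvents & WL_POSTMASTER_DEATH)
        {
            InitWaitPostmasterDeath(objs[nobjs]);
            nobjs++;
        }
        if (sock != PGINVALID_SOCKET)
        {
            InitWaitSocket(objs[nobjs], sock, wakeEvents);
            nobjs++;
        }

        return WaitLatchOrSocketMulti(objs, nobjs, timeout);
    }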

Does this make sense?
Does anyone have any opinions or thoughts?

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Performance degradation in commit ac1d794

From
Andres Freund
Date:
On 2016-02-08 17:49:18 +0900, Kyotaro HORIGUCHI wrote:
> How about allowing registration of a callback for every waiting
> socket. The signature of the callback function would be like

I don't think a callback-based API is going to serve us well. Most of
the current latch callers would get noticeably more complex that
way. And a number of them will benefit from latches using epoll
internally.

Greetings,

Andres Freund



Re: Performance degradation in commit ac1d794

From
Andres Freund
Date:
On 2016-01-14 12:07:23 -0500, Robert Haas wrote:
> On Thu, Jan 14, 2016 at 12:06 PM, Andres Freund <andres@anarazel.de> wrote:
> > On 2016-01-14 11:31:03 -0500, Robert Haas wrote:
> >> On Thu, Jan 14, 2016 at 10:56 AM, Andres Freund <andres@anarazel.de> wrote:
> >> I think your idea of a data structure the encapsulates a set of events
> >> for which to wait is probably a good one.  WaitLatch doesn't seem like
> >> a great name.  Maybe WaitEventSet, and then we can have
> >> WaitLatch(&latch) and WaitEvents(&eventset).
> >
> Hm, I'd like to have latch in the name. It seems far from improbable to
> > have another wait data structure. LatchEventSet maybe? The wait would be
> > implied by WaitLatch.
> 
> I can live with that.


How about the following sketch of an API

typedef struct LatchEvent
{
        uint32 events; /* interesting events */
        uint32 revents;  /* returned events */
        int fd; /* fd associated with event */
} LatchEvent;

typedef struct LatchEventSet
{
        int nevents;
        LatchEvent events[FLEXIBLE_ARRAY_MEMBER];
} LatchEventSet;

/*
 * Create a LatchEventSet with space for nevents different events to wait for.
 *
 * latch may be NULL.
 */
extern LatchEventSet *CreateLatchEventSet(int nevents, Latch *latch);

/* ---
 * Add an event to the set. Possible events are:
 * - WL_LATCH_SET: Wait for the latch to be set
 * - WL_SOCKET_READABLE: Wait for socket to become readable,
 *   can be combined in one event with WL_SOCKET_WRITEABLE
 * - WL_SOCKET_WRITABLE: Wait for socket to become writeable,
 *   can be combined with WL_SOCKET_READABLE
 * - WL_POSTMASTER_DEATH: Wait for postmaster to die
 */
extern void AddLatchEventToSet(LatchEventSet *set, uint32 events, int fd);

/*
 * Wait for any events added to the set to happen, or until the timeout is
 * reached.
 *
 * The return value is the union of all the events that were detected. This
 * allows us to avoid having to look into the associated events[i].revents
 * fields.
 */
extern uint32 WaitLatchEventSet(LatchEventSet *set, long timeout);



I've two questions:

- Is there any benefit of being able to wait for more than one latch?
  I'm inclined to not allow that for now, that'd make the patch bigger,
  and I don't see a use-case right now.

- Given current users we don't need a large amount of events, so having
  to iterate through the registered events doesn't seem bothersome. We
  could however change the api to be something like

  int WaitLatchEventSet(LatchEventSet *set, OccurredEvents *, int nevents, long timeout);

  which would return the number of events that happened, and would
  basically "fill" one of the (usually stack allocated) OccurredEvents
  structures with what happened.
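
For the second variant, usage could look like this (a sketch only;
OccurredEvents' exact fields are an assumption, and set/timeout are
assumed to be in scope):

    OccurredEvents occurred[4];
    int            nevents;

    nevents = WaitLatchEventSet(set, occurred, lengthof(occurred), timeout);
    for (int i = 0; i < nevents; i++)
    {
        if (occurred[i].events & WL_SOCKET_READABLE)
            handle_read(occurred[i].fd);    /* hypothetical handler */
    }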
 


Comments?



Re: Performance degradation in commit ac1d794

From
Robert Haas
Date:
On Wed, Mar 16, 2016 at 2:52 PM, Andres Freund <andres@anarazel.de> wrote:
> How about the following sketch of an API
>
> typedef struct LatchEvent
> {
>         uint32 events; /* interesting events */
>         uint32 revents;  /* returned events */
>         int fd; /* fd associated with event */
> } LatchEvent;
>
> typedef struct LatchEventSet
> {
>         int nevents;
>         LatchEvent events[FLEXIBLE_ARRAY_MEMBER];
> } LatchEventSet;
>
> /*
>  * Create a LatchEventSet with space for nevents different events to wait for.
>  *
>  * latch may be NULL.
>  */
> extern LatchEventSet *CreateLatchEventSet(int nevents, Latch *latch);

We might be able to rejigger this so that it didn't require palloc, if
we got rid of FLEXIBLE_ARRAY_MEMBER and passed int nevents and
LatchEvent * separately to WaitLatchThingy().  But I guess maybe this
will be infrequent enough not to matter.

> I've two questions:
> - Is there any benefit of being able to wait for more than one latch?
>   I'm inclined to not allow that for now, that'd make the patch bigger,
>   and I don't see a use-case right now.

I don't see a use case, either.

> - Given current users we don't need a large amount of events, so having
>   to iterate through the registered events doesn't seem bothersome. We
>   could however change the api to be something like
>
>   int WaitLatchEventSet(LatchEventSet *set, OccurredEvents *, int nevents, long timeout);
>
>   which would return the number of events that happened, and would
>   basically "fill" one of the (usually stack allocated) OccurredEvent
>   structures with what happened.

I definitely think something along these lines is useful.  I want to
be able to have an Append node with 100 ForeignScans under it and kick
off all the scans asynchronously and wait for all of the FDs at once.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Performance degradation in commit ac1d794

From
Andres Freund
Date:
On 2016-03-16 15:08:07 -0400, Robert Haas wrote:
> On Wed, Mar 16, 2016 at 2:52 PM, Andres Freund <andres@anarazel.de> wrote:
> > How about the following sketch of an API
> >
> > typedef struct LatchEvent
> > {
> >         uint32 events; /* interesting events */
> >         uint32 revents;  /* returned events */
> >         int fd; /* fd associated with event */
> > } LatchEvent;
> >
> > typedef struct LatchEventSet
> > {
> >         int nevents;
> >         LatchEvent events[FLEXIBLE_ARRAY_MEMBER];
> > } LatchEventSet;
> >
> > /*
> >  * Create a LatchEventSet with space for nevents different events to wait for.
> >  *
> >  * latch may be NULL.
> >  */
> > extern LatchEventSet *CreateLatchEventSet(int nevents, Latch *latch);
> 
> We might be able to rejigger this so that it didn't require palloc, if
> we got rid of FLEXIBLE_ARRAY_MEMBER and passed int nevents and
> LatchEvent * separately to WaitLatchThingy().  But I guess maybe this
> will be infrequent enough not to matter.

I think we'll basically end up allocating them once for the frequent
callsites.


> > - Given current users we don't need a large amount of events, so having
> >   to iterate through the registered events doesn't seem bothersome. We
> >   could however change the api to be something like
> >
> >   int WaitLatchEventSet(LatchEventSet *set, OccurredEvents *, int nevents, long timeout);
> >
> >   which would return the number of events that happened, and would
> >   basically "fill" one of the (usually stack allocated) OccurredEvent
> >   structures with what happened.
> 
> I definitely think something along these lines is useful.  I want to
> be able to have an Append node with 100 ForeignScans under it and kick
> off all the scans asynchronously and wait for all of the FDs at once.

So you'd like to get only an event for the FD with data back? Or are you
ok with iterating through hundred elements in an array, to see which are
ready?

Andres



Re: Performance degradation in commit ac1d794

From
Robert Haas
Date:
On Wed, Mar 16, 2016 at 3:25 PM, Andres Freund <andres@anarazel.de> wrote:
>> > - Given current users we don't need a large amount of events, so having
>> >   to iterate through the registered events doesn't seem bothersome. We
>> >   could however change the api to be something like
>> >
>> >   int WaitLatchEventSet(LatchEventSet *set, OccurredEvents *, int nevents, long timeout);
>> >
>> >   which would return the number of events that happened, and would
>> >   basically "fill" one of the (usually stack allocated) OccurredEvent
>> >   structures with what happened.
>>
>> I definitely think something along these lines is useful.  I want to
>> be able to have an Append node with 100 ForeignScans under it and kick
>> off all the scans asynchronously and wait for all of the FDs at once.
>
> So you'd like to get only an event for the FD with data back? Or are you
> ok with iterating through hundred elements in an array, to see which are
> ready?

I'd like to get an event back for the FD with data.  Iterating sounds
like it could be really slow.  Say you get lots of little packets back
from the same connection, while the others are idle.  Now you've got
to keep iterating through them all over and over again.  Blech.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Performance degradation in commit ac1d794

From
Andres Freund
Date:
On 2016-03-16 15:41:28 -0400, Robert Haas wrote:
> On Wed, Mar 16, 2016 at 3:25 PM, Andres Freund <andres@anarazel.de> wrote:
> >> > - Given current users we don't need a large amount of events, so having
> >> >   to iterate through the registered events doesn't seem bothersome. We
> >> >   could however change the api to be something like
> >> >
> >> >   int WaitLatchEventSet(LatchEventSet *set, OccurredEvents *, int nevents, long timeout);
> >> >
> >> >   which would return the number of events that happened, and would
> >> >   basically "fill" one of the (usually stack allocated) OccurredEvent
> >> >   structures with what happened.
> >>
> >> I definitely think something along these lines is useful.  I want to
> >> be able to have an Append node with 100 ForeignScans under it and kick
> >> off all the scans asynchronously and wait for all of the FDs at once.
> >
> > So you'd like to get only an event for the FD with data back? Or are you
> > ok with iterating through hundred elements in an array, to see which are
> > ready?
> 
> I'd like to get an event back for the FD with data.  Iterating sounds
> like it could be really slow.  Say you get lots of little packets back
> from the same connection, while the others are idle.  Now you've got
> to keep iterating through them all over and over again.  Blech.

Well, that's what poll() and select() require you to do internally
anyway, even if we abstract it away. But most platforms have better
implementations (epoll, kqueue, ...), so it seems fair to design for
those.
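
As a minimal standalone illustration (Linux-only; not PostgreSQL code)
of the difference: epoll_wait() hands back only the descriptors that
actually fired, while poll() forces a scan of the whole array:

    #include <stdio.h>
    #include <sys/epoll.h>
    #include <unistd.h>

    int
    main(void)
    {
        struct epoll_event ev, ready[8];
        int         fds[2];
        int         epfd = epoll_create1(0);
        int         nready;

        pipe(fds);
        ev.events = EPOLLIN;
        ev.data.fd = fds[0];
        epoll_ctl(epfd, EPOLL_CTL_ADD, fds[0], &ev);

        write(fds[1], "x", 1);

        /* returns only the ready descriptors, however many are registered */
        nready = epoll_wait(epfd, ready, 8, -1);
        for (int i = 0; i < nready; i++)
            printf("fd %d is ready\n", ready[i].data.fd);
        return 0;
    }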


Andres



Re: Performance degradation in commit ac1d794

From
Andres Freund
Date:
Hi,

On 2016-03-16 15:41:28 -0400, Robert Haas wrote:
> On Wed, Mar 16, 2016 at 3:25 PM, Andres Freund <andres@anarazel.de> wrote:
> >> > - Given current users we don't need a large amount of events, so having
> >> >   to iterate through the registered events doesn't seem bothersome. We
> >> >   could however change the api to be something like
> >> >
> >> >   int WaitLatchEventSet(LatchEventSet *set, OccurredEvents *, int nevents, long timeout);
> >> >
> >> >   which would return the number of events that happened, and would
> >> >   basically "fill" one of the (usually stack allocated) OccurredEvent
> >> >   structures with what happened.
> >>
> >> I definitely think something along these lines is useful.  I want to
> >> be able to have an Append node with 100 ForeignScans under it and kick
> >> off all the scans asynchronously and wait for all of the FDs at once.
> >
> > So you'd like to get only an event for the FD with data back? Or are you
> > ok with iterating through hundred elements in an array, to see which are
> > ready?
>
> I'd like to get an event back for the FD with data.  Iterating sounds
> like it could be really slow.  Say you get lots of little packets back
> from the same connection, while the others are idle.  Now you've got
> to keep iterating through them all over and over again.  Blech.

I've a (working) WIP version that works like I think you want.  It's
implemented in patch 05 (rework with just poll() support) and (add epoll
support). It's based on patches posted here earlier, but those aren't
interesting for the discussion.

The API is now:

typedef struct WaitEventSet WaitEventSet;

typedef struct WaitEvent
{
    int        pos;        /* position in the event data structure */
    uint32    events;        /* tripped events */
    int        fd;            /* fd associated with event */
} WaitEvent;

/*
 * Create a WaitEventSet with space for nevents different events to wait for.
 *
 * latch may be NULL.
 */
extern WaitEventSet *CreateWaitEventSet(int nevents);

/* ---
 * Add an event to the set. Possible events are:
 * - WL_LATCH_SET: Wait for the latch to be set
 * - WL_POSTMASTER_DEATH: Wait for postmaster to die
 * - WL_SOCKET_READABLE: Wait for socket to become readable
 *   can be combined in one event with WL_SOCKET_WRITEABLE
 * - WL_SOCKET_WRITABLE: Wait for socket to become writeable
 *   can be combined with WL_SOCKET_READABLE
 *
 * Returns the offset in WaitEventSet->events (starting from 0), which can be
 * used to modify previously added wait events.
 */
extern int AddWaitEventToSet(WaitEventSet *set, uint32 events, int fd, Latch *latch);

/*
 * Change the event mask and, if applicable, the associated latch of a
 * WaitEvent.
 */
extern void ModifyWaitEvent(WaitEventSet *set, int pos, uint32 events, Latch *latch);

/*
 * Wait for events added to the set to happen, or until the timeout is
 * reached.  At most nevents occurred events are returned.
 *
 * Returns the number of events that occurred, or 0 if the timeout was reached.
 */
extern int WaitEventSetWait(WaitEventSet *set, long timeout, WaitEvent* occurred_events, int nevents);


I've for now left the old latch API in place, and only converted
be-secure.c to the new style.  I'd appreciate some feedback before I go
around and convert and polish everything.
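
To make the intended use concrete, a caller along the lines of
secure_read() could, under the declarations above, end up looking
roughly like this (a sketch, not the actual converted code):

    /* set up once, e.g. at connection start */
    WaitEventSet *set = CreateWaitEventSet(3);
    WaitEvent     event;
    int           sock_pos;

    AddWaitEventToSet(set, WL_LATCH_SET, -1, MyLatch);
    AddWaitEventToSet(set, WL_POSTMASTER_DEATH, -1, NULL);
    sock_pos = AddWaitEventToSet(set, WL_SOCKET_READABLE, port->sock, NULL);

    /* later: switch the socket to write interest without rebuilding the set */
    ModifyWaitEvent(set, sock_pos, WL_SOCKET_WRITABLE, NULL);

    if (WaitEventSetWait(set, 0 /* no timeout */, &event, 1) == 1 &&
        (event.events & WL_LATCH_SET))
        ResetLatch(MyLatch);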

Questions:
* I'm kinda inclined to merge the win32 and unix latch
  implementations. There's already a fair bit in common, and this is
  just going to increase that amount.

* Right now the caller has to allocate the WaitEvents he's waiting for
  locally (likely on the stack), but we also could allocate them as part
  of the WaitEventSet. Not sure if that'd be a benefit.

* I can do a blind rewrite of the windows implementation, but I'm
  obviously not going to get that entirely right. So I need some help
  from a windows person to test this.

* This approach, with a 'stateful' wait event data structure, will
  actually allow us to fix a couple of lingering bugs we have on the windows
  port. C.f. http://www.postgresql.org/message-id/4351.1336927207@sss.pgh.pa.us

- Andres

Attachment

Re: Performance degradation in commit ac1d794

From
Amit Kapila
Date:
On Thu, Mar 17, 2016 at 7:34 AM, Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> * I can do a blind rewrite of the windows implementation, but I'm
>   obviously not going to get that entirely right. So I need some help
>   from a windows person to test this.
>

I can help you verifying the windows implementation.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Performance degradation in commit ac1d794

From
Robert Haas
Date:
On Wed, Mar 16, 2016 at 10:04 PM, Andres Freund <andres@anarazel.de> wrote:
> Questions:
> * I'm kinda inclined to merge the win32 and unix latch
>   implementations. There's already a fair bit in common, and this is
>   just going to increase that amount.

Don't care either way.

> * Right now the caller has to allocate the WaitEvents he's waiting for
>   locally (likely on the stack), but we also could allocate them as part
>   of the WaitEventSet. Not sure if that'd be a benefit.

I'm not seeing this.  What do you mean?

0001: Looking at this again, I'm no longer sure this is a bug.
Doesn't your patch just check the same conditions in the opposite
order?

0002: I think I reviewed this before.  Boring.  Just commit it already.

0003: Mostly boring.  But the change to win32_latch.c seems to remove
an unrelated check.

0004:

+         * drain it everytime WaitLatchOrSocket() is used. Should the
+         * pipe-buffer fill up in some scenarios - widly unlikely - we're

every time
wildly

Why is it wildly (or widly) unlikely?

The rejiggering this does of what ends up in which element of pfds[]
appears to be unrelated to the ostensible purpose of the patch.

+         * Check again wether latch is set, the arrival of a signal/self-byte

whether.  Also not clearly related to the patch's main purpose.
            /* at least one event occurred, so check masks */
+            if (FD_ISSET(selfpipe_readfd, &input_mask))
+            {
+                /* There's data in the self-pipe, clear it. */
+                drainSelfPipe();
+            }

The comment just preceding this added hunk now seems to be out of
place, and maybe inaccurate as well.  I think the new code could have
a bit more detailed comment.  My understanding is something like /*
Since we didn't clear the self-pipe before attempting to wait,
select() may have returned immediately even though there has been no
recent change to the state of the latch.  To prevent busy-looping, we
must clear the pipe before attempting to wait again. */

I'll look at 0005 next, but thought I would send these comments along first.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Performance degradation in commit ac1d794

From
Robert Haas
Date:
On Thu, Mar 17, 2016 at 9:01 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> I'll look at 0005 next, but thought I would send these comments along first.

0005: This is obviously very much WIP, but I think the overall
direction of it is good.
0006: Same.

I think you should use PGINVALID_SOCKET rather than -1 in various
places in various patches in this series, especially if you are going
to try to merge the Windows code path.

I wonder if CreateEventSet should accept a MemoryContext argument.  It
seems like callers will usually want TopMemoryContext, and just being
able to pass that might be more convenient than having to switch back
and forth in the calling code.
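
Concretely, that suggestion amounts to something like this (a sketch,
not the posted patch):

    extern WaitEventSet *CreateWaitEventSet(MemoryContext context, int nevents);

    /* a long-lived set, with no MemoryContextSwitchTo() at the call site: */
    set = CreateWaitEventSet(TopMemoryContext, 3);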

I wonder if there's a way to refactor this code to avoid having so
much cut-and-paste duplication.

When iterating over the returned events, maybe check whether events is
0 at the top of the loop and skip it forthwith if so.

That's all I've got for now.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Performance degradation in commit ac1d794

From
Andres Freund
Date:
On 2016-03-17 09:40:08 -0400, Robert Haas wrote:
> On Thu, Mar 17, 2016 at 9:01 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> > I'll look at 0005 next, but thought I would send these comments along first.
> 
> 0005: This is obviously very much WIP, but I think the overall
> direction of it is good.
> 0006: Same.
> 
> I think you should use PGINVALID_SOCKET rather than -1 in various
> places in various patches in this series, especially if you are going
> to try to merge the Windows code path.

Sure.


> I wonder if CreateEventSet should accept a MemoryContext argument.  It
> seems like callers will usually want TopMemoryContext, and just being
> able to pass that might be more convenient than having to switch back
> and forth in the calling code.

Makes sense.


> I wonder if there's a way to refactor this code to avoid having so
> much cut-and-paste duplication.

I guess you mean WaitEventSetWait() and WaitEventAdjust*?  I've tried,
and my attempt ended up looking nearly unreadable, because of the number
of ifdefs; I've not found a good approach. Which is sad, because adding
back select support is going to increase the duplication further :( -
but it's also further away from poll etc. (different type of timeout,
entirely different way of returning events).


> When iterating over the returned events, maybe check whether events is
> 0 at the top of the loop and skip it forthwith if so.

You mean in WaitEventSetWait()? There's

    else if (rc == 0)
    {
        break;
    }

which is the timeout case. There should never be any other case of
returning 0 elements?

> That's all I've got for now.

Thanks for looking.

Greetings,

Andres Freund



Re: Performance degradation in commit ac1d794

From
Andres Freund
Date:
Hi,

On 2016-03-17 09:01:36 -0400, Robert Haas wrote:
> 0001: Looking at this again, I'm no longer sure this is a bug.
> Doesn't your patch just check the same conditions in the opposite
> order?

Yes, that's what's required

> 0004:
> 
> +         * drain it everytime WaitLatchOrSocket() is used. Should the
> +         * pipe-buffer fill up in some scenarios - widly unlikely - we're
> 
> every time
> wildly
> 
> Why is it wildly (or widly) unlikely?
> 
> The rejiggering this does between what is on which element of pfds[]
> appears to be unrelated to the ostensible purpose of the patch.

Well, not really. We need to know when to do drainSelfPipe(), which gets
more complicated if pfds[0] is registered optionally.

I'm actually considering dropping this entirely, given the much heavier
rework in the WaitEventSet patch, which makes these details a bit obsolete.


Greetings,

Andres Freund



Re: Performance degradation in commit ac1d794

From
Andres Freund
Date:
On 2016-03-17 09:01:36 -0400, Robert Haas wrote:
> > * Right now the caller has to allocate the WaitEvents he's waiting for
> >   locally (likely on the stack), but we also could allocate them as part
> >   of the WaitEventSet. Not sure if that'd be a benefit.
>
> I'm not seeing this.  What do you mean?

Right now, to use a WaitEventSet you'd do something like

    WaitEvent   event;

    ModifyWaitEvent(FeBeWaitSet, 0, waitfor, NULL);
    WaitEventSetWait(FeBeWaitSet, 0 /* no timeout */, &event, 1);

i.e. use a WaitEvent on the stack to receive the changes. If you wanted
to get more changes than just one, you could end up allocating a fair
bit of stack space.

We could instead allocate the returned events as part of the event set,
and return them: either by returning a NULL-terminated array, or by
continuing to return the number of events as now and additionally
returning the event data structure via a pointer.

So the above would be

    WaitEvent  *events;
    int         nevents;

    ModifyWaitEvent(FeBeWaitSet, 0, waitfor, NULL);
    nevents = WaitEventSetWait(FeBeWaitSet, 0 /* no timeout */, events, 10);

    for (int off = 0; off < nevents; off++)
        ; // stuff

Andres



Re: Performance degradation in commit ac1d794

From
Andres Freund
Date:
Hi,

On 2016-03-17 09:01:36 -0400, Robert Haas wrote:
> 0001: Looking at this again, I'm no longer sure this is a bug.
> Doesn't your patch just check the same conditions in the opposite
> order?

Which is important, because what ends up in which pfds[x] depends on
wakeEvents. Folded it into a later patch; it's not harmful as long as
we're only ever testing pfds[0].


> 0003: Mostly boring.  But the change to win32_latch.c seems to remove
> an unrelated check.

Argh.


> 0004:
>
> +         * drain it everytime WaitLatchOrSocket() is used. Should the
> +         * pipe-buffer fill up in some scenarios - widly unlikely - we're
>
> every time
> wildly
>
> Why is it wildly (or widly) unlikely?

Because SetLatch (if the owner) checks latch->is_set before adding to the
pipe, and latch_sigusr1_handler() only writes to the pipe if the current
process is in WaitLatchOrSocket's loop (via the waiting check). Expanded
comment.


>
> +         * Check again wether latch is set, the arrival of a signal/self-byte
>
> whether.  Also not clearly related to the patch's main purpose.

After the change there's no need to re-compute the current timestamp
anymore; that does seem beneficial and kinda related.


>              /* at least one event occurred, so check masks */
> +            if (FD_ISSET(selfpipe_readfd, &input_mask))
> +            {
> +                /* There's data in the self-pipe, clear it. */
> +                drainSelfPipe();
> +            }
>
> The comment just preceding this added hunk now seems to be out of
> place, and maybe inaccurate as well.

Hm. Which comment exactly are you referring to?

    /* at least one event occurred, so check masks */

seems not to fit the bill?


> I think the new code could have
> a bit more detailed comment.  My understanding is something like /*
> Since we didn't clear the self-pipe before attempting to wait,
> select() may have returned immediately even though there has been no
> recent change to the state of the latch.  To prevent busy-looping, we
> must clear the pipe before attempting to wait again. */

Isn't that explained at the top, in
        /*
         * Check if the latch is set already. If so, leave loop immediately,
         * avoid blocking again. We don't attempt to report any other events
         * that might also be satisfied.
         *
         * If someone sets the latch between this and the poll()/select()
         * below, the setter will write a byte to the pipe (or signal us and
         * the signal handler will do that), and the poll()/select() will
         * return immediately.
         *
         * If there's a pending byte in the self pipe, we'll notice whenever
         * blocking. Only clearing the pipe in that case avoids having to
         * drain it every time WaitLatchOrSocket() is used. Should the
         * pipe-buffer fill up in some scenarios - wildly unlikely - we're
         * still ok, because the pipe is in nonblocking mode.
?

I've updated the last paragraph to
         * If there's a pending byte in the self pipe, we'll notice whenever
         * blocking. Only clearing the pipe in that case avoids having to
         * drain it every time WaitLatchOrSocket() is used. Should the
         * pipe-buffer fill up we're still ok, because the pipe is in
         * nonblocking mode. It's unlikely for that to happen, because the
         * self pipe isn't filled unless we're blocking (waiting = true), or
         * from inside a signal handler in latch_sigusr1_handler().
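
For reference, the drain step that this comment relies on looks roughly
like the following (a sketch; the real latch code differs in details
such as error handling):

    static void
    drainSelfPipe(void)
    {
        /* selfpipe_readfd is the O_NONBLOCK read end of the self-pipe */
        char        buf[16];

        for (;;)
        {
            int         rc = read(selfpipe_readfd, buf, sizeof(buf));

            if (rc == 0)
                elog(ERROR, "unexpected EOF on self-pipe");
            else if (rc < 0)
            {
                if (errno == EAGAIN || errno == EWOULDBLOCK)
                    break;      /* the pipe is now empty */
                if (errno == EINTR)
                    continue;   /* interrupted, retry the read */
                elog(ERROR, "read() on self-pipe failed: %m");
            }
            /* else: consumed some bytes, loop around to drain the rest */
        }
    }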

I've also applied the same optimization to windows. Less because I found
that interesting in itself, and more because it makes the WaitEventSet
easier.


Attached is a significantly revised version of the earlier series. Most
importantly I have:
* Unified the windows/unix latch implementation into one file (0004)
* Provided a select(2) implementation for the WaitEventSet API
* Provided a windows implementation for the WaitEventSet API
* Reduced duplication between the implementations a good bit by
  splitting WaitEventSetWait into WaitEventSetWait and
  WaitEventSetWaitBlock. Only the latter is implemented separately for
  each readiness primitive
* Added a backward-compatibility implementation of WaitLatchOrSocket
  using the WaitEventSet stuff. Less because I thought that to be
  terribly important, and more because it makes the patch a *lot*
  smaller.  We collected a fair amount of latch users.

This is still not fully ready. The main remaining items are testing (the
windows stuff I've only verified using cross-compiling with mingw) and
documentation.

I'd greatly appreciate a look.

Amit, you offered testing on windows; could you check whether 3/4/5
work? It's quite likely that I've screwed up something.


Robert, you'd mentioned on IM that you've a use-case for this somewhere
around multiple FDWs. If somebody has started working on that, could you
ask that person to check whether the API makes sense?

Greetings,

Andres Freund

Attachment

Re: Performance degradation in commit ac1d794

From
Robert Haas
Date:
On Thu, Mar 17, 2016 at 10:57 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-03-17 09:01:36 -0400, Robert Haas wrote:
>> 0001: Looking at this again, I'm no longer sure this is a bug.
>> Doesn't your patch just check the same conditions in the opposite
>> order?
>
> Yes, that's what's required

I mean, they are just variables.  You can check them in either order
and get the same results, no?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Performance degradation in commit ac1d794

From
Robert Haas
Date:
On Thu, Mar 17, 2016 at 10:53 AM, Andres Freund <andres@anarazel.de> wrote:
>> I wonder if there's a way to refactor this code to avoid having so
>> much cut-and-paste duplication.
>
> I guess you mean WaitEventSetWait() and WaitEventAdjust*?  I've tried,
> and my attempt ended up look nearly unreadable, because of the number of
> ifdefs. I've not found a good attempt. Which is sad, because adding back
> select support is going to increase the duplication further :( - but
> it's also further away from poll etc. (different type of timestamp,
> entirely different way of returming events).

I was more thinking of stuff like this:

+            /*
+             * We expect an EPOLLHUP when the remote end is closed, but
+             * because we don't expect the pipe to become readable or to have
+             * any errors either, treat those cases as postmaster death, too.
+             *
+             * According to the select(2) man page on Linux, select(2) may
+             * spuriously return and report a file descriptor as readable,
+             * when it's not; and presumably so can epoll_wait(2).  It's not
+             * clear that the relevant cases would ever apply to the
+             * postmaster pipe, but since the consequences of falsely
+             * returning WL_POSTMASTER_DEATH could be pretty unpleasant, we
+             * take the trouble to positively verify EOF with
+             * PostmasterIsAlive().
+             */

>> 0 at the top of the loop and skip it forthwith if so.
>
> You mean in WaitEventSetWait()? There's
>                 else if (rc == 0)
>                 {
>                         break;
>                 }
> which is the timeout case. There should never be any other case of
>                 returning 0 elements?

No, I meant if (cur_event->events == 0) continue;

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Performance degradation in commit ac1d794

From
Robert Haas
Date:
On Thu, Mar 17, 2016 at 11:17 AM, Andres Freund <andres@anarazel.de> wrote:
> Right now, to use a WaitEventSet you'd do something like
>                 WaitEvent   event;
>
>                 ModifyWaitEvent(FeBeWaitSet, 0, waitfor, NULL);
>
>                 WaitEventSetWait(FeBeWaitSet, 0 /* no timeout */, &event, 1);
>
> i.e. use a WaitEvent on the stack to receive the changes. If you wanted
> to get more changes than just one, you could end up allocating a fair
> bit of stack space.
>
> We could instead allocate the returned events as part of the event set,
> and return them. Either by returning a NULL terminated array, or by
> continuing to return the number of events as now, and additionally
> return the event data structure via a pointer.
>
> So the above would be
>
>                 WaitEvent   *events;
>                 int nevents;
>
>                 ModifyWaitEvent(FeBeWaitSet, 0, waitfor, NULL);
>
>                 nevents = WaitEventSetWait(FeBeWaitSet, 0 /* no timeout */, events, 10);
>
>                 for (int off = 0; off < nevents; off++)
>                     ; // stuff

I don't see this as being particularly better.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Performance degradation in commit ac1d794

From
Amit Kapila
Date:
On Fri, Mar 18, 2016 at 1:34 PM, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2016-03-17 09:01:36 -0400, Robert Haas wrote:
> > 0001: Looking at this again, I'm no longer sure this is a bug.
> > Doesn't your patch just check the same conditions in the opposite
> > order?
>
> Which is important, because what's in what pfds[x] depends on
> wakeEvents. Folded it into a later patch; it's not harmful as long as
> we're only ever testing pfds[0].
>
>
> > 0003: Mostly boring.  But the change to win32_latch.c seems to remove
> > an unrelated check.
>
> Argh.
>

+ * from inside a signal handler in latch_sigusr1_handler().
  *
  * Note: we assume that the kernel calls involved in drainSelfPipe()
  * and SetLatch() will provide adequate synchronization on machines
  * with weak memory ordering, so that we cannot miss seeing is_set if
  * the signal byte is already in the pipe when we drain it.
  */

- drainSelfPipe();
-

The above part of the comment looks redundant after this patch.  I have done some tests on Windows with 0003 patch which includes running the regressions (vcregress check) and it passes.  Will look into it tomorrow once again and share if I find anything wrong with it, but feel free to proceed if you want.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Performance degradation in commit ac1d794

From
Andres Freund
Date:
On 2016-03-18 05:56:41 -0400, Robert Haas wrote:
> >> 0 at the top of the loop and skip it forthwith if so.
> >
> > You mean in WaitEventSetWait()? There's
> >                 else if (rc == 0)
> >                 {
> >                         break;
> >                 }
> > which is the timeout case. There should never be any other case of
> >                 returning 0 elements?
> 
> No, I meant if (cur_event->events == 0) continue;

I'm not following. Why would there be an event with an empty event
mask? Ok, you can disable all notifications for a socket using
ModifyWaitEvent(), but that's not particularly common, right?  At least
for epoll, it'd not play a role anyway, since epoll_wait() will actually
return pointers to the elements we're waiting on; for windows we get the
offset in ->handles.  I guess we could do so in the select/poll case,
but adding another if for something infrequent doesn't strike me as a
great benefit.

- Andres



Re: Performance degradation in commit ac1d794

From
Robert Haas
Date:
On Fri, Mar 18, 2016 at 1:53 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-03-18 05:56:41 -0400, Robert Haas wrote:
>> >> 0 at the top of the loop and skip it forthwith if so.
>> >
>> > You mean in WaitEventSetWait()? There's
>> >                 else if (rc == 0)
>> >                 {
>> >                         break;
>> >                 }
>> > which is the timeout case. There should never be any other case of
>> >                 returning 0 elements?
>>
>> No, I meant if (cur_event->events == 0) continue;
>
> I'm not following. Why would there be an event without an empty event
> mask? Ok, you can disable all notifications for a socket using
> ModifyWaitEvent(), but that's not particularly common, right?  At least
> for epoll, it'd not play a role anyway, since epoll_wait() will actually
> return pointers to the elements we're waiting on; for windows we get the
> offset in ->handles.  I guess we could do so in the select/poll case,
> but adding another if for something infrequent doesn't strike me as a
> great benefit.

No, I mean it should be quite common for a particular fd to have no
events reported.  If we're polling on 100 fds and 1 of them is active
and the other 99 are just sitting there, we want to skip over the
other 99 as quickly as possible.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Performance degradation in commit ac1d794

From
Andres Freund
Date:
On 2016-03-18 14:00:58 -0400, Robert Haas wrote:
> No, I mean it should be quite common for a particular fd to have no
> events reported.  If we're polling on 100 fds and 1 of them is active
> and the other 99 are just sitting there, we want to skip over the
> other 99 as quickly as possible.

cur_event points to the event that's registered with the set, not the
one that's "returned" or "modified" by poll/select; that's where the
confusion is coming from. I'll add a fastpath.
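
Such a fastpath in the poll() result loop could look like this (a
sketch; the variable names are assumptions, not the patch's):

    for (i = 0; i < nfds && returned < nevents; i++)
    {
        if (pfds[i].revents == 0)
            continue;           /* idle fd, skip it immediately */

        occurred_events[returned].pos = i;
        occurred_events[returned].fd = pfds[i].fd;
        occurred_events[returned].events = 0;
        /* ... translate pfds[i].revents into WL_* flags here ... */
        returned++;
    }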

Greetings,

Andres Freund



Re: Performance degradation in commit ac1d794

From
Andres Freund
Date:
On 2016-03-18 20:14:07 +0530, Amit Kapila wrote:
> + * from inside a signal handler in latch_sigusr1_handler().
>   *
>   * Note: we assume that the kernel calls involved in drainSelfPipe()
>   * and SetLatch() will provide adequate synchronization on machines
>   * with weak memory ordering, so that we cannot miss seeing is_set if
>   * the signal byte is already in the pipe when we drain it.
>   */
>
> - drainSelfPipe();
> -
>
> The above part of the comment looks redundant after this patch.

Don't think so. Moving it closer to the drainSelfPipe() call might be
neat, but since there are several callsites...

> I have done some
> tests on Windows with 0003 patch which includes running the regressions
> (vcregress check) and it passes.  Will look into it tomorrow once again and
> share if I find anything wrong with it, but feel free to proceed if you want.

Thanks for the testing thus far! Let's see what the buildfarm has to
say.

Andres



Re: Performance degradation in commit ac1d794

From
Alvaro Herrera
Date:
Andres Freund wrote:

> From 1d444b0855dbf65d66d73beb647b772fff3404c8 Mon Sep 17 00:00:00 2001
> From: Andres Freund <andres@anarazel.de>
> Date: Fri, 18 Mar 2016 00:52:07 -0700
> Subject: [PATCH 4/5] Combine win32 and unix latch implementations.
> 
> Previously latches for windows and unix had been implemented in
> different files. The next patch in this series will introduce an
> expanded wait infrastructure, keeping the implementation separate would
> introduce too much duplication.
> 
> This basically just moves the functions, without too much change. The
> reason to keep this separate is that it allows blame to continue working
> a little less badly; and to make review a tiny bit easier.

This seems a reasonable change, but I think that the use of WIN32 vs.
LATCH_USE_WIN32 is pretty confusing.  In particular, LATCH_USE_WIN32
isn't actually used for anything ... I suppose we don't care since this
is a temporary state of affairs only?

In 0005: In latch.c you typedef WaitEventSet, but the typedef already
appears in latch.h.  You need only declare the struct in latch.c,
without typedef'ing.


Haven't really reviewed anything here yet, just skimming ATM.  Having so
many #ifdefs all over the place in this file looks really bad, but I
guess there's no way around that because this is very platform-specific.
I hope pgindent doesn't choke on it.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Performance degradation in commit ac1d794

From
Andres Freund
Date:
On 2016-03-18 15:49:51 -0300, Alvaro Herrera wrote:
> This seems a reasonable change, but I think that the use of WIN32 vs.
> LATCH_USE_WIN32 is pretty confusing.  In particular, LATCH_USE_WIN32
> isn't actually used for anything ... I suppose we don't care since this
> is a temporary state of affairs only?

Hm, I guess we could make some more of those use LATCH_USE_WIN32.
There are essentially two axes here: a) latch notification method
(self-pipe vs windows events), and b) readiness notification (epoll vs
poll vs select vs WaitForMultipleObjects).
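
Spelled out as preprocessor symbols, the two axes could look like this
(the symbol names here are illustrative, not necessarily the patch's):

    /* axis a: latch notification method */
    #if defined(WIN32)
    #define LATCH_USE_WIN32_EVENTS
    #else
    #define LATCH_USE_SELF_PIPE
    #endif

    /* axis b: readiness notification primitive */
    #if defined(HAVE_SYS_EPOLL_H)
    #define WAIT_USE_EPOLL
    #elif defined(HAVE_POLL)
    #define WAIT_USE_POLL
    #elif defined(WIN32)
    #define WAIT_USE_WIN32
    #else
    #define WAIT_USE_SELECT
    #endif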


> In 0005: In latch.c you typedef WaitEventSet, but the typedef already
> appears in latch.h.  You need only declare the struct in latch.c,
> without typedef'ing.

Good catch. It's even important; some compilers choke on that.


> Haven't really reviewed anything here yet, just skimming ATM.  Having so
> many #ifdefs all over the place in this file looks really bad, but I
> guess there's no way around that because this is very platform-specific.

I think it's hard to further reduce the number of ifdefs, but if you
have ideas...


> I hope pgindent doesn't choke on it.

The patch is pgindented (I'd personally never decrease indentation just
to fit a line into 79 chars, as pgindent does...).

Thanks for looking!


Greetings,

Andres Freund



Re: Performance degradation in commit ac1d794

From
Amit Kapila
Date:
On Sat, Mar 19, 2016 at 12:00 AM, Andres Freund <andres@anarazel.de> wrote:
>
> On 2016-03-18 20:14:07 +0530, Amit Kapila wrote:
>
> > I have done some
> > tests on Windows with 0003 patch which includes running the regressions
> > (vcregress check) and it passes.  Will look into it tomorrow once again and
> > share if I find anything wrong with it, but feel free to proceed if you want.
>
> Thanks for the testing thus far! Let's see what the buildfarm has to
> say.
>

Won't the new code need to ensure that ResetEvent(latchevent) should get called in case WaitForMultipleObjects() comes out when both pgwin32_signal_event and latchevent are signalled at the same time?


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Performance degradation in commit ac1d794

From
Andres Freund
Date:

On March 18, 2016 11:32:53 PM PDT, Amit Kapila <amit.kapila16@gmail.com> wrote:
>On Sat, Mar 19, 2016 at 12:00 AM, Andres Freund <andres@anarazel.de>
>wrote:
>>
>> On 2016-03-18 20:14:07 +0530, Amit Kapila wrote:
>>
>> > I have done some
>> > tests on Windows with 0003 patch which includes running the
>regressions
>> > (vcregress check) and it passes.  Will look into it tomorrow once
>again
>and
>> > share if I find anything wrong with it, but feel free to proceed if you
>want.
>>
>> Thanks for the testing thus far! Let's see what the buildfarm has to
>> say.
>>
>
>Won't the new code need to ensure that ResetEvent(latchevent) should
>get
>called in case WaitForMultipleObjects() comes out when both
>pgwin32_signal_event and latchevent are signalled at the same time?

WaitForMultiple only reports the readiness of one event at a time, no?


-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: Performance degradation in commit ac1d794

From
Amit Kapila
Date:
On Sat, Mar 19, 2016 at 12:14 PM, Andres Freund <andres@anarazel.de> wrote:
>
>
>
> On March 18, 2016 11:32:53 PM PDT, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >On Sat, Mar 19, 2016 at 12:00 AM, Andres Freund <andres@anarazel.de>
> >wrote:
> >>
> >> On 2016-03-18 20:14:07 +0530, Amit Kapila wrote:
> >>
> >> > I have done some
> >> > tests on Windows with 0003 patch which includes running the
> >regressions
> >> > (vcregress check) and it passes.  Will look into it tomorrow once
> >again
> >and
> >> > share if I find anything wrong with it, but feel free to proceed if you
> >want.
> >>
> >> Thanks for the testing thus far! Let's see what the buildfarm has to
> >> say.
> >>
> >
> >Won't the new code need to ensure that ResetEvent(latchevent) should
> >get
> >called in case WaitForMultipleObjects() comes out when both
> >pgwin32_signal_event and latchevent are signalled at the same time?
>
> WaitForMultiple only reports the readiness of one event at a time, no?
>

I don't think so; please read link [1] with a focus on the paragraph below, which states how it reports the readiness or signaled state when multiple objects become signaled.

"When bWaitAll is FALSE, this function checks the handles in the array in order starting with index 0, until one of the objects is signaled. If multiple objects become signaled, the function returns the index of the first handle in the array whose object was signaled."





With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Performance degradation in commit ac1d794

From
Andres Freund
Date:

On March 18, 2016 11:52:08 PM PDT, Amit Kapila <amit.kapila16@gmail.com> wrote:
>On Sat, Mar 19, 2016 at 12:14 PM, Andres Freund <andres@anarazel.de>
>wrote:
>>
>>
>>
>> On March 18, 2016 11:32:53 PM PDT, Amit Kapila
><amit.kapila16@gmail.com>
>wrote:
>> >On Sat, Mar 19, 2016 at 12:00 AM, Andres Freund <andres@anarazel.de>
>> >wrote:
>> >>
>> >> On 2016-03-18 20:14:07 +0530, Amit Kapila wrote:
>> >>
>> >> > I have done some
>> >> > tests on Windows with 0003 patch which includes running the
>> >regressions
>> >> > (vcregress check) and it passes.  Will look into it tomorrow
>once
>> >again
>> >and
>> >> > share if I find anything wrong with it, but feel free to proceed if
>you
>> >want.
>> >>
>> >> Thanks for the testing thus far! Let's see what the buildfarm has
>to
>> >> say.
>> >>
>> >
>> >Won't the new code need to ensure that ResetEvent(latchevent)
>should
>> >get
>> >called in case WaitForMultipleObjects() comes out when both
>> >pgwin32_signal_event and latchevent are signalled at the same time?
>>
>> WaitForMultiple only reports the readiness of one event at a time, no?
>>
>
>I don't think so; please read link [1] with a focus on the paragraph below,
>which states how it reports the readiness or signaled state when
>multiple
>objects become signaled.
>
>"When *bWaitAll* is *FALSE*, this function checks the handles in the
>array
>in order starting with index 0, until one of the objects is signaled.
>If
>multiple objects become signaled, the function returns the index of the
>first handle in the array whose object was signaled."

I think that's OK. We'll just get the next event the next time we call waitfor*. It's also not different to the way the
routine is currently handling normal socket and postmaster events, no?  It'd be absurdly broken if it handled edge
triggered events like FD_CLOSE in a way you can't discover.
 

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: Performance degradation in commit ac1d794

From
Amit Kapila
Date:
On Sat, Mar 19, 2016 at 12:40 PM, Andres Freund <andres@anarazel.de> wrote:
>
> On March 18, 2016 11:52:08 PM PDT, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >> >Won't the new code need to ensure that ResetEvent(latchevent)
> >should
> >> >get
> >> >called in case WaitForMultipleObjects() comes out when both
> >> >pgwin32_signal_event and latchevent are signalled at the same time?
> >>
> >> WaitForMultiple only reports the readiness of one event at a time, no?
> >>
> >
> >I don't think so; please read link [1] with a focus on the paragraph below,
> >which states how it reports the readiness or signaled state when
> >multiple
> >objects become signaled.
> >
> >"When *bWaitAll* is *FALSE*, this function checks the handles in the
> >array
> >in order starting with index 0, until one of the objects is signaled.
> >If
> >multiple objects become signaled, the function returns the index of the
> >first handle in the array whose object was signaled."
>
> I think that's OK. We'll just get the next event the next time we call waitfor*. It's also not different to the way the routine is currently handling normal socket and postmaster events, no?
>

I think the primary difference between the socket and postmaster events as compared to the latch event is that it won't allow starting a wait with the wait event in a signalled state.  For the socket event, it closes the event at the end and creates it again before entering the wait loop in WaitLatchOrSocket.  Even if we don't reset the latch event, I could not see any major problem apart from possible spurious wakeups in a few cases (as we haven't reset the event to a non-signalled state for the latch event before entering the wait, it can just return immediately).


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Performance degradation in commit ac1d794

From
Amit Kapila
Date:
On Fri, Mar 18, 2016 at 1:34 PM, Andres Freund <andres@anarazel.de> wrote:
>
>
> Attached is a significantly revised version of the earlier series. Most
> importantly I have:
> * Unified the window/unix latch implementation into one file (0004)
>

After applying patch 0004* on HEAD, using the command patch -p1 < <path_of_patch>, I am getting a build failure:

c1 : fatal error C1083: Cannot open source file: 'src/backend/storage/ipc/latch.c': No such file or directory

I think it could not rename port/unix_latch.c => storage/ipc/latch.c.  I have tried with git apply, but no success.  Am I doing something wrong?

One minor suggestion about patch:

+#ifndef WIN32
 void
 latch_sigusr1_handler(void)
 {
  if (waiting)
  sendSelfPipeByte();
 }
+#endif   /* !WIN32 */
 
 /* Send one byte to the self-pipe, to wake up WaitLatch */
+#ifndef WIN32
 static void
 sendSelfPipeByte(void)


Instead of individually defining these functions under #ifndef WIN32, isn't it better to combine them all, as they are at the end of the file?


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Performance degradation in commit ac1d794

From
Andres Freund
Date:
On 2016-03-19 15:43:27 +0530, Amit Kapila wrote:
> On Sat, Mar 19, 2016 at 12:40 PM, Andres Freund <andres@anarazel.de> wrote:
> >
> > On March 18, 2016 11:52:08 PM PDT, Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> > >> >Won't the new code need to ensure that ResetEvent(latchevent)
> > >should
> > >> >get
> > >> >called in case WaitForMultipleObjects() comes out when both
> > >> >pgwin32_signal_event and latchevent are signalled at the same time?

> > >> WaitForMultiple only reports the readiness of one event at a time, no?
> > >>
> > >
> > >I don't think so; please read link [1] with a focus on the paragraph below,
> > >which states how it reports the readiness or signaled state when
> > >multiple
> > >objects become signaled.
> > >
> > >"When *bWaitAll* is *FALSE*, this function checks the handles in the
> > >array
> > >in order starting with index 0, until one of the objects is signaled.
> > >If
> > >multiple objects become signaled, the function returns the index of the
> > >first handle in the array whose object was signaled."

I think this is just incredibly bad documentation. See
https://blogs.msdn.microsoft.com/oldnewthing/20150409-00/?p=44273
(Raymond Chen can be considered an authority here imo).

Andres



Re: Performance degradation in commit ac1d794

From
Andres Freund
Date:
On 2016-03-19 16:44:49 +0530, Amit Kapila wrote:
> On Fri, Mar 18, 2016 at 1:34 PM, Andres Freund <andres@anarazel.de> wrote:
> >
> >
> > Attached is a significantly revised version of the earlier series. Most
> > importantly I have:
> > * Unified the window/unix latch implementation into one file (0004)
> >
> 
> After applying patch 0004* on HEAD, using command patch -p1 <
> <path_of_patch>, I am getting a build failure:
> 
> c1 : fatal error C1083: Cannot open source file:
> 'src/backend/storage/ipc/latch.c': No such file or directory
> 
> I think it could not rename port/unix_latch.c => storage/ipc/latch.c.  I
> have tried with git apply, but no success.  Am I doing something wrong?

You skipped applying 0003.

I'll send an updated version - with all the docs and such - in the next
hours.

> One minor suggestion about patch:
> 
> +#ifndef WIN32
>  void
>  latch_sigusr1_handler(void)
>  {
>   if (waiting)
>   sendSelfPipeByte();
>  }
> +#endif   /* !WIN32 */
> 
>  /* Send one byte to the self-pipe, to wake up WaitLatch */
> +#ifndef WIN32
>  static void
>  sendSelfPipeByte(void)
> 
> 
> Instead of individually defining these functions under #ifndef WIN32, isn't
> it better to combine them all as they are at end of file.

They're all at the end of the file. I just found it easier to reason
about if both #if and #endif are visible in one screen full of
code. Don't feel super strong about it tho.

Greetings,

Andres Freund



Re: Performance degradation in commit ac1d794

From
Andres Freund
Date:
On 2016-03-19 18:45:36 -0700, Andres Freund wrote:
> On 2016-03-19 16:44:49 +0530, Amit Kapila wrote:
> > On Fri, Mar 18, 2016 at 1:34 PM, Andres Freund <andres@anarazel.de> wrote:
> > >
> > >
> > > Attached is a significantly revised version of the earlier series. Most
> > > importantly I have:
> > > * Unified the window/unix latch implementation into one file (0004)
> > >
> >
> > After applying patch 0004* on HEAD, using command patch -p1 <
> > <path_of_patch>, I am getting a build failure:
> >
> > c1 : fatal error C1083: Cannot open source file:
> > 'src/backend/storage/ipc/latch.c': No such file or directory
> >
> > I think it could not rename port/unix_latch.c => storage/ipc/latch.c.  I
> > have tried with git apply, but no success.  Am I doing something wrong?
>
> You skipped applying 0003.
>
> I'll send an updated version - with all the docs and such - in the next
> hours.

Here we go. I think this is getting pretty close to being committable,
minus a bit of testing edge cases on unix (postmaster death,
disconnecting clients in various ways (especially with COPY)) and
windows (uh, does it even work at all?).

There are no large code changes in this revision, mainly some code
polishing and a good bit more comment improvements.

Regards,

Andres

Attachment

Re: Performance degradation in commit ac1d794

From
Thomas Munro
Date:
On Sun, Mar 20, 2016 at 5:14 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-03-19 18:45:36 -0700, Andres Freund wrote:
>> On 2016-03-19 16:44:49 +0530, Amit Kapila wrote:
>> > On Fri, Mar 18, 2016 at 1:34 PM, Andres Freund <andres@anarazel.de> wrote:
>> > >
>> > >
>> > > Attached is a significantly revised version of the earlier series. Most
>> > > importantly I have:
>> > > * Unified the window/unix latch implementation into one file (0004)
>> > >
>> >
>> > After applying patch 0004* on HEAD, using command patch -p1 <
>> > <path_of_patch>, I am getting a build failure:
>> >
>> > c1 : fatal error C1083: Cannot open source file:
>> > 'src/backend/storage/ipc/latch.c': No such file or directory
>> >
>> > I think it could not rename port/unix_latch.c => storage/ipc/latch.c.  I
>> > tried with git apply, but had no success.  Am I doing something wrong?
>>
>> You skipped applying 0003.
>>
>> I'll send an updated version - with all the docs and such - in the next
>> few hours.
>
> Here we go. I think this is getting pretty close to being committable,
> minus a bit of testing of edge cases on Unix (postmaster death,
> disconnecting clients in various ways, especially with COPY) and
> Windows (uh, does it even work at all?).
>
> There are no large code changes in this revision, mainly some code
> polishing and a good number of comment improvements.

Hi

I couldn't get the second patch to apply for some reason, but I have
been trying out your "latch" branch on some different OSs and porting
some code that does a bunch of waiting on many sockets over to this
API to try it out.

One thing that I want to do but can't with this interface is remove an
fd from the set.  I can call AddWaitEventToSet, which returns a
position, and I can call ModifyWaitEvent to supply a new event mask by
position (including a zero mask), but I can't actually remove the fd
(for example to avoid future error events that can't be masked, to
allow that fd to be closed and perhaps let that fd number
coincidentally be re-added later, or just generally to free the slot).
There is an underlying way to remove an fd from a set with poll (sort
of), epoll, and kqueue.  (Not sure about Windows.  But surely...)  I
wonder if there should be a RemoveWaitEventFromSet(set, position) which
recycles event slots, sticking them on a freelist (and setting the
corresponding pollfd structs' fd to -1).
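
To make that concrete, here is a minimal sketch of the freelist
variant, assuming the WaitEventSet layout from the posted patch; the
freelist fields (free_head, next_free) are invented for illustration
and are not in the patch:

void
RemoveWaitEventFromSet(WaitEventSet *set, int pos)
{
    WaitEvent  *event = &set->events[pos];

    Assert(pos >= 0 && pos < set->nevents);

#if defined(WAIT_USE_EPOLL)
    /* Unregister the fd from the epoll instance. */
    if (epoll_ctl(set->epoll_fd, EPOLL_CTL_DEL, event->fd, NULL) < 0)
        elog(ERROR, "epoll_ctl() failed: %m");
#elif defined(WAIT_USE_POLL)
    /* poll(2) ignores pollfd entries with a negative fd. */
    set->pollfds[pos].fd = -1;
#endif

    /* Park the slot on a freelist so AddWaitEventToSet can reuse it. */
    event->fd = PGINVALID_SOCKET;
    event->next_free = set->free_head;  /* invented freelist fields */
    set->free_head = pos;
}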

I wonder if it would be useful to reallocate the arrays as needed so
there isn't really a hard limit to the number of things you add, just
an initial size.

A couple of typos:

+ * Wait using linux' epoll_wait(2).

linux's

+       /*
+        * If neither the event mask nor the associated latch changes, return
+        * early. That's an important optimization for some sockets, were
+        * ModifyWaitEvent is frequently used to switch from waiting
for reads to
+        * waiting on writes.
+        */

s/were/where/

+       /*
+        * Even though unused, we also poss epoll_ev as the data argument if
+        * EPOLL_CTL_DELETE is passed as action.  There used to be an epoll bug
+        * requiring that, and acutally it makes the code simpler...
+        */

s/poss/pass/
s/EPOLL_CTL_DELETE/EPOLL_CTL_DEL/
s/acutally/actually/

There is no code passing EPOLL_CTL_DEL, but maybe this comment is a
clue that you have already implemented remove on some other branch...

-- 
Thomas Munro
http://www.enterprisedb.com



Re: Performance degradation in commit ac1d794

From
Andres Freund
Date:
On 2016-03-21 01:31:30 +1300, Thomas Munro wrote:
> I couldn't get the second patch to apply for some reason,

Weird? Even after applying the first one first?


> but I have been trying out your "latch" branch on some different OSs
> and porting some code that does a bunch of waiting on many sockets
> over to this API to try it out.

> One thing that I want to do but can't with this interface is remove an
> fd from the set.

Yea, I didn't see an in-core need for that, so I didn't want to add
that yet.

Other than that, how did things go?


> I can call AddWaitEventToSet, which returns a position, and I can
> call ModifyWaitEvent to supply a new event mask by position (including
> a zero mask), but I can't actually remove the fd (for example to avoid
> future error events that can't be masked, to allow that fd to be
> closed and perhaps let that fd number coincidentally be re-added
> later, or just generally to free the slot).  There is an underlying
> way to remove an fd from a set with poll (sort of), epoll, and kqueue.
> (Not sure about Windows.  But surely...)  I wonder if there should be
> a RemoveWaitEventFromSet(set, position) which recycles event slots,
> sticking them on a freelist (and setting the corresponding pollfd
> structs' fd to -1).

I'm inclined to think that if we add this, we should memmove everything
after the removed slot and adjust the offsets. As we e.g. pass the
entire pollfd array to poll() at once, not having gaps will be more
important.
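
A rough sketch of that compacting variant, reusing the patch's field
names (still hypothetical; the epoll case would additionally need an
EPOLL_CTL_DEL for the removed fd):

void
RemoveWaitEventFromSet(WaitEventSet *set, int pos)
{
    int     i;
    int     ntail = set->nevents - pos - 1;

#if defined(WAIT_USE_POLL)
    /* Keep the pollfd array dense, since we hand it to poll() whole. */
    memmove(&set->pollfds[pos], &set->pollfds[pos + 1],
            ntail * sizeof(struct pollfd));
#endif
    memmove(&set->events[pos], &set->events[pos + 1],
            ntail * sizeof(WaitEvent));
    set->nevents--;

    /* Every event after 'pos' shifted down; fix up recorded positions. */
    for (i = pos; i < set->nevents; i++)
        set->events[i].pos = i;
}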

What's the use case where you need that?


> I wonder if it would be useful to reallocate the arrays as needed so
> there isn't really a hard limit to the number of things you add, just
> an initial size.

That doubles the number of palloc()s required. As lots of these sets
are actually going to be very short-lived (a single WaitLatch call),
that's something I'd rather avoid. Although I guess you could just
allocate separate arrays at that moment.


> A couple of typos:
> 
> + * Wait using linux' epoll_wait(2).
> 
> linux's
> 
> +       /*
> +        * If neither the event mask nor the associated latch changes, return
> +        * early. That's an important optimization for some sockets, were
> +        * ModifyWaitEvent is frequently used to switch from waiting
> for reads to
> +        * waiting on writes.
> +        */
> 
> s/were/where/
> 
> +       /*
> +        * Even though unused, we also poss epoll_ev as the data argument if
> +        * EPOLL_CTL_DELETE is passed as action.  There used to be an epoll bug
> +        * requiring that, and acutally it makes the code simpler...
> +        */
> 
> s/poss/pass/
> s/EPOLL_CTL_DELETE/EPOLL_CTL_DEL/
> s/acutally/actually/
>

Thanks!


> There is no code passing EPOLL_CTL_DEL, but maybe this comment is a
> clue that you have already implemented remove on some other branch...

No, I've not, sorry...


Andres



Re: Performance degradation in commit ac1d794

From
Thomas Munro
Date:
On Mon, Mar 21, 2016 at 3:46 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-03-21 01:31:30 +1300, Thomas Munro wrote:
>> I couldn't get the second patch to apply for some reason,
>
> Weird? Even after applying the first one first?

Ah, I was using patch -p1.  I needed to use git am, which understands
how to rename stuff.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: Performance degradation in commit ac1d794

From
Thomas Munro
Date:
On Mon, Mar 21, 2016 at 3:46 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-03-21 01:31:30 +1300, Thomas Munro wrote:
>> I couldn't get the second patch to apply for some reason,
>
> Weird? Even after applying the first one first?
>
>
>> but I have been trying out your "latch" branch on some different OSs
>> and porting some code that does a bunch of waiting on many sockets
>> over to this API to try it out.
>
>> One thing that I want to do but can't with this interface is remove an
>> fd from the set.
>
> Yea, I didn't see an in-core need for that, so I didn't want to add
> that yet.
>
> Other than that, how did things go?

So far, so good.  No surprises.

>> I can call AddWaitEventToSet, which returns a position, and I can
>> call ModifyWaitEvent to supply a new event mask by position (including
>> a zero mask), but I can't actually remove the fd (for example to avoid
>> future error events that can't be masked, to allow that fd to be
>> closed and perhaps let that fd number coincidentally be re-added
>> later, or just generally to free the slot).  There is an underlying
>> way to remove an fd from a set with poll (sort of), epoll, and kqueue.
>> (Not sure about Windows.  But surely...)  I wonder if there should be
>> a RemoveWaitEventFromSet(set, position) which recycles event slots,
>> sticking them on a freelist (and setting the corresponding pollfd
>> structs' fd to -1).
>
> I'm inclined to think that if we add this, we should memmove everything
> after the removed slot and adjust the offsets. As we e.g. pass the
> entire pollfd array to poll() at once, not having gaps will be more
> important.
>
> What's the use case where you need that?

I was experimenting with a change to Append so that it could deal with
asynchronous subplans, and in particular support asynchronous foreign
scans.  See the attached patch, which applies on top of your latch
branch (or your two patches above, I assume) and does that.  It is not
a very ambitious form of asynchrony and there are opportunities to do
much more in this area (about which more later in some future thread),
but it is some relevant educational code I had to hand, so I ported it
to your new API as a way to try the API out.

The contract that I invented here is that an async-aware parent node
can ask any child node "are you ready?" and get back various answers
including an fd, which means: please don't call ExecProcNode until
this fd is ready to read.  But after ExecProcNode is called, the fd
must not be accessed again (the subplan has the right to close it,
return a different one next time, etc.), so it must not appear in any
WaitEventSet waited on after that.  As you can see, in
append_next_async_wait, it therefore needs to create a new
WaitEventSet every time it needs to wait, which makes it feel more
like select() than epoll().  Ideally it'd have just one WaitEventSet
for the lifetime of the append node, adding and removing fds as
required.

>> I wonder if it would be useful to reallocate the arrays as needed so
>> there isn't really a hard limit to the number of things you add, just
>> an initial size.
>
> That doubles the number of palloc()s required. As lots of these sets
> are actually going to be very short-lived (a single WaitLatch call),
> that's something I'd rather avoid. Although I guess you could just
> allocate separate arrays at that moment.

On the subject of short-lived and frequent calls to WaitLatchOrSocket,
I wonder if there would be some benefit in reusing a statically
allocated WaitEventSet for that.  That would become possible if you
could add and remove the latch and socket as discussed, with an
opportunity to skip the modification work completely if the reusable
WaitEventSet already happens to have the right stuff in it from the
last WaitLatchOrSocket call.  Or maybe the hot wait loops should
simply be rewritten to reuse a WaitEventSet explicitly so they can
manage that...
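
A minimal sketch of that explicit-reuse pattern for a hot loop,
assuming the API in this thread (sock and want_write are illustrative
names):

WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3);
int         sock_pos;
WaitEvent   ev;

/* Build the set once, outside the hot loop. */
AddWaitEventToSet(set, WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
AddWaitEventToSet(set, WL_POSTMASTER_DEATH, PGINVALID_SOCKET, NULL, NULL);
sock_pos = AddWaitEventToSet(set, WL_SOCKET_READABLE, sock, NULL, NULL);

for (;;)
{
    /* Cheap no-op when the mask is unchanged from the last iteration. */
    ModifyWaitEvent(set, sock_pos,
                    want_write ? WL_SOCKET_WRITEABLE : WL_SOCKET_READABLE,
                    NULL);
    WaitEventSetWait(set, -1, &ev, 1);
    if (ev.events & WL_LATCH_SET)
        ResetLatch(MyLatch);
    /* ... handle the event, possibly flipping want_write ... */
}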

Some other assorted thoughts:

* It looks like epoll (and maybe kqueue?) can associate user data with
an event and give it back to you; if you have a mapping between fds
and some other object (in the case of the attached patch, subplan
nodes that are ready to be pulled), would it be possible and useful to
expose that (and simulate it where needed) rather than the caller
having to maintain associative data structures (see the linear search
in my patch)?

* I would be interested in writing a kqueue implementation of this for
*BSD (and MacOSX?) at some point if someone doesn't beat me to it.

* It would be very cool to get some visibility into which
WaitEventSetWait call a backend is waiting in, via a system view, if
it could be done cheaply enough.  A name/ID for the call site, and an
indication of which latch and how many fds...  or something.

--
Thomas Munro
http://www.enterprisedb.com

Attachment

Re: Performance degradation in commit ac1d794

From
Amit Kapila
Date:
On Sun, Mar 20, 2016 at 7:13 AM, Andres Freund <andres@anarazel.de> wrote:
>
> On 2016-03-19 15:43:27 +0530, Amit Kapila wrote:
> > On Sat, Mar 19, 2016 at 12:40 PM, Andres Freund <andres@anarazel.de> wrote:
> > >
> > > On March 18, 2016 11:52:08 PM PDT, Amit Kapila <amit.kapila16@gmail.com>
> > wrote:
> > > >> >Won't the new code need to ensure that ResetEvent(latchevent) should
> > > >> >get called in case WaitForMultipleObjects() comes out when both
> > > >> >pgwin32_signal_event and latchevent are signalled at the same time?
>
> > > >> WaitForMultipleObjects only reports the readiness of one event at a time, no?
> > > >>
> > > >
> > > >I don't think so; please read link [1] with a focus on the paragraph
> > > >below, which states how it reports the readiness or signaled state
> > > >when multiple objects become signaled.
> > > >
> > > >"When *bWaitAll* is *FALSE*, this function checks the handles in the
> > > >array
> > > >in order starting with index 0, until one of the objects is signaled.
> > > >If
> > > >multiple objects become signaled, the function returns the index of the
> > > >first handle in the array whose object was signaled."
>
> I think this is just incredibly bad documentation. See
> https://blogs.msdn.microsoft.com/oldnewthing/20150409-00/?p=44273
> (Raymond Chen can be considered an authority here imo).
>

The article you pointed to confirms that the way the patch does ResetEvent is correct.  I am not sure, but you may want to weigh whether a comment is needed, so that if we want to enhance this part of the code (or write something similar) in the future, we don't need to rediscover this fact.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Performance degradation in commit ac1d794

From
Andres Freund
Date:

On March 21, 2016 5:12:38 AM GMT+01:00, Amit Kapila <amit.kapila16@gmail.com> wrote:

>The article you pointed to confirms that the way the patch does
>ResetEvent is correct.  I am not sure, but you may want to weigh whether
>a comment is needed, so that if we want to enhance this part of the code
>(or write something similar) in the future, we don't need to rediscover
>this fact.

I've added a reference in a comment.

Did you have a chance to run the patched versions on Windows?

I plan to push this sometime today, so I can get on to some performance patches I was planning to look into
committing.

Andres

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: Performance degradation in commit ac1d794

From
Amit Kapila
Date:
On Mon, Mar 21, 2016 at 10:21 AM, Andres Freund <andres@anarazel.de> wrote:


On March 21, 2016 5:12:38 AM GMT+01:00, Amit Kapila <amit.kapila16@gmail.com> wrote:

>The article you pointed to confirms that the way the patch does
>ResetEvent is correct.  I am not sure, but you may want to weigh whether
>a comment is needed, so that if we want to enhance this part of the code
>(or write something similar) in the future, we don't need to rediscover
>this fact.

I've added a reference in a comment.

Did you have a chance to run the patched versions on Windows?


I am planning to do it in the next few hours.
 
I plan to push this sometime today, so I can get on to some performance patches I was planning to look into committing.


Have we done testing to ensure that it actually mitigates the performance degradation due to commit ac1d794?  I wanted to do that, but unfortunately the high-end machine on which this problem is reproducible is down.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Performance degradation in commit ac1d794

From
Andres Freund
Date:
Hi,


On 2016-03-21 11:52:43 +1300, Thomas Munro wrote:
> The contract that I invented here is that an async-aware parent node
> can ask any child node "are you ready?" and get back various answers
> including an fd, which means: please don't call ExecProcNode until
> this fd is ready to read.  But after ExecProcNode is called, the fd
> must not be accessed again (the subplan has the right to close it,
> return a different one next time, etc.), so it must not appear in any
> WaitEventSet waited on after that.

Hm, why did you choose to go with that contract? Why do children need to
switch fds and such?


> As you can see, in append_next_async_wait, it therefore needs to
> create a new WaitEventSet every time it needs to wait, which makes it
> feel more like select() than epoll().  Ideally it'd have just one
> WaitEventSet for the lifetime of the append node, adding and removing
> fds as required.

Yes, I see. I think that'd be less efficient than not manipulating the
set all the time, but still a lot more efficient than adding/removing
the postmaster death latch every round.

Anyway, I want to commit this ASAP, so I can get on to some patches in
the CF. But I'd welcome you playing with adding that ability.


> On the subject of short-lived and frequent calls to WaitLatchOrSocket,
> I wonder if there would be some benefit in reusing a statically
> allocated WaitEventSet for that.

Wondered the same.  It looks like there'd need to be a number of them
(just latch; latch + postmaster death) for it to be beneficial. Most
latch waits aren't that frequent though, so I've decided not to go
there initially.


> That would become possible if you
> could add and remove the latch and socket as discussed, with an
> opportunity to skip the modification work completely if the reusable
> WaitEventSet already happens to have the right stuff in it from the
> last WaitLatchOrSocket call.  Or maybe the hot wait loops should
> simply be rewritten to reuse a WaitEventSet explicitly so they can
> manage that...

Yea, I think that's better.


> * It looks like epoll (and maybe kqueue?) can associate user data with
> an event and give it back to you; if you have a mapping between fds
> and some other object (in the case of the attached patch, subplan
> nodes that are ready to be pulled), would it be possible and useful to
> expose that (and simulate it where needed) rather than the caller
> having to maintain associative data structures (see the linear search
> in my patch)?

We already use the pointer that epoll gives you. But note that the
'WaitEvent' struct already contains the original position the event
was registered at; that's filled in when you call WaitEventSetWait().
So you could just build an array indexed by ->pos, or we could add a
void *private or such to the definition.
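
For example (a sketch; subplan_by_pos and MAX_ASYNC_SUBPLANS are
invented names):

PlanState  *subplan_by_pos[MAX_ASYNC_SUBPLANS];
WaitEvent   occurred;
int         pos;

/* Remember which subplan lives at each registered position... */
pos = AddWaitEventToSet(set, WL_SOCKET_READABLE, fd, NULL, NULL);
subplan_by_pos[pos] = subplan;

/* ...then map a fired event straight back via ->pos, with no search. */
if (WaitEventSetWait(set, -1, &occurred, 1) == 1)
{
    PlanState  *ready = subplan_by_pos[occurred.pos];

    /* pull the next tuple from 'ready' */
}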


> * I would be interested in writing a kqueue implementation of this for
> *BSD (and MacOSX?) at some point if someone doesn't beat me to it.

I hoped that somebody would do that - that'd afaics be the only major
API missing.


> * It would be very cool to get some view into which WaitEventSetWait
> call backends are waiting in from a system view if it could be done
> cheaply enough.  A name/ID for the call site, and an indication of
> which latch and how many fds...  or something.

That's not something I plan to tackle today; we don't have wait event
integration for latches atm. There are some threads somewhere about
this, but that seems mostly independent of the way latches are
implemented.


Thanks,

Andres



Re: Performance degradation in commit ac1d794

From
Amit Kapila
Date:
On Mon, Mar 21, 2016 at 10:26 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Mar 21, 2016 at 10:21 AM, Andres Freund <andres@anarazel.de> wrote:
>>
>>
>>
>> On March 21, 2016 5:12:38 AM GMT+01:00, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> >The article you pointed to confirms that the way the patch does
>> >ResetEvent is correct.  I am not sure, but you may want to weigh
>> >whether a comment is needed, so that if we want to enhance this part
>> >of the code (or write something similar) in the future, we don't need
>> >to rediscover this fact.
>>
>> I've added a reference in a comment.
>>
>> Did you have a chance to run the patched versions on Windows?
>>
>
> I am planning to do it in the next few hours.
>

With 0002-Introduce-new-WaitEventSet-API applied, initdb is successful, but server start leads to the problem below:

LOG:  database system was shut down at 2016-03-21 11:17:13 IST
LOG:  MultiXact member wraparound protections are now enabled
LOG:  database system is ready to accept connections
LOG:  autovacuum launcher started
FATAL:  failed to set up event for socket: error code 10022
LOG:  statistics collector process (PID 668) exited with exit code 1
LOG:  autovacuum launcher process (PID 3868) was terminated by exception 0xC0000005
HINT:  See C include file "ntstatus.h" for a description of the hexadecimal value.

I haven't investigated the problem yet.



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Performance degradation in commit ac1d794

From
Andres Freund
Date:
Hi,

On 2016-03-21 11:26:43 +0530, Amit Kapila wrote:
> LOG:  database system was shut down at 2016-03-21 11:17:13 IST
> LOG:  MultiXact member wraparound protections are now enabled
> LOG:  database system is ready to accept connections
> LOG:  autovacuum launcher started
> FATAL:  failed to set up event for socket: error code 10022
> LOG:  statistics collector process (PID 668) exited with exit code 1
> LOG:  autovacuum launcher process (PID 3868) was terminated by exception
> 0xC0000005
> HINT:  See C include file "ntstatus.h" for a description of the hexadecimal
> value.

We resolved this and two follow-up issues over IM; all were localized
issues. Thanks again for the help!

I've attached two refreshed patches including the relevant fixes, and
the addition of a 'user_data' pointer to events, as desired by Thomas.

I plan to push these once some MSVC buildfarm animals turn green. We
might want to iterate on parts of the API, but this will fix the
regression, and imo it looks good.

Regards,

Andres

Attachment

Re: Performance degradation in commit ac1d794

From
Andres Freund
Date:
Hi,

I've pushed the new API. We might want to use it in more places...

On 2015-12-25 20:08:15 +0300, Васильев Дмитрий wrote:
> I suddenly found commit ac1d794 gives up to 3 times performance degradation.
> 
> I tried to run pgbench -s 1000 -j 48 -c 48 -S -M prepared on 70 CPU-core
> machine:
> commit ac1d794 gives me 363,474 tps
> and previous commit a05dc4d gives me 956,146
> and master( 3d0c50f ) with revert ac1d794 gives me 969,265

Could you please verify that the performance figures are good now? I've
tested this on a larger two-socket system, where I managed to find a
configuration showing the slowdown (-c64 -j4), and I confirmed the fix.

Regards,

Andres



Re: Performance degradation in commit ac1d794

From
Thomas Munro
Date:
On Mon, Mar 21, 2016 at 6:09 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-03-21 11:52:43 +1300, Thomas Munro wrote:
>> * I would be interested in writing a kqueue implementation of this for
>> *BSD (and MacOSX?) at some point if someone doesn't beat me to it.
>
> I hoped that somebody would do that - that'd afaics be the only major
> API missing.

Here's a first swing at it, though I need to find some spare time to
test it and figure out whether details like error conditions and EOF
are handled correctly.  It could in theory minimise the number of
syscalls it makes by buffering changes (additions and, hopefully one
day, removals) and then using a single kevent syscall to apply all
modifications to the set and begin waiting, but for now it mirrors the
epoll code.  One user-visible difference compared to epoll/poll/select
is that it delivers readable and writable events separately if both
conditions are true for a single fd.  It builds and appears to work
correctly on FreeBSD 10.2 and Mac OS X 10.10.2.  I'm sharing this early
version in case any BSD users have feedback.  I hope to do some testing
this weekend.
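
For reference, the heart of the registration step is small. Here is a
minimal sketch (not the attached patch's actual code; kqueue_fd is an
invented field, created once with kqueue(2)):

#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>

static void
WaitEventAdjustKqueue(WaitEventSet *set, WaitEvent *event)
{
    struct kevent ev[2];
    int         n = 0;

    /* Read and write interest are separate kqueue filters, which is
     * why both can be reported independently for a single fd. */
    if (event->events & WL_SOCKET_READABLE)
        EV_SET(&ev[n++], event->fd, EVFILT_READ, EV_ADD, 0, 0, event);
    if (event->events & WL_SOCKET_WRITEABLE)
        EV_SET(&ev[n++], event->fd, EVFILT_WRITE, EV_ADD, 0, 0, event);

    /* Apply the changes immediately; collect no events in this call. */
    if (kevent(set->kqueue_fd, ev, n, NULL, 0, NULL) < 0)
        elog(ERROR, "kevent() failed: %m");
}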

--
Thomas Munro
http://www.enterprisedb.com

Attachment

Re: Performance degradation in commit ac1d794

From
Noah Misch
Date:
On Fri, Dec 25, 2015 at 08:08:15PM +0300, Васильев Дмитрий wrote:
> I suddenly found commit ac1d794 gives up to 3 times performance degradation.
> 
> I tried to run pgbench -s 1000 -j 48 -c 48 -S -M prepared on 70 CPU-core
> machine:
> commit ac1d794 gives me 363,474 tps
> and previous commit a05dc4d gives me 956,146
> and master( 3d0c50f ) with revert ac1d794 gives me 969,265

[This is a generic notification.]

The above-described topic is currently a PostgreSQL 9.6 open item.  Robert,
since you committed the patch believed to have created it, you own this open
item.  If some other commit is more relevant or if this does not belong as a
9.6 open item, please let us know.  Otherwise, please observe the policy on
open item ownership[1] and send a status update within 72 hours of this
message.  Include a date for your subsequent status update.  Testers may
discover new open items at any time, and I want to plan to get them all fixed
well in advance of shipping 9.6rc1.  Consequently, I will appreciate your
efforts toward speedy resolution.  Thanks.

[1] http://www.postgresql.org/message-id/20160527025039.GA447393@tornado.leadboat.com



Re: Performance degradation in commit ac1d794

From
Robert Haas
Date:
On Sun, May 29, 2016 at 1:40 AM, Noah Misch <noah@leadboat.com> wrote:
> On Fri, Dec 25, 2015 at 08:08:15PM +0300, Васильев Дмитрий wrote:
>> I suddenly found commit ac1d794 gives up to 3 times performance degradation.
>>
>> I tried to run pgbench -s 1000 -j 48 -c 48 -S -M prepared on 70 CPU-core
>> machine:
>> commit ac1d794 gives me 363,474 tps
>> and previous commit a05dc4d gives me 956,146
>> and master( 3d0c50f ) with revert ac1d794 gives me 969,265
>
> [This is a generic notification.]
>
> The above-described topic is currently a PostgreSQL 9.6 open item.  Robert,
> since you committed the patch believed to have created it, you own this open
> item.  If some other commit is more relevant or if this does not belong as a
> 9.6 open item, please let us know.  Otherwise, please observe the policy on
> open item ownership[1] and send a status update within 72 hours of this
> message.  Include a date for your subsequent status update.  Testers may
> discover new open items at any time, and I want to plan to get them all fixed
> well in advance of shipping 9.6rc1.  Consequently, I will appreciate your
> efforts toward speedy resolution.  Thanks.

So, the reason this is back on the open items list is that Mithun Cy
re-reported this problem in:

https://www.postgresql.org/message-id/CAD__OuhPmc6XH=wYRm_+Q657yQE88DakN4=Ybh2oveFasHkoeA@mail.gmail.com

When I saw that, I moved this from CLOSE_WAIT back to open.  However,
subsequently, Ashutosh Sharma posted this, which suggests (not
conclusively) that in fact the problem has been fixed:

https://www.postgresql.org/message-id/CAE9k0PkFEhVq-Zg4MH0bZ-zt_oE5PAS6dAuxRCXwX9kEVWceag@mail.gmail.com

What I *think* is going on here is:

- ac1d794 lowered performance
- backend_flush_after with a non-zero default lowered performance with
a vengeance
- 98a64d0 repaired the damage done by ac1d794, or much of it, but
Mithun couldn't see it in his benchmarks because backend_flush_after>0
is so bad

That could be wrong, but I haven't seen any evidence that it's wrong.
So I'm inclined to say we should just move this open item back to the
CLOSE_WAIT list (adding a link to this email to explain why we did
so).  Does that work for you?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Performance degradation in commit ac1d794

From
Noah Misch
Date:
On Tue, May 31, 2016 at 10:09:05PM -0400, Robert Haas wrote:
> On Sun, May 29, 2016 at 1:40 AM, Noah Misch <noah@leadboat.com> wrote:
> > On Fri, Dec 25, 2015 at 08:08:15PM +0300, Васильев Дмитрий wrote:
> >> I suddenly found commit ac1d794 gives up to 3 times performance degradation.
> >>
> >> I tried to run pgbench -s 1000 -j 48 -c 48 -S -M prepared on 70 CPU-core
> >> machine:
> >> commit ac1d794 gives me 363,474 tps
> >> and previous commit a05dc4d gives me 956,146
> >> and master( 3d0c50f ) with revert ac1d794 gives me 969,265
> >
> > [This is a generic notification.]
> >
> > The above-described topic is currently a PostgreSQL 9.6 open item.  Robert,
> > since you committed the patch believed to have created it, you own this open
> > item.  If some other commit is more relevant or if this does not belong as a
> > 9.6 open item, please let us know.  Otherwise, please observe the policy on
> > open item ownership[1] and send a status update within 72 hours of this
> > message.  Include a date for your subsequent status update.  Testers may
> > discover new open items at any time, and I want to plan to get them all fixed
> > well in advance of shipping 9.6rc1.  Consequently, I will appreciate your
> > efforts toward speedy resolution.  Thanks.
> 
> So, the reason this is back on the open items list is that Mithun Cy
> re-reported this problem in:
> 
> https://www.postgresql.org/message-id/CAD__OuhPmc6XH=wYRm_+Q657yQE88DakN4=Ybh2oveFasHkoeA@mail.gmail.com
> 
> When I saw that, I moved this from CLOSE_WAIT back to open.  However,
> subsequently, Ashutosh Sharma posted this, which suggests (not
> conclusively) that in fact the problem has been fixed:
> 
> https://www.postgresql.org/message-id/CAE9k0PkFEhVq-Zg4MH0bZ-zt_oE5PAS6dAuxRCXwX9kEVWceag@mail.gmail.com
> 
> What I *think* is going on here is:
> 
> - ac1d794 lowered performance
> - backend_flush_after with a non-zero default lowered performance with
> a vengeance
> - 98a64d0 repaired the damage done by ac1d794, or much of it, but
> Mithun couldn't see it in his benchmarks because backend_flush_after>0
> is so bad

Ashutosh Sharma's measurements do bolster that conclusion.

> That could be wrong, but I haven't seen any evidence that it's wrong.
> So I'm inclined to say we should just move this open item back to the
> CLOSE_WAIT list (adding a link to this email to explain why we did
> so).  Does that work for you?

That works for me.  



Re: Performance degradation in commit ac1d794

From
Tom Lane
Date:
Noah Misch <noah@leadboat.com> writes:
> On Tue, May 31, 2016 at 10:09:05PM -0400, Robert Haas wrote:
>> What I *think* is going on here is:
>> - ac1d794 lowered performance
>> - backend_flush_after with a non-zero default lowered performance with
>> a vengeance
>> - 98a64d0 repaired the damage done by ac1d794, or much of it, but
>> Mithun couldn't see it in his benchmarks because backend_flush_after>0
>> is so bad

> Ashutosh Sharma's measurements do bolster that conclusion.

>> That could be wrong, but I haven't seen any evidence that it's wrong.
>> So I'm inclined to say we should just move this open item back to the
>> CLOSE_WAIT list (adding a link to this email to explain why we did
>> so).  Does that work for you?

> That works for me.  

Can we make a note to re-examine this after the backend_flush_after
business is resolved?  Or at least get Mithun to redo his benchmarks
with backend_flush_after set to zero?
        regards, tom lane



Re: Performance degradation in commit ac1d794

From
Robert Haas
Date:
On Tue, May 31, 2016 at 10:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Noah Misch <noah@leadboat.com> writes:
>> On Tue, May 31, 2016 at 10:09:05PM -0400, Robert Haas wrote:
>>> What I *think* is going on here is:
>>> - ac1d794 lowered performance
>>> - backend_flush_after with a non-zero default lowered performance with
>>> a vengeance
>>> - 98a64d0 repaired the damage done by ac1d794, or much of it, but
>>> Mithun couldn't see it in his benchmarks because backend_flush_after>0
>>> is so bad
>
>> Ashutosh Sharma's measurements do bolster that conclusion.
>
>>> That could be wrong, but I haven't seen any evidence that it's wrong.
>>> So I'm inclined to say we should just move this open item back to the
>>> CLOSE_WAIT list (adding a link to this email to explain why we did
>>> so).  Does that work for you?
>
>> That works for me.
>
> Can we make a note to re-examine this after the backend_flush_after
> business is resolved?  Or at least get Mithun to redo his benchmarks
> with backend_flush_after set to zero?

Ashutosh Sharma already did pretty much that test in the email to
which I linked.

(Ashutosh Sharma and Mithun CY work in the same office and have access
to the same hardware.)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company