Thread: Performance degradation in commit ac1d794

Performance degradation in commit ac1d794

From
Васильев Дмитрий
Date:
Hello hackers!

I suddenly found commit ac1d794 gives up to 3 times performance degradation.

I tried to run pgbench -s 1000 -j 48 -c 48 -S -M prepared on 70 CPU-core machine:
commit ac1d794 gives me 363,474 tps
and previous commit a05dc4d gives me 956,146
and master( 3d0c50f ) with revert ac1d794 gives me 969,265

it's shocking

---
Dmitry Vasilyev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Re: Performance degradation in commit ac1d794

From
Andres Freund
Date:
On December 25, 2015 6:08:15 PM GMT+01:00, "Васильев Дмитрий" <d.vasilyev@postgrespro.ru> wrote:
>Hello hackers!
>
>I suddenly found commit ac1d794 gives up to 3 times performance
>degradation.
>
>I tried to run pgbench -s 1000 -j 48 -c 48 -S -M prepared on 70
>CPU-core
>machine:
>commit ac1d794 gives me 363,474 tps
>and previous commit a05dc4d gives me 956,146
>and master( 3d0c50f ) with revert ac1d794 gives me 969,265
>
>it's shocking



Hi,

You're talking about
http://git.postgresql.org/gitweb/?p=postgresql.git;a=blobdiff;f=src/backend/libpq/be-secure.c;h=2ddcf428f89fd12c230d6f417c2f707fbd97bf39;hp=26d8faaf773a818b388b899b8d83d617bdf7af9b;hb=ac1d794;hpb=a05dc4d7fd57d4ae084c1f0801973e5c1a1aa26e

If so, could you provide a hierarchical before/after profile?

Andres

---
Please excuse brevity and formatting - I am writing this on my mobile phone.



Re: Performance degradation in commit ac1d794

From
Васильев Дмитрий
Date:

2015-12-25 20:18 GMT+03:00 Andres Freund <andres@anarazel.de>:
> [...]

Yes, about this.

> If so, could you provide a hierarchical before/after profile?

Performance | Git hash commit                          | Date
~ 360k tps  | c3e7c24a1d60dc6ad56e2a0723399f1570c54224 | Thu Nov 12 09:12:18 2015 -0500
~ 360k tps  | ac1d7945f866b1928c2554c0f80fd52d7f977772 | Thu Nov 12 09:00:33 2015 -0500
~ 960k tps  | a05dc4d7fd57d4ae084c1f0801973e5c1a1aa26e | Thu Nov 12 07:40:31 2015 -0500


---
Dmitry Vasilyev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Re: Performance degradation in commit ac1d794

From
Васильев Дмитрий
Date:
2015-12-25 20:27 GMT+03:00 Васильев Дмитрий <d.vasilyev@postgrespro.ru>:

> [...]
>
> Performance | Git hash commit                          | Date
> ~ 360k tps  | c3e7c24a1d60dc6ad56e2a0723399f1570c54224 | Thu Nov 12 09:12:18 2015 -0500
> ~ 360k tps  | ac1d7945f866b1928c2554c0f80fd52d7f977772 | Thu Nov 12 09:00:33 2015 -0500
> ~ 960k tps  | a05dc4d7fd57d4ae084c1f0801973e5c1a1aa26e | Thu Nov 12 07:40:31 2015 -0500


I came across it by studying the results:

vanilla 9.5 = 30020c3fc3b6de5592978313df929d370f5770ce
vanilla 9.6 = c4a8812cf64b142685e39a69694c5276601f40e4

         info          | clients |   tps
-----------------------+---------+---------
 vanilla 9.5           |       1 |   30321
 vanilla 9.5           |       8 |  216542
 vanilla 9.5           |      16 |  412526
 vanilla 9.5           |      32 |  751331
 vanilla 9.5           |      48 |  956146   <- this point
 vanilla 9.5           |      56 |  990122
 vanilla 9.5           |      64 |  842436
 vanilla 9.5           |      72 |  913272
 vanilla 9.5           |      82 |  659332
 vanilla 9.5           |      92 |  630111
 vanilla 9.5           |      96 |  616863
 vanilla 9.5           |     110 |  592080
 vanilla 9.5           |     120 |  575831
 vanilla 9.5           |     130 |  557521
 vanilla 9.5           |     140 |  537951
 vanilla 9.5           |     150 |  517019
 vanilla 9.5           |     160 |  502312
 vanilla 9.5           |     170 |  489162
 vanilla 9.5           |     180 |  477178
 vanilla 9.5           |     190 |  464620
 vanilla 9.6           |       1 |   31738
 vanilla 9.6           |       8 |  219692
 vanilla 9.6           |      16 |  422933
 vanilla 9.6           |      32 |  375546
 vanilla 9.6           |      48 |  363474   <- this point
 vanilla 9.6           |      56 |  352943
 vanilla 9.6           |      64 |  334498
 vanilla 9.6           |      72 |  369802
 vanilla 9.6           |      82 |  604867
 vanilla 9.6           |      92 |  871048
 vanilla 9.6           |      96 |  969265
 vanilla 9.6           |     105 |  996794
 vanilla 9.6           |     110 |  932853
 vanilla 9.6           |     115 |  758485
 vanilla 9.6           |     120 |  721365
 vanilla 9.6           |     125 |  632265
 vanilla 9.6           |     130 |  624666
 vanilla 9.6           |     135 |  582120
 vanilla 9.6           |     140 |  583080
 vanilla 9.6           |     150 |  555608
 vanilla 9.6           |     160 |  533340
 vanilla 9.6           |     170 |  520308
 vanilla 9.6           |     180 |  504536
 vanilla 9.6           |     190 |  496967

---
Dmitry Vasilyev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


Re: Performance degradation in commit ac1d794

From
Andres Freund
Date:
On December 25, 2015 6:27:06 PM GMT+01:00, "Васильев Дмитрий" <d.vasilyev@postgrespro.ru> wrote:
>> If so, could you provide a hierarchical before/after profile?
>
> Performance | Git hash commit                          | Date
> ~ 360k tps  | c3e7c24a1d60dc6ad56e2a0723399f1570c54224 | Thu Nov 12 09:12:18 2015 -0500
> ~ 360k tps  | ac1d7945f866b1928c2554c0f80fd52d7f977772 | Thu Nov 12 09:00:33 2015 -0500
> ~ 960k tps  | a05dc4d7fd57d4ae084c1f0801973e5c1a1aa26e | Thu Nov 12 07:40:31 2015 -0500

Profile as in perf, oprofile or something.

--- 
Please excuse brevity and formatting - I am writing this on my mobile phone.



Re: Performance degradation in commit ac1d794

From
Васильев Дмитрий
Date:

2015-12-25 20:44 GMT+03:00 Andres Freund <andres@anarazel.de>:
> [...]
>
> Profile as in perf, oprofile or something.

ac1d794:

Samples: 1M of event 'cycles', Event count (approx.): 816922259995, UID: pgpro

Overhead  Shared Object       Symbol
  69,72%  [kernel]            [k] _raw_spin_lock_irqsave
   1,43%  postgres            [.] _bt_compare
   1,19%  postgres            [.] LWLockAcquire
   0,99%  postgres            [.] hash_search_with_hash_value
   0,61%  postgres            [.] PinBuffer
   0,46%  postgres            [.] GetSnapshotData

a05dc4d:

Samples: 1M of event 'cycles', Event count (approx.): 508150718694, UID: pgpro

Overhead  Shared Object       Symbol
   4,77%  postgres            [.] GetSnapshotData
   4,30%  postgres            [.] _bt_compare
   3,13%  postgres            [.] hash_search_with_hash_value
   3,08%  postgres            [.] LWLockAcquire
   2,09%  postgres            [.] LWLockRelease
   2,03%  postgres            [.] PinBuffer


Perf record generates a lot of data:

time perf record -u pgpro -g --call-graph=dwarf
^C[ perf record: Woken up 0 times to write data ]
Warning:
Processed 1078453 events and lost 18257 chunks!
Check IO/CPU overload!
[ perf record: Captured and wrote 8507.985 MB perf.data (1055663 samples) ]
real    0m8.791s
user    0m0.678s
sys     0m8.120s



If you want, I can give you ssh access.

Re: Performance degradation in commit ac1d794

From
Tom Lane
Date:
Васильев Дмитрий <d.vasilyev@postgrespro.ru> writes:
> Samples: 1M of event 'cycles', Event count (approx.): 816922259995, UID: pgpro
>   Overhead  Shared Object       Symbol

>   69,72%  [kernel]            [k] _raw_spin_lock_irqsave
>    1,43%  postgres            [.] _bt_compare
>    1,19%  postgres            [.] LWLockAcquire
>    0,99%  postgres            [.] hash_search_with_hash_value
>    0,61%  postgres            [.] PinBuffer

Seems like what you've got here is a kernel bug.
        regards, tom lane



Re: Performance degradation in commit ac1d794

From
Andres Freund
Date:
On December 25, 2015 7:10:23 PM GMT+01:00, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>Васильев Дмитрий <d.vasilyev@postgrespro.ru> writes:
>> Samples: 1M of event 'cycles', Event count (approx.): 816922259995, UID: pgpro
>>   Overhead  Shared Object       Symbol
>
>>   69,72%  [kernel]            [k] _raw_spin_lock_irqsave
>>    1,43%  postgres            [.] _bt_compare
>>    1,19%  postgres            [.] LWLockAcquire
>>    0,99%  postgres            [.] hash_search_with_hash_value
>>    0,61%  postgres            [.] PinBuffer
>
>Seems like what you've got here is a kernel bug.

I wouldn't go as far as calling it a kernel bug. We're still doing 300k tps. And we're triggering the performance
degradation by adding another socket (IIRC) to the poll(2) call.

It would certainly be interesting to see the expanded tree below the spinlock. I wonder if this is related to directed
wakeups.

Andres

--- 
Please excuse brevity and formatting - I am writing this on my mobile phone.



Re: Performance degradation in commit ac1d794

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On December 25, 2015 7:10:23 PM GMT+01:00, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Seems like what you've got here is a kernel bug.

> I wouldn't go as far as calling it a kernel bug. We're still doing 300k tps. And we're triggering the performance
> degradation by adding another socket (IIRC) to the poll(2) call.

Hmm.  And all those FDs point to the same pipe.  I wonder if we're looking
at contention for some pipe-related data structure inside the kernel.
        regards, tom lane



Re: Performance degradation in commit ac1d794

From
Васильев Дмитрий
Date:


2015-12-25 21:28 GMT+03:00 Tom Lane <tgl@sss.pgh.pa.us>:
> Hmm.  And all those FDs point to the same pipe.  I wonder if we're looking
> at contention for some pipe-related data structure inside the kernel.

I ran bt on the backends and found them in the following state:

#0  0x00007f77b0e5bb60 in __poll_nocancel () from /lib64/libc.so.6
#1  0x00000000006a7cd0 in WaitLatchOrSocket (latch=0x7f779e2e96c4, wakeEvents=wakeEvents@entry=19, sock=9, timeout=timeout@entry=0) at pg_latch.c:333
#2  0x0000000000612c7d in secure_read (port=0x17e6af0, ptr=0xcc94a0 <PqRecvBuffer>, len=8192) at be-secure.c:147
#3  0x000000000061be36 in pq_recvbuf () at pqcomm.c:915
#4  pq_getbyte () at pqcomm.c:958
#5  0x0000000000728ad5 in SocketBackend (inBuf=0x7ffd8b6b1460) at postgres.c:345
Perf shows _raw_spin_lock_irqsave call remove_wait_queue add_wait_queue


---
Dmitry Vasilyev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Re: Performance degradation in commit ac1d794

From
Васильев Дмитрий
Date:

2015-12-25 22:42 GMT+03:00 Васильев Дмитрий <d.vasilyev@postgrespro.ru>:

> I ran bt on the backends and found them in the following state:
> [...]
> Perf shows _raw_spin_lock_irqsave call remove_wait_queue add_wait_queue


I'm sorry, I meant that the remove_wait_queue and add_wait_queue functions call _raw_spin_lock_irqsave, and that is what takes most of the CPU time.

uname -a: 3.10.0-229.20.1.el7.x86_64 #1 SMP Tue Nov 3 19:10:07 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux


---
Dmitry Vasilyev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Re: Performance degradation in commit ac1d794

From
Andres Freund
Date:
On 2015-12-25 13:28:55 -0500, Tom Lane wrote:
> Hmm.  And all those FDs point to the same pipe.  I wonder if we're looking
> at contention for some pipe-related data structure inside the kernel.

Sounds fairly likely - and not too surprising. In this scenario we've a
couple 100k registrations/unregistrations to a pretty fundamentally
shared resource (the wait queue for changes to the pipe). Not that
surprising that it becomes a problem.

There's a couple solutions I can think of to that problem:

1) Use epoll()/kqueue, or other similar interfaces that don't require
   re-registering fds at every invocation. My guess is that that'd be
   desirable for performance anyway.

2) Create a pair of fds between postmaster/backend for each backend.
   While obviously increasing the number of FDs noticeably, it's
   interesting for other features as well: If we ever want to do FD
   passing from postmaster to existing backends, we're going to need
   that anyway.

3) Replace the postmaster_alive_fds socketpair by some other signalling
   mechanism. E.g. sending a procsignal to each backend, which sets the
   latch and a special flag in the latch structure.

Andres



Re: Performance degradation in commit ac1d794

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> There's a couple solutions I can think of to that problem:
> 1) Use epoll()/kqueue, or other similar interfaces that don't require
>    re-registering fds at every invocation. My guess is that that'd be
>    desirable for performance anyway.

Portability, on the other hand, would be problematic.

> 2) Create a pair of fds between postmaster/backend for each
>    backend. While obviously increasing the number of FDs noticeably,
>    it's interesting for other features as well: If we ever want to do FD
>    passing from postmaster to existing backends, we're going to need
>    that anyway.

Maybe; it'd provide another limit on how many backends we could run.

> 3) Replace the postmaster_alive_fds socketpair by some other signalling
>    mechanism. E.g. sending a procsignal to each backend, which sets the
>    latch and a special flag in the latch structure.

And what would send the signal?  The entire point here is to notice the
situation where the postmaster has crashed.  It can *not* depend on the
postmaster taking some action.
        regards, tom lane



Re: Performance degradation in commit ac1d794

From
Andres Freund
Date:
On 2015-12-25 16:29:53 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > There's a couple solutions I can think of to that problem:
> > 1) Use epoll()/kqueue, or other similar interfaces that don't require
> >    re-registering fds at every invocation. My guess is that that'd be
> >    desirable for performance anyway.
> 
> Portability, on the other hand, would be problematic.

Indeed. But we might be able to get away with it because there's
realistically just one platform on which people run four-socket
servers. Obviously we'd leave poll and select support in place.  It'd be
a genuine improvement for less extreme loads on linux, too.

> > 3) Replace the postmaster_alive_fds socketpair by some other signalling
> >    mechanism. E.g. sending a procsignal to each backend, which sets the
> >    latch and a special flag in the latch structure.
> 
> And what would send the signal?  The entire point here is to notice the
> situation where the postmaster has crashed.  It can *not* depend on the
> postmaster taking some action.

Ahem. Um. Look, over there --->

I blame it on all the food.


Andres



Re: Performance degradation in commit ac1d794

From
Andres Freund
Date:
On 2015-12-26 12:22:48 +0100, Andres Freund wrote:
> > > 3) Replace the postmaster_alive_fds socketpair by some other signalling
> > >    mechanism. E.g. sending a procsignal to each backend, which sets the
> > >    latch and a special flag in the latch structure.
> > 
> > And what would send the signal?  The entire point here is to notice the
> > situation where the postmaster has crashed.  It can *not* depend on the
> > postmaster taking some action.
> 
> Ahem. Um. Look, over there --->
> 
> I blame it on all the food.

An unportable and easy version of this, actually making sense this time,
would be to use prctl(PR_SET_PDEATHSIG, SIGQUIT). That'd send SIGQUIT to
backends whenever postmaster dies.  Obviously that's not portable
either - doing this for linux only wouldn't be all that kludgey tho.
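
For illustration, a minimal sketch of the prctl() idea (assumed shape, not an actual patch; error handling omitted, and the getppid() guard assumes init is the reaper):

#include <signal.h>
#include <sys/prctl.h>
#include <unistd.h>

int
main(void)
{
    pid_t       child = fork();

    if (child == 0)
    {
        /* in the child (think: backend forked from postmaster) */
        prctl(PR_SET_PDEATHSIG, SIGQUIT);

        /*
         * Close the race where the parent died between fork() and
         * prctl(): if we were already reparented, quit now.
         */
        if (getppid() == 1)
            raise(SIGQUIT);

        pause();            /* SIGQUIT arrives here when the parent dies */
    }
    return 0;
}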



Re: Performance degradation in commit ac1d794

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> An unportable and easy version of this, actually making sense this time,
> would be to use prctl(PR_SET_PDEATHSIG, SIGQUIT). That'd send SIGQUIT to
> backends whenever postmaster dies.  Obviously that's not portable
> either - doing this for linux only wouldn't be all that kludgey tho.

Hmm.  That would have semantics rather substantially different from
the way that the WL_POSTMASTER_DEATH code behaves.  But I don't know
how much we care about that, since the whole scenario is something
that should not happen under normal circumstances.  Maybe cross-platform
variation is OK as long as it doesn't make the code too hairy.
        regards, tom lane



Re: Performance degradation in commit ac1d794

From
Amit Kapila
Date:
On Sat, Dec 26, 2015 at 5:41 PM, Andres Freund <andres@anarazel.de> wrote:
> An unportable and easy version of this, actually making sense this time,
> would be to use prctl(PR_SET_PDEATHSIG, SIGQUIT). That'd send SIGQUIT to
> backends whenever postmaster dies.  Obviously that's not portable
> either - doing this for linux only wouldn't be all that kludgey tho.


There is a way to make backends exit on Windows as well, by using
Job Objects: set limitFlags to JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE
in JOBOBJECT_BASIC_LIMIT_INFORMATION.
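
For illustration, a rough sketch of that Windows idea (assumed shape, untested here; error handling omitted). The kill-on-close flag lives in the basic limit structure, which is set through the extended information class:

#include <windows.h>

/*
 * Sketch: processes assigned to the job get killed when the last
 * handle to the job is closed (e.g. because the postmaster exited).
 */
static HANDLE
create_kill_on_close_job(void)
{
    HANDLE      job = CreateJobObject(NULL, NULL);
    JOBOBJECT_EXTENDED_LIMIT_INFORMATION info;

    ZeroMemory(&info, sizeof(info));
    info.BasicLimitInformation.LimitFlags = JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE;
    SetInformationJobObject(job, JobObjectExtendedLimitInformation,
                            &info, sizeof(info));

    /* later: AssignProcessToJobObject(job, child_process_handle) */
    return job;
}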




With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Performance degradation in commit ac1d794

From
Andres Freund
Date:
On 2015-12-26 12:22:48 +0100, Andres Freund wrote:
> On 2015-12-25 16:29:53 -0500, Tom Lane wrote:
> > Andres Freund <andres@anarazel.de> writes:
> > > There's a couple solutions I can think of to that problem:
> > > 1) Use epoll()/kqueue, or other similar interfaces that don't require
> > >    re-registering fds at every invocation. My guess is that that'd be
> > >    desirable for performance anyway.
> >
> > Portability, on the other hand, would be problematic.
>
> Indeed. But we might be able to get away with it because there's
> realistically just one platform on which people run four-socket
> servers. Obviously we'd leave poll and select support in place.  It'd be
> a genuine improvement for less extreme loads on linux, too.

I finally got back to working on this. Attached is a WIP patch series
implementing:
0001: Allow to easily choose between the readiness primitives in unix_latch.c
      Pretty helpful for testing, not useful for anything else.
0002: Error out if waiting on socket readiness without a specified socket.
0003: Only clear unix_latch.c's self-pipe if it actually contains data.
      ~2% on high qps workloads
0004: Support using epoll as the polling primitive in unix_latch.c.
      ~3% on high qps workloads, massive scalability improvements (x3)
      on very large machines.

With 0004 obviously being the relevant bit for this thread. I verified
that using epoll addresses the performance problem, using the hardware
the OP noticed the performance problem on.

The reason I went with using epoll over the PR_SET_PDEATHSIG approach is
that it provides semantics that are more similar to the other platforms,
while being just as platform dependent as PR_SET_PDEATHSIG. It also is
actually measurably faster, at least here.

0004 currently contains one debatable optimization, which I'd like to
discuss: Currently the 'sock' passed to WaitLatchOrSocket is not
removed/added to the epoll fd if it's numerically the same as in the
last call. That's good for performance, but would be wrong if the socket
were closed and a new one with the same value were waited on.  I
think a big warning sign somewhere is sufficient to deal with that
problem - it's not something we're likely to start doing. And even if
it's done at some point, we can just offer an API to reset the last used
socket fd.
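
For readers unfamiliar with the difference, a bare-bones sketch of the pattern epoll enables (illustrative only, not the patch itself): the fd is registered once and then waited on repeatedly, instead of being handed to poll() anew on every call.

#include <sys/epoll.h>

static void
wait_loop(int sock)
{
    int         epfd = epoll_create1(0);
    struct epoll_event ev;
    struct epoll_event fired[4];

    ev.events = EPOLLIN;
    ev.data.fd = sock;
    epoll_ctl(epfd, EPOLL_CTL_ADD, sock, &ev); /* register once, not per wait */

    for (;;)
    {
        int         nready = epoll_wait(epfd, fired, 4, -1);
        int         i;

        for (i = 0; i < nready; i++)
        {
            if (fired[i].data.fd == sock && (fired[i].events & EPOLLIN))
            {
                /* socket is readable; handle it, no re-registration needed */
            }
        }
    }
}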


Unless somebody comes up with a platform independent way of addressing
this, I'm inclined to press forward using epoll(). Opinions?

Andres

Attachment

Re: Performance degradation in commit ac1d794

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> 0004 currently contains one debatable optimization, which I'd like to
> discuss: Currently the 'sock' passed to WaitLatchOrSocket is not
> removed/added to the epoll fd if it's numerically the same as in the
> last call. That's good for performance, but would be wrong if the socket
> were closed and a new one with the same value were waited on.  I
> think a big warning sign somewhere is sufficient to deal with that
> problem - it's not something we're likely to start doing. And even if
> it's done at some point, we can just offer an API to reset the last used
> socket fd.

Perhaps a cleaner API solution would be to remove the socket argument per
se from the function altogether, instead providing a separate
SetSocketToWaitOn() call.

(Also, if there is a need for it, we could provide a function that still
takes a socket argument, with the understanding that it's to be used for
short-lived sockets where you don't want to change the process's main
epoll state.)
        regards, tom lane



Re: Performance degradation in commit ac1d794

From
Robert Haas
Date:
On Thu, Jan 14, 2016 at 9:39 AM, Andres Freund <andres@anarazel.de> wrote:
> I finally got back to working on this. Attached is a WIP patch series
> implementing:
> 0001: Allow to easily choose between the readiness primitives in unix_latch.c
>       Pretty helpful for testing, not useful for anything else.

Looks good.

> 0002: Error out if waiting on socket readiness without a specified socket.

Looks good.

> 0003: Only clear unix_latch.c's self-pipe if it actually contains data.
>       ~2% on high qps workloads

everytime -> every time

+        if ((wakeEvents & WL_LATCH_SET) && latch->is_set)
+        {
+            result |= WL_LATCH_SET;
+        }

Excess braces.

Doesn't this code make it possible for the self-pipe to fill up,
self-deadlocking the process?  Suppose we repeatedly enter
WaitLatchOrSocket().  Each time we do, just after waiting = true is
set, somebody sets the latch.  We handle the signal and put a byte
into the pipe.  Returning from the signal handler, we then notice that
is_set is true and return at once, without draining the pipe.  Repeat
until something bad happens.

> 0004: Support using epoll as the polling primitive in unix_latch.c.
>       ~3% on high qps workloads, massive scalability improvements (x3)
>       on very large machines.
>
> With 0004 obviously being the relevant bit for this thread. I verified
> that using epoll addresses the performance problem, using the hardware
> the OP noticed the performance problem on.

+                /*
+                 * Unnecessarily pass data for delete due to bug errorneously
+                 * requiring it in the past.
+                 */

This is pretty vague.  And it has a spelling mistake.

Further down, nodified -> notified.

+        if (wakeEvents != latch->lastmask || latch->lastwatchfd != sock)

I don't like this very much.  I think it's a bad idea to test
latch->lastwatchfd != sock.  That has an excellent chance of letting
people write code that appears to work but then doesn't.  I think it
would be better, if we're going to change the API contract, to make it
a hard break, as I see Tom has also suggested while I've been writing
this.

Incidentally, if we're going to whack around the latch API, it would
be nice to pick a design which wouldn't be too hard to extend to
waiting on multiple sockets.  The application I have in mind is to
send off queries to several foreign servers at once and then wait until
bytes come back from any of them.  It's mostly pie in the sky at this
point, but it seems highly likely to me that we'd want to do such a
thing by waiting for bytes from any of the sockets involved OR a latch
event.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Performance degradation in commit ac1d794

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> Incidentally, if we're going to whack around the latch API, it would
> be nice to pick a design which wouldn't be too hard to extend to
> waiting on multiple sockets.  The application I have in mind is to
> send off queries to several foreign servers at once and then wait until
> bytes come back from any of them.  It's mostly pie in the sky at this
> point, but it seems highly likely to me that we'd want to do such a
> thing by waiting for bytes from any of the sockets involved OR a latch
> event.

Instead of SetSocketToWaitOn, maybe AddSocketToWaitSet and
RemoveSocketFromWaitSet?  And you'd need some way of identifying
which socket came ready after a wait call...
        regards, tom lane



Re: Performance degradation in commit ac1d794

From
Andres Freund
Date:
On 2016-01-14 10:39:55 -0500, Robert Haas wrote:
> Doesn't this code make it possible for the self-pipe to fill up,
> self-deadlocking the process?  Suppose we repeatedly enter
> WaitLatchOrSocket().  Each time we do, just after waiting = true is
> set, somebody sets the latch.  We handle the signal and put a byte
> into the pipe.  Returning from the signal handler, we then notice that
> is_set is true and return at once, without draining the pipe.  Repeat
> until something bad happens.

Should be fine, because the self-pipe is marked as non-blocking:

    if (fcntl(pipefd[1], F_SETFL, O_NONBLOCK) < 0)
        elog(FATAL, "fcntl() failed on write-end of self-pipe: %m");

and sendSelfPipeByte accepts the blocking case as success:

    /*
     * If the pipe is full, we don't need to retry, the data that's there
     * already is enough to wake up WaitLatch.
     */
    if (errno == EAGAIN || errno == EWOULDBLOCK)
        return;
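
For context, a condensed sketch of the general self-pipe pattern (simplified, not the actual unix_latch.c code): both pipe ends are non-blocking, the signal handler writes one byte, and the waiter drains whatever is there once poll() reports readability.

#include <fcntl.h>
#include <poll.h>
#include <signal.h>
#include <unistd.h>

static int  selfpipe[2];

static void
wakeup_handler(int signo)
{
    char        dummy = 0;

    /* non-blocking write: if the pipe is full, a wakeup is already pending */
    (void) write(selfpipe[1], &dummy, 1);
}

static void
wait_for_wakeup(void)
{
    struct pollfd pfd;
    char        buf[64];

    pfd.fd = selfpipe[0];
    pfd.events = POLLIN;

    if (poll(&pfd, 1, -1) > 0 && (pfd.revents & POLLIN))
    {
        /* drain everything: many wakeups collapse into a single one */
        while (read(selfpipe[0], buf, sizeof(buf)) > 0)
            ;
    }
}

/*
 * Setup (error checks omitted):
 *   pipe(selfpipe);
 *   fcntl(selfpipe[0], F_SETFL, O_NONBLOCK);
 *   fcntl(selfpipe[1], F_SETFL, O_NONBLOCK);
 *   signal(SIGUSR1, wakeup_handler);
 */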


> > 0004: Support using epoll as the polling primitive in unix_latch.c.
> >       ~3% on high qps workloads, massive scalability improvements (x3)
> >       on very large machines.
> >
> > With 0004 obviously being the relevant bit for this thread. I verified
> > that using epoll addresses the performance problem, using the hardware
> > the OP noticed the performance problem on.
> 
> +                /*
> +                 * Unnecessarily pass data for delete due to bug errorneously
> +                 * requiring it in the past.
> +                 */
> 
> This is pretty vague.  And it has a spelling mistake.

Will add a reference to the manpage (where that requirement is coming
from).

> Further down, nodified -> notified.
> 
> +        if (wakeEvents != latch->lastmask || latch->lastwatchfd != sock)
> 
> I don't like this very much.

Yea, me neither, which is why I called it out... I think it's not too
likely to cause problems in practice though. But I think changing the
API makes sense, so the likelihood shouldn't be a relevant issue.

> Incidentally, if we're going to whack around the latch API, it would
> be nice to pick a design which wouldn't be too hard to extend to
> waiting on multiple sockets.

Hm. That seems likely to make usage harder for users of the API. So it
seems like it'd make sense to provide a simpler version anyway, for the
majority of users.

So, I'm wondering how we'd exactly use a hypothetical
SetSocketToWaitOn, or SetSocketsToWaitOn (or whatever). I mean it can
make a fair bit of sense to sometimes wait on MyLatch/port->sock and
sometimes on MyLatch/fdw connections. The simple proposed code would
change the epoll set whenever switching between both, but with
SetSocketsToWaitOn you'd probably end up switching this much more often?

One way to address that would be to create a 'latch wait' datastructure,
that'd then contain the epoll fd/win32 wait events/... That way you
could have one 'LatchWait' for latch + client socket and one for latch +
fdw sockets.

Greetings,

Andres Freund



Re: Performance degradation in commit ac1d794

From
Robert Haas
Date:
On Thu, Jan 14, 2016 at 10:46 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> Incidentally, if we're going to whack around the latch API, it would
>> be nice to pick a design which wouldn't be too hard to extend to
>> waiting on multiple sockets.  The application I have in mind is to
>> send off queries to several foreign servers at once and then wait until
>> bytes come back from any of them.  It's mostly pie in the sky at this
>> point, but it seems highly likely to me that we'd want to do such a
>> thing by waiting for bytes from any of the sockets involved OR a latch
>> event.
>
> Instead of SetSocketToWaitOn, maybe AddSocketToWaitSet and
> RemoveSocketFromWaitSet?  And you'd need some way of identifying
> which socket came ready after a wait call...

Yeah.  Although I think for now it would be fine to just error out if
somebody tries to add a socket and there already is one.  Then we
could lift that limitation in a later commit.  Of course if Andres
wants to do the whole thing now I'm not going to get in the way, but
since that will require Windows tinkering and so on it may be more
than he wants to dive into.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Performance degradation in commit ac1d794

From
Andres Freund
Date:
On January 14, 2016 5:16:59 PM GMT+01:00, Robert Haas <robertmhaas@gmail.com> wrote:
>Yeah.  Although I think for now it would be fine to just error out if
>somebody tries to add a socket and there already is one.  Then we
>could lift that limitation in a later commit.  Of course if Andres
>wants to do the whole thing now I'm not going to get in the way, but
>since that will require Windows tinkering and so on it may be more
>than he wants to dive into.

Yea, I don't want to do anything really large at the moment. My primary interest is fixing the major performance
regression.

Andres


--- 
Please excuse brevity and formatting - I am writing this on my mobile phone.



Re: Performance degradation in commit ac1d794

From
Robert Haas
Date:
On Thu, Jan 14, 2016 at 10:56 AM, Andres Freund <andres@anarazel.de> wrote:
> So, I'm wondering how we'd exactly use a hypothetical
> SetSocketToWaitOn, or SetSocketsToWaitOn (or whatever). I mean it can
> make a fair bit of sense to sometimes wait on MyLatch/port->sock and
> sometimes on MyLatch/fdw connections. The simple proposed code would
> change the epoll set whenever switching between both, but with
> SetSocketsToWaitOn you'd probably end up switching this much more often?
>
> One way to address that would be to create a 'latch wait' datastructure,
> that'd then contain the epoll fd/win32 wait events/... That way you
> could have one 'LatchWait' for latch + client socket and one for latch +
> fdw sockets.

I see your point.  As far as I can see, it's currently true that,
right now, the only places where we wait for a socket are places where
the socket will live for the lifetime of the backend, but I think we
should regard it as likely that, in the future, we'll want to use it
anywhere we want to wait for a socket to become ready.  There are
getting to be a lot of places where we need to unstick some loop
whenever the process latch gets set, and it seems likely to me that
needs will only continue to grow.  So the API should probably
contemplate that sort of need.

I think your idea of a data structure that encapsulates a set of events
for which to wait is probably a good one.  WaitLatch doesn't seem like
a great name.  Maybe WaitEventSet, and then we can have
WaitLatch(&latch) and WaitEvents(&eventset).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Performance degradation in commit ac1d794

From
Andres Freund
Date:
On 2016-01-14 11:31:03 -0500, Robert Haas wrote:
> On Thu, Jan 14, 2016 at 10:56 AM, Andres Freund <andres@anarazel.de> wrote:
> I think your idea of a data structure that encapsulates a set of events
> for which to wait is probably a good one.  WaitLatch doesn't seem like
> a great name.  Maybe WaitEventSet, and then we can have
> WaitLatch(&latch) and WaitEvents(&eventset).

Hm, I'd like to have latch in the name. It seems far from improbable to
have another wait data structure. LatchEventSet maybe? The wait would be
implied by WaitLatch.

So effectively we'd create a LatchEventSet feLatchSet; somewhere global
(and update it from a backend local to the proc latch in
SwitchToSharedLatch/SwitchBackToLocalLatch()). Then change all WaitLatch
calls to refer to those.

Do we want to provide a backward compatible API for all this? I'm fine
either way.

Andres



Re: Performance degradation in commit ac1d794

From
Robert Haas
Date:
On Thu, Jan 14, 2016 at 12:06 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-01-14 11:31:03 -0500, Robert Haas wrote:
>> On Thu, Jan 14, 2016 at 10:56 AM, Andres Freund <andres@anarazel.de> wrote:
>> I think your idea of a data structure that encapsulates a set of events
>> for which to wait is probably a good one.  WaitLatch doesn't seem like
>> a great name.  Maybe WaitEventSet, and then we can have
>> WaitLatch(&latch) and WaitEvents(&eventset).
>
> Hm, I'd like to have latch in the name. It seems far from improbable to
> have another wait data structure. LatchEventSet maybe? The wait would be
> implied by WaitLatch.

I can live with that.

> So effectively we'd create a LatchEventSet feLatchSet; somewhere global
> (and update it from a backend local to the proc latch in
> SwitchToSharedLatch/SwitchBackToLocalLatch()). Then change all WaitLatch
> calls to refer to those.

Sure.

> Do we want to provide a backward compatible API for all this? I'm fine
> either way.

How would that work?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Performance degradation in commit ac1d794

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Jan 14, 2016 at 12:06 PM, Andres Freund <andres@anarazel.de> wrote:
>> Do we want to provide a backward compatible API for all this? I'm fine
>> either way.

> How would that work?

I see no great need to be backwards-compatible on this, especially if it
would complicate matters at all.  I doubt there's a lot of third-party
code using WaitLatch right now.  Just make sure there's an obvious
compile failure for anyone who is.
        regards, tom lane



Re: Performance degradation in commit ac1d794

From
Andres Freund
Date:
On 2016-01-14 12:07:23 -0500, Robert Haas wrote:
> > Do we want to provide a backward compatible API for all this? I'm fine
> > either way.
>
> How would that work?

I'm thinking of something like;

int WaitOnLatchSet(LatchEventSet *set, int wakeEvents, long timeout);

int
WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock, long timeout)
{
        LatchEventSet set;

        LatchEventSetInit(&set, latch);

        if (sock != PGINVALID_SOCKET)
           LatchEventSetAddSock(&set, sock);

        return WaitOnLatchSet(set, wakeEvents, timeout);
}

I think we'll need to continue having wakeEvents and timeout parameters
for WaitOnLatchSet, we quite frequently want to wait socket
readability/writability, not wait on the socket, or have/not have
timeouts.




Re: Performance degradation in commit ac1d794

From
Andres Freund
Date:
On 2016-01-14 18:14:21 +0100, Andres Freund wrote:
> I'm thinking of something like;
> 
> int WaitOnLatchSet(LatchEventSet *set, int wakeEvents, long timeout);
> 
> int
> WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,long timeout)
> {
>         LatchEventSet set;
> 
>         LatchEventSetInit(&set, latch);
> 
>         if (sock != PGINVALID_SOCKET)
>            LatchEventSetAddSock(&set, sock);
> 
>         return WaitOnLatchSet(set, wakeEvents, timeout);
> }
> 
> I think we'll need to continue having wakeEvents and timeout parameters
> for WaitOnLatchSet, we quite frequently want to wait socket
> readability/writability, not wait on the socket, or have/not have
> timeouts.

This brings me to something related: I'm wondering if we shouldn't merge
unix/win32_latch.c. If we go this route it seems like the amount of
shared infrastructure will further increase. The difference between
win32 and, say, the select code isn't much bigger than the difference
between select/poll. epoll/win32 are probably more similar than that
actually.

Andres



Re: Performance degradation in commit ac1d794

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> This brings me to something related: I'm wondering if we shouldn't merge
> unix/win32_latch.c.

Well, it's duplicated code on the one hand versus maze-of-ifdefs on the
other.  Feel free to try it and see, but I'm unsure it'd be an improvement.
        regards, tom lane



Re: Performance degradation in commit ac1d794

From
Andres Freund
Date:
On 2016-01-14 18:14:21 +0100, Andres Freund wrote:
> On 2016-01-14 12:07:23 -0500, Robert Haas wrote:
> > > Do we want to provide a backward compatible API for all this? I'm fine
> > > either way.
> >
> > How would that work?
> 
> I'm thinking of something like;
> 
> int WaitOnLatchSet(LatchEventSet *set, int wakeEvents, long timeout);
> 
> int
> WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,long timeout)
> {
>         LatchEventSet set;
> 
>         LatchEventSetInit(&set, latch);
> 
>         if (sock != PGINVALID_SOCKET)
>            LatchEventSetAddSock(&set, sock);
> 
>         return WaitOnLatchSet(set, wakeEvents, timeout);
> }
> 
> I think we'll need to continue having wakeEvents and timeout parameters
> for WaitOnLatchSet, we quite frequently want to wait socket
> readability/writability, not wait on the socket, or have/not have
> timeouts.

Hm. If we really want to support multiple sockets at some point the
above WaitOnLatchSet signature isn't going to fly, because it won't
support figuring out which fd the event triggered on.

So it seems we'd need to return something like

struct LatchEvent
{
        enum LatchEventType {WL_LATCH_EVENT; WL_TIMEOUT; WL_SOCKET_EVENT; WL_POSTMASTER_EVENT; ...} event_type;
        int mask;
        pgsocket event_sock;
};

that'd also allow us to extend this to return multiple events if we want
that at some point. Alternatively we could add a pgsocket* argument, but
that doesn't really seem much better.

Not super happy about the above proposal.

Andres



Re: Performance degradation in commit ac1d794

From
Robert Haas
Date:
On Thu, Jan 14, 2016 at 12:14 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-01-14 12:07:23 -0500, Robert Haas wrote:
>> > Do we want to provide a backward compatible API for all this? I'm fine
>> > either way.
>>
>> How would that work?
>
> I'm thinking of something like;
>
> int WaitOnLatchSet(LatchEventSet *set, int wakeEvents, long timeout);
>
> int
> WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,long timeout)
> {
>         LatchEventSet set;
>
>         LatchEventSetInit(&set, latch);
>
>         if (sock != PGINVALID_SOCKET)
>            LatchEventSetAddSock(&set, sock);
>
>         return WaitOnLatchSet(set, wakeEvents, timeout);
> }
>
> I think we'll need to continue having wakeEvents and timeout parameters
> for WaitOnLatchSet, we quite frequently want to wait socket
> readability/writability, not wait on the socket, or have/not have
> timeouts.

Well, if we ever wanted to support multiple FDs, we'd need the
readability/writeability thing to be per-fd, not per-set.

Overall, if this is what you have in mind for backward compatibility,
I rate it M for Meh.  Let's just break compatibility and people will
have to update their code.  That shouldn't be hard, and if we don't
make people do it when we make the change, then we'll be stuck with
the backward-compatibility interface for a decade.  I doubt it's worth
it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Performance degradation in commit ac1d794

From
Kyotaro HORIGUCHI
Date:
Hello, I am one of those who want to wait on many sockets at once.

At Thu, 14 Jan 2016 18:55:51 +0100, Andres Freund <andres@anarazel.de> wrote in
<20160114175551.GM10941@awork2.anarazel.de>
> On 2016-01-14 18:14:21 +0100, Andres Freund wrote:
> > On 2016-01-14 12:07:23 -0500, Robert Haas wrote:
> > > > Do we want to provide a backward compatible API for all this? I'm fine
> > > > either way.
> > >
> > > How would that work?
> > 
> > I'm thinking of something like;
> > 
> > int WaitOnLatchSet(LatchEventSet *set, int wakeEvents, long timeout);
> > 
> > int
> > WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,long timeout)
> > {
> >         LatchEventSet set;
> > 
> >         LatchEventSetInit(&set, latch);
> > 
> >         if (sock != PGINVALID_SOCKET)
> >            LatchEventSetAddSock(&set, sock);
> > 
> >         return WaitOnLatchSet(set, wakeEvents, timeout);
> > }
> > 
> > I think we'll need to continue having wakeEvents and timeout parameters
> > for WaitOnLatchSet, we quite frequently want to wait socket
> > readability/writability, not wait on the socket, or have/not have
> > timeouts.
> 
> Hm. If we really want to support multiple sockets at some point the
> above WaitOnLatchSet signature isn't going to fly, because it won't
> support figuring out which fd the event triggered on.
> 
> So it seems we'd need to return something like
> struct LatchEvent
> {
>         enum LatchEventType {WL_LATCH_EVENT;WL_TIMEOUT;WL_SOCKET_EVENT; WL_POSTMASTER_EVENT;...} event_type;
>         int mask;
>         pgsocket event_sock;
> };
> 
> that'd also allow to extend this to return multiple events if we want
> that at some point. Alternatively we could add a pgsocket* argument, but
> that doesn't really seem much better.
> 
> Not super happy about the above proposal.

How about allowing registration of a callback for every waiting
socket? The signature of the callback function would be like:

enum LATCH_CALLBACK_STATE
LatchWaitCallback(pgsocket event_sock, enum LatchEventType, int mask?, void *bogus);

It can return, for instance, LCB_CONTINUE, LCB_BREAK or
LCB_IMMEDBREAK, and if any one of them returns LCB_BREAK, the wait will
break after, perhaps, calling all callbacks for fired events.

We could have predefined callbacks for every event which do nothing
but set a corresponding flag and return LCB_BREAK.

/* Waiting set has been constructed so far */
if (!WaitOnLatchSet(&set(?)))
  error();

if (is_sock_readable[sockid]) {}  /* is_... would be global */
if (is_sock_writable[sockid]) {}  /* is_... would be global */
/* Any other types of trigger would be processed elsewhere */


Although it might be slow if we have an enormous number of
sockets fired at once, I suppose it returns only for a few
sockets in most cases.

I don't see the big picture of the whole latch/signalling
mechanism with this, but callbacks might be usable for signalling
on many sockets.


regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: Performance degradation in commit ac1d794

From
Robert Haas
Date:
On Thu, Jan 14, 2016 at 12:28 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andres Freund <andres@anarazel.de> writes:
>> This brings me to something related: I'm wondering if we shouldn't merge
>> unix/win32_latch.c.
>
> Well, it's duplicated code on the one hand versus maze-of-ifdefs on the
> other.  Feel free to try it and see, but I'm unsure it'd be an improvement.

I think we should either get this fixed RSN or revert the problematic
commit until we get it fixed.  I'd be rather disappointed about the
latter because I think this was a very good thing on the merits, but
probably not good enough to justify taking the performance hit over
the long term.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Performance degradation in commit ac1d794

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> I think we should either get this fixed RSN or revert the problematic
> commit until we get it fixed.  I'd be rather disappointed about the
> latter because I think this was a very good thing on the merits, but
> probably not good enough to justify taking the performance hit over
> the long term.

Since it's only in HEAD, I'm not seeing the urgency of reverting it.
However, it'd be a good idea to put this on the 9.6 open items list
(have we got such a page yet?) to make sure it gets addressed before
beta.
        regards, tom lane



Re: Performance degradation in commit ac1d794

From
Robert Haas
Date:
On Thu, Feb 11, 2016 at 12:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> I think we should either get this fixed RSN or revert the problematic
>> commit until we get it fixed.  I'd be rather disappointed about the
>> latter because I think this was a very good thing on the merits, but
>> probably not good enough to justify taking the performance hit over
>> the long term.
>
> Since it's only in HEAD, I'm not seeing the urgency of reverting it.
> However, it'd be a good idea to put this on the 9.6 open items list
> (have we got such a page yet?) to make sure it gets addressed before
> beta.

One problem is that it makes for misleading results if you try to
benchmark 9.5 against 9.6.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Performance degradation in commit ac1d794

From
Andres Freund
Date:
On 2016-02-11 12:50:58 -0500, Robert Haas wrote:
> On Thu, Feb 11, 2016 at 12:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Robert Haas <robertmhaas@gmail.com> writes:
> >> I think we should either get this fixed RSN or revert the problematic
> >> commit until we get it fixed.  I'd be rather disappointed about the
> >> latter because I think this was a very good thing on the merits, but
> >> probably not good enough to justify taking the performance hit over
> >> the long term.
> >
> > Since it's only in HEAD, I'm not seeing the urgency of reverting it.
> > However, it'd be a good idea to put this on the 9.6 open items list
> > (have we got such a page yet?) to make sure it gets addressed before
> > beta.
> 
> One problem is that it makes for misleading results if you try to
> benchmark 9.5 against 9.6.

You need a really beefy box to show the problem. On a large/new 2 socket
machine the performance regression in in the 1-3% range for a pgbench of
SELECT 1. So it's not like it's immediately showing up for everyone.

Putting it on the open items list sounds good to me.

Regards,

Andres



Re: Performance degradation in commit ac1d794

From
Robert Haas
Date:
On Thu, Feb 11, 2016 at 12:53 PM, Andres Freund <andres@anarazel.de> wrote:
>> One problem is that it makes for misleading results if you try to
>> benchmark 9.5 against 9.6.
>
> You need a really beefy box to show the problem. On a large/new 2-socket
> machine the performance regression is in the 1-3% range for a pgbench of
> SELECT 1. So it's not like it's immediately showing up for everyone.
>
> Putting it on the open items list sounds good to me.

Well, OK, I've done that then.  I don't really agree that it's not a
problem; the OP said he saw a 3x regression, and some of my colleagues
doing benchmarking are complaining about this commit, too.  It doesn't
seem like much of a stretch to think that it might be affecting other
people as well.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Performance degradation in commit ac1d794

From
Andres Freund
Date:
On 2016-02-11 13:09:27 -0500, Robert Haas wrote:
> On Thu, Feb 11, 2016 at 12:53 PM, Andres Freund <andres@anarazel.de> wrote:
> >> One problem is that it makes for misleading results if you try to
> >> benchmark 9.5 against 9.6.
> >
> > You need a really beefy box to show the problem. On a large/new 2-socket
> > machine the performance regression is in the 1-3% range for a pgbench of
> > SELECT 1. So it's not like it's immediately showing up for everyone.
> >
> > Putting it on the open items list sounds good to me.
> 
> Well, OK, I've done that then.  I don't really agree that it's not a
> problem; the OP said he saw a 3x regression, and some of my colleagues
> doing benchmarking are complaining about this commit, too.  It doesn't
> seem like much of a stretch to think that it might be affecting other
> people as well.

Well, I can't do anything about that right now. I won't have the time to
whip up the new/more complex API we discussed upthread in the next few
days.  So either we go with a simpler API (e.g. pretty much a cleaned up
version of my earlier patch), revert the postmaster death check, or
somebody else has to take lead in renovating, or we wait...

Andres



Re: Performance degradation in commit ac1d794

From
Robert Haas
Date:
On Thu, Feb 11, 2016 at 1:19 PM, Andres Freund <andres@anarazel.de> wrote:
>> > Putting it on the open items list sounds good to me.
>>
>> Well, OK, I've done that then.  I don't really agree that it's not a
>> problem; the OP said he saw a 3x regression, and some of my colleagues
>> doing benchmarking are complaining about this commit, too.  It doesn't
>> seem like much of a stretch to think that it might be affecting other
>> people as well.
>
> Well, I can't do anything about that right now. I won't have the time to
> whip up the new/more complex API we discussed upthread in the next few
> days.  So either we go with a simpler API (e.g. pretty much a cleaned up
> version of my earlier patch), revert the postmaster death check, or
> somebody else has to take lead in renovating, or we wait...

Well, I thought we could just revert the patch until you had time to
deal with it, and then put it back in.  That seemed like a simple and
practical option from here, and I don't think I quite understand why
you and Tom don't like it.  I don't have a problem with deferring to
the majority will here, but I would sort of like to understand the
reason for the majority will.

BTW, if need be, I can look for an EnterpriseDB resource to work on
this.  It won't likely be me, though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Performance degradation in commit ac1d794

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Feb 11, 2016 at 1:19 PM, Andres Freund <andres@anarazel.de> wrote:
>> Well, I can't do anything about that right now. I won't have the time to
>> whip up the new/more complex API we discussed upthread in the next few
>> days.  So either we go with a simpler API (e.g. pretty much a cleaned up
>> version of my earlier patch), revert the postmaster death check, or
>> somebody else has to take lead in renovating, or we wait...

> Well, I thought we could just revert the patch until you had time to
> deal with it, and then put it back in.  That seemed like a simple and
> practical option from here, and I don't think I quite understand why
> you and Tom don't like it.

Don't particularly want the git history churn, if we expect that the
patch will ship as-committed in 9.6.  If it becomes clear that the
performance fix is unlikely to happen, we can revert then.

If the performance change were an issue for a lot of testing, I'd agree
with a temporary revert, but I concur with Andres that it's not blocking
much.  Anybody who does have an issue there can revert locally, no?
        regards, tom lane



Re: Performance degradation in commit ac1d794

From
Robert Haas
Date:
On Thu, Feb 11, 2016 at 1:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Thu, Feb 11, 2016 at 1:19 PM, Andres Freund <andres@anarazel.de> wrote:
>>> Well, I can't do anything about that right now. I won't have the time to
>>> whip up the new/more complex API we discussed upthread in the next few
>>> days.  So either we go with a simpler API (e.g. pretty much a cleaned up
>>> version of my earlier patch), revert the postmaster death check, or
>>> somebody else has to take lead in renovating, or we wait...
>
>> Well, I thought we could just revert the patch until you had time to
>> deal with it, and then put it back in.  That seemed like a simple and
>> practical option from here, and I don't think I quite understand why
>> you and Tom don't like it.
>
> Don't particularly want the git history churn, if we expect that the
> patch will ship as-committed in 9.6.  If it becomes clear that the
> performance fix is unlikely to happen, we can revert then.
>
> If the performance change were an issue for a lot of testing, I'd agree
> with a temporary revert, but I concur with Andres that it's not blocking
> much.  Anybody who does have an issue there can revert locally, no?

True.  Maybe we'll just have to start doing that for EnterpriseDB
benchmarking as standard practice.  Not sure everybody who is
benchmarking will realize the issue though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



[PoC] WaitLatchOrSocketMulti (Re: Performance degradation in commit ac1d794)

From
Kyotaro HORIGUCHI
Date:
Hello, I don't see how ac1d794 will be dealt with, but I tried an
example implementation of a multi-socket version of WaitLatchOrSocket
using callbacks, on top of the current master where ac1d794 has
not been reverted yet.

At Thu, 14 Jan 2016 13:46:44 -0500, Robert Haas <robertmhaas@gmail.com> wrote in
<CA+TgmoYBa8TJRGS07JCSLKpqGkrRd5hLpirvwp36s=83ChmQDA@mail.gmail.com>
> On Thu, Jan 14, 2016 at 12:14 PM, Andres Freund <andres@anarazel.de> wrote:
> > On 2016-01-14 12:07:23 -0500, Robert Haas wrote:
> >> > Do we want to provide a backward compatible API for all this? I'm fine
> >> > either way.
> >>
> >> How would that work?
> >
> > I'm thinking of something like;
> >
> > int WaitOnLatchSet(LatchEventSet *set, int wakeEvents, long timeout);
> >
> > int
> > WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,long timeout)
> > {
> >         LatchEventSet set;
> >
> >         LatchEventSetInit(&set, latch);
> >
> >         if (sock != PGINVALID_SOCKET)
> >            LatchEventSetAddSock(&set, sock);
> >
> >         return WaitOnLatchSet(set, wakeEvents, timeout);
> > }
> >
> > I think we'll need to continue having wakeEvents and timeout parameters
> > for WaitOnLatchSet; we quite frequently want to wait for socket
> > readability/writability, not wait on the socket, or have/not have
> > timeouts.
> 
> Well, if we ever wanted to support multiple FDs, we'd need the
> readability/writeability thing to be per-fd, not per-set.
> 
> Overall, if this is what you have in mind for backward compatibility,
> I rate it M for Meh.  Let's just break compatibility and people will
> have to update their code.  That shouldn't be hard, and if we don't
> make people do it when we make the change, then we'll be stuck with
> the backward-compatibility interface for a decade.  I doubt it's worth
> it.

The API is similar to what Robert suggested, but differs because his
version would be a bit too complicated for most cases. So this example
implementation takes an intermediate style between the current API and
Robert's suggestion, using callbacks as I proposed.

int WaitLatchOrSocketMulti(pgwaitobject *wobjs, int nobjs, long timeout);

This is implemented only for poll, not for select.

A sample usage is seen in secure_read().

>     pgwaitobject objs[3];
...
>     InitWaitLatch(objs[0], MyLatch);
>     InitWaitPostmasterDeath(objs[1]);
>     InitWaitSocket(objs[2], port->sock, waitfor);
> 
>     w = WaitLatchOrSocketMulti(objs, 3, 0);
> //    w = WaitLatchOrSocket(MyLatch,
> //                WL_LATCH_SET | WL_POSTMASTER_DEATH | waitfor,
> //                port->sock, 0);


The core of the function looks like the following. It runs
callbacks for every fired event.

>    rc = poll(pfds, nfds, (int) cur_timeout);
...
>    if (rc < 0)
...
>    else
>    {
>      for (i = 0 ; i < nfds ; i++)
>      {
>        wobjs[i].retEvents = 0;
>        if (pfds[i].revents && wobjs[i].cb)
>          result |= wobjs[i].cb(&wobjs[i], pfds[i].revents);
>
>        if (result & WL_IMMEDIATELY_BREAK)
>          break;
>      }
>    }

In the above part, poll()'s events are passed to the callbacks, so
the callbacks may need a different implementation for select().

Sockets can have a callback as well; the initializer could look like
the following.

> InitWaitSocketCB(wait_obj, sock, event, your_callback);


If we want to keep the waiting-object array independent of specific
functions, to achieve asynchronous handling of socket events, that
could be realised by providing a set of wrapper functions, exactly as
Robert said above.
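
For instance, a compatibility wrapper over WaitLatchOrSocketMulti()
could look like this (a sketch reusing the names from the example
implementation above; this wrapper is not part of the posted patch):

    static int
    WaitLatchOrSocketCompat(volatile Latch *latch, int wakeEvents,
                            pgsocket sock, long timeout)
    {
        pgwaitobject objs[3];
        int          nobjs = 0;

        if (wakeEvents & WL_LATCH_SET)
        {
            InitWaitLatch(objs[nobjs], latch);
            nobjs++;
        }
        if (wakeEvents & WL_POSTMASTER_DEATH)
        {
            InitWaitPostmasterDeath(objs[nobjs]);
            nobjs++;
        }
        if (sock != PGINVALID_SOCKET)
        {
            InitWaitSocket(objs[nobjs], sock, wakeEvents);
            nobjs++;
        }

        return WaitLatchOrSocketMulti(objs, nobjs, timeout);
    }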

Does this make sense?
Does anyone have any opinions or thoughts?

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Performance degradation in commit ac1d794

From
Andres Freund
Date:
On 2016-02-08 17:49:18 +0900, Kyotaro HORIGUCHI wrote:
> How about allowing registration of a callback for every waiting
> socket. The signature of the callback function would be like

I don't think a callback-based API is going to serve us well. Most of
the current latch callers would get noticeably more complex that
way. And a number of them will benefit from latches using epoll
internally.

Greetings,

Andres Freund



Re: Performance degradation in commit ac1d794

From
Andres Freund
Date:
On 2016-01-14 12:07:23 -0500, Robert Haas wrote:
> On Thu, Jan 14, 2016 at 12:06 PM, Andres Freund <andres@anarazel.de> wrote:
> > On 2016-01-14 11:31:03 -0500, Robert Haas wrote:
> >> On Thu, Jan 14, 2016 at 10:56 AM, Andres Freund <andres@anarazel.de> wrote:
> >> I think your idea of a data structure the encapsulates a set of events
> >> for which to wait is probably a good one.  WaitLatch doesn't seem like
> >> a great name.  Maybe WaitEventSet, and then we can have
> >> WaitLatch(&latch) and WaitEvents(&eventset).
> >
> Hm, I'd like to have latch in the name. It seems far from improbable to
> > have another wait data structure. LatchEventSet maybe? The wait would be
> > implied by WaitLatch.
> 
> I can live with that.


How about the following sketch of an API

typedef struct LatchEvent
{
        uint32 events; /* interesting events */
        uint32 revents;  /* returned events */
        int fd; /* fd associated with event */
} LatchEvent;

typedef struct LatchEventSet
{
        int nevents;
        LatchEvent events[FLEXIBLE_ARRAY_MEMBER];
} LatchEventSet;

/*
 * Create a LatchEventSet with space for nevents different events to wait for.
 *
 * latch may be NULL.
 */
extern LatchEventSet *CreateLatchEventSet(int nevents, Latch *latch);

/* ---
 * Add an event to the set. Possible events are:
 * - WL_LATCH_SET: Wait for the latch to be set
 * - WL_SOCKET_READABLE: Wait for socket to become readable,
 *   can be combined in one event with WL_SOCKET_WRITEABLE
 * - WL_SOCKET_WRITABLE: Wait for socket to become writeable,
 *   can be combined with WL_SOCKET_READABLE
 * - WL_POSTMASTER_DEATH: Wait for postmaster to die
 */
extern void AddLatchEventToSet(LatchEventSet *set, uint32 events, int fd);

/*
 * Wait for any events added to the set to happen, or until the timeout is
 * reached.
 *
 * The return value is the union of all the events that were detected. This
 * allows us to avoid having to look into the associated events[i].revents
 * fields.
 */
extern uint32 WaitLatchEventSet(LatchEventSet *set, long timeout);



I've two questions:

- Is there any benefit of being able to wait for more than one latch?
  I'm inclined to not allow that for now, that'd make the patch bigger,
  and I don't see a use-case right now.

- Given current users we don't need a large amount of events, so having
  to iterate through the registered events doesn't seem bothersome. We
  could however change the api to be something like

  int WaitLatchEventSet(LatchEventSet *set, OccurredEvents *, int nevents, long timeout);

  which would return the number of events that happened, and would
  basically "fill" one of the (usually stack allocated) OccurredEvents
  structures with what happened.
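
For the second variant, usage could look like this (a sketch only;
OccurredEvents' exact fields are an assumption, and set/timeout are
assumed to be in scope):

    OccurredEvents occurred[4];
    int            nevents;

    nevents = WaitLatchEventSet(set, occurred, lengthof(occurred), timeout);
    for (int i = 0; i < nevents; i++)
    {
        if (occurred[i].events & WL_SOCKET_READABLE)
            handle_read(occurred[i].fd);    /* hypothetical handler */
    }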
 


Comments?



Re: Performance degradation in commit ac1d794

From
Robert Haas
Date:
On Wed, Mar 16, 2016 at 2:52 PM, Andres Freund <andres@anarazel.de> wrote:
> How about the following sketch of an API
>
> typedef struct LatchEvent
> {
>         uint32 events; /* interesting events */
>         uint32 revents;  /* returned events */
>         int fd; /* fd associated with event */
> } LatchEvent;
>
> typedef struct LatchEventSet
> {
>         int nevents;
>         LatchEvent events[FLEXIBLE_ARRAY_MEMBER];
> } LatchEventSet;
>
> /*
>  * Create a LatchEventSet with space for nevents different events to wait for.
>  *
>  * latch may be NULL.
>  */
> extern LatchEventSet *CreateLatchEventSet(int nevents, Latch *latch);

We might be able to rejigger this so that it didn't require palloc, if
we got rid of FLEXIBLE_ARRAY_MEMBER and passed int nevents and
LatchEvent * separately to WaitLatchThingy().  But I guess maybe this
will be infrequent enough not to matter.

> I've two questions:
> - Is there any benefit of being able to wait for more than one latch?
>   I'm inclined to not allow that for now, that'd make the patch bigger,
>   and I don't see a use-case right now.

I don't see a use case, either.

> - Given current users we don't need a large amount of events, so having
>   to iterate through the registered events doesn't seem bothersome. We
>   could however change the api to be something like
>
>   int WaitLatchEventSet(LatchEventSet *set, OccurredEvents *, int nevents, long timeout);
>
>   which would return the number of events that happened, and would
>   basically "fill" one of the (usually stack allocated) OccurredEvent
>   structures with what happened.

I definitely think something along these lines is useful.  I want to
be able to have an Append node with 100 ForeignScans under it and kick
off all the scans asynchronously and wait for all of the FDs at once.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Performance degradation in commit ac1d794

From
Andres Freund
Date:
On 2016-03-16 15:08:07 -0400, Robert Haas wrote:
> On Wed, Mar 16, 2016 at 2:52 PM, Andres Freund <andres@anarazel.de> wrote:
> > How about the following sketch of an API
> >
> > typedef struct LatchEvent
> > {
> >         uint32 events; /* interesting events */
> >         uint32 revents;  /* returned events */
> >         int fd; /* fd associated with event */
> > } LatchEvent;
> >
> > typedef struct LatchEventSet
> > {
> >         int nevents;
> >         LatchEvent events[FLEXIBLE_ARRAY_MEMBER];
> > } LatchEventSet;
> >
> > /*
> >  * Create a LatchEventSet with space for nevents different events to wait for.
> >  *
> >  * latch may be NULL.
> >  */
> > extern LatchEventSet *CreateLatchEventSet(int nevents, Latch *latch);
> 
> We might be able to rejigger this so that it didn't require palloc, if
> we got rid of FLEXIBLE_ARRAY_MEMBER and passed int nevents and
> LatchEvent * separately to WaitLatchThingy().  But I guess maybe this
> will be infrequent enough not to matter.

I think we'll basically end up allocating them once for the frequent
callsites.


> > - Given current users we don't need a large amount of events, so having
> >   to iterate through the registered events doesn't seem bothersome. We
> >   could however change the api to be something like
> >
> >   int WaitLatchEventSet(LatchEventSet *set, OccurredEvents *, int nevents, long timeout);
> >
> >   which would return the number of events that happened, and would
> >   basically "fill" one of the (usually stack allocated) OccurredEvent
> >   structures with what happened.
> 
> I definitely think something along these lines is useful.  I want to
> be able to have an Append node with 100 ForeignScans under it and kick
> off all the scans asynchronously and wait for all of the FDs at once.

So you'd like to get only an event for the FD with data back? Or are you
ok with iterating through hundred elements in an array, to see which are
ready?

Andres



Re: Performance degradation in commit ac1d794

From
Robert Haas
Date:
On Wed, Mar 16, 2016 at 3:25 PM, Andres Freund <andres@anarazel.de> wrote:
>> > - Given current users we don't need a large amount of events, so having
>> >   to iterate through the registered events doesn't seem bothersome. We
>> >   could however change the api to be something like
>> >
>> >   int WaitLatchEventSet(LatchEventSet *set, OccurredEvents *, int nevents, long timeout);
>> >
>> >   which would return the number of events that happened, and would
>> >   basically "fill" one of the (usually stack allocated) OccurredEvent
>> >   structures with what happened.
>>
>> I definitely think something along these lines is useful.  I want to
>> be able to have an Append node with 100 ForeignScans under it and kick
>> off all the scans asynchronously and wait for all of the FDs at once.
>
> So you'd like to get only an event for the FD with data back? Or are you
> ok with iterating through hundred elements in an array, to see which are
> ready?

I'd like to get an event back for the FD with data.  Iterating sounds
like it could be really slow.  Say you get lots of little packets back
from the same connection, while the others are idle.  Now you've got
to keep iterating through them all over and over again.  Blech.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Performance degradation in commit ac1d794

From
Andres Freund
Date:
On 2016-03-16 15:41:28 -0400, Robert Haas wrote:
> On Wed, Mar 16, 2016 at 3:25 PM, Andres Freund <andres@anarazel.de> wrote:
> >> > - Given current users we don't need a large amount of events, so having
> >> >   to iterate through the registered events doesn't seem bothersome. We
> >> >   could however change the api to be something like
> >> >
> >> >   int WaitLatchEventSet(LatchEventSet *set, OccurredEvents *, int nevents, long timeout);
> >> >
> >> >   which would return the number of events that happened, and would
> >> >   basically "fill" one of the (usually stack allocated) OccurredEvent
> >> >   structures with what happened.
> >>
> >> I definitely think something along these lines is useful.  I want to
> >> be able to have an Append node with 100 ForeignScans under it and kick
> >> off all the scans asynchronously and wait for all of the FDs at once.
> >
> > So you'd like to get only an event for the FD with data back? Or are you
> > ok with iterating through hundred elements in an array, to see which are
> > ready?
> 
> I'd like to get an event back for the FD with data.  Iterating sounds
> like it could be really slow.  Say you get lots of little packets back
> from the same connection, while the others are idle.  Now you've got
> to keep iterating through them all over and over again.  Blech.

Well, that's what poll() and select() require you to do internally
anyway, even if we abstract it away. But most platforms have better
implementations (epoll, kqueue, ...), so it seems fair to design for
those.
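
As a minimal standalone illustration (Linux-only; not PostgreSQL code)
of the difference: epoll_wait() hands back only the descriptors that
actually fired, while poll() forces a scan of the whole array:

    #include <stdio.h>
    #include <sys/epoll.h>
    #include <unistd.h>

    int
    main(void)
    {
        struct epoll_event ev, ready[8];
        int         fds[2];
        int         epfd = epoll_create1(0);
        int         nready;

        pipe(fds);
        ev.events = EPOLLIN;
        ev.data.fd = fds[0];
        epoll_ctl(epfd, EPOLL_CTL_ADD, fds[0], &ev);

        write(fds[1], "x", 1);

        /* returns only the ready descriptors, however many are registered */
        nready = epoll_wait(epfd, ready, 8, -1);
        for (int i = 0; i < nready; i++)
            printf("fd %d is ready\n", ready[i].data.fd);
        return 0;
    }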


Andres



Re: Performance degradation in commit ac1d794

From
Andres Freund
Date:
Hi,

On 2016-03-16 15:41:28 -0400, Robert Haas wrote:
> On Wed, Mar 16, 2016 at 3:25 PM, Andres Freund <andres@anarazel.de> wrote:
> >> > - Given current users we don't need a large amount of events, so having
> >> >   to iterate through the registered events doesn't seem bothersome. We
> >> >   could however change the api to be something like
> >> >
> >> >   int WaitLatchEventSet(LatchEventSet *set, OccurredEvents *, int nevents, long timeout);
> >> >
> >> >   which would return the number of events that happened, and would
> >> >   basically "fill" one of the (usually stack allocated) OccurredEvent
> >> >   structures with what happened.
> >>
> >> I definitely think something along these lines is useful.  I want to
> >> be able to have an Append node with 100 ForeignScans under it and kick
> >> off all the scans asynchronously and wait for all of the FDs at once.
> >
> > So you'd like to get only an event for the FD with data back? Or are you
> > ok with iterating through hundred elements in an array, to see which are
> > ready?
>
> I'd like to get an event back for the FD with data.  Iterating sounds
> like it could be really slow.  Say you get lots of little packets back
> from the same connection, while the others are idle.  Now you've got
> to keep iterating through them all over and over again.  Blech.

I've a (working) WIP version that works like I think you want.  It's
implemented in patch 05 (rework with just poll() support) and (add epoll
support). It's based on patches posted here earlier, but those aren't
interesting for the discussion.

The API is now:

typedef struct WaitEventSet WaitEventSet;

typedef struct WaitEvent
{
    int        pos;        /* position in the event data structure */
    uint32    events;        /* tripped events */
    int        fd;            /* fd associated with event */
} WaitEvent;

/*
 * Create a WaitEventSet with space for nevents different events to wait for.
 *
 * latch may be NULL.
 */
extern WaitEventSet *CreateWaitEventSet(int nevents);

/* ---
 * Add an event to the set. Possible events are:
 * - WL_LATCH_SET: Wait for the latch to be set
 * - WL_POSTMASTER_DEATH: Wait for postmaster to die
 * - WL_SOCKET_READABLE: Wait for socket to become readable
 *   can be combined in one event with WL_SOCKET_WRITEABLE
 * - WL_SOCKET_WRITABLE: Wait for socket to become writeable
 *   can be combined with WL_SOCKET_READABLE
 *
 * Returns the offset in WaitEventSet->events (starting from 0), which can be
 * used to modify previously added wait events.
 */
extern int AddWaitEventToSet(WaitEventSet *set, uint32 events, int fd, Latch *latch);

/*
 * Change the event mask and, if applicable, the associated latch of a
 * WaitEvent.
 */
extern void ModifyWaitEvent(WaitEventSet *set, int pos, uint32 events, Latch *latch);

/*
 * Wait for events added to the set to happen, or until the timeout is
 * reached.  At most nevents occurred events are returned.
 *
 * Returns the number of events that occurred, or 0 if the timeout was reached.
 */
extern int WaitEventSetWait(WaitEventSet *set, long timeout, WaitEvent* occurred_events, int nevents);


I've for now left the old latch API in place, and only converted
be-secure.c to the new style.  I'd appreciate some feedback before I go
around and convert and polish everything.
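
To make the intended use concrete, a caller along the lines of
secure_read() could, under the declarations above, end up looking
roughly like this (a sketch, not the actual converted code):

    /* set up once, e.g. at connection start */
    WaitEventSet *set = CreateWaitEventSet(3);
    WaitEvent     event;
    int           sock_pos;

    AddWaitEventToSet(set, WL_LATCH_SET, -1, MyLatch);
    AddWaitEventToSet(set, WL_POSTMASTER_DEATH, -1, NULL);
    sock_pos = AddWaitEventToSet(set, WL_SOCKET_READABLE, port->sock, NULL);

    /* later: switch the socket to write interest without rebuilding the set */
    ModifyWaitEvent(set, sock_pos, WL_SOCKET_WRITABLE, NULL);

    if (WaitEventSetWait(set, 0 /* no timeout */, &event, 1) == 1 &&
        (event.events & WL_LATCH_SET))
        ResetLatch(MyLatch);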

Questions:
* I'm kinda inclined to merge the win32 and unix latch
  implementations. There's already a fair bit in common, and this is
  just going to increase that amount.

* Right now the caller has to allocate the WaitEvents he's waiting for
  locally (likely on the stack), but we also could allocate them as part
  of the WaitEventSet. Not sure if that'd be a benefit.

* I can do a blind rewrite of the windows implementation, but I'm
  obviously not going to get that entirely right. So I need some help
  from a windows person to test this.

* This approach, with a 'stateful' wait event data structure, will
  actually allow us to fix a couple of lingering bugs we have on the windows
  port. C.f. http://www.postgresql.org/message-id/4351.1336927207@sss.pgh.pa.us

- Andres

Attachment

Re: Performance degradation in commit ac1d794

From
Amit Kapila
Date:
On Thu, Mar 17, 2016 at 7:34 AM, Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> * I can do a blind rewrite of the windows implementation, but I'm
>   obviously not going to get that entirely right. So I need some help
>   from a windows person to test this.
>

I can help you verifying the windows implementation.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Performance degradation in commit ac1d794

From
Robert Haas
Date:
On Wed, Mar 16, 2016 at 10:04 PM, Andres Freund <andres@anarazel.de> wrote:
> Questions:
> * I'm kinda inclined to merge the win32 and unix latch
>   implementations. There's already a fair bit in common, and this is
>   just going to increase that amount.

Don't care either way.

> * Right now the caller has to allocate the WaitEvents he's waiting for
>   locally (likely on the stack), but we also could allocate them as part
>   of the WaitEventSet. Not sure if that'd be a benefit.

I'm not seeing this.  What do you mean?

0001: Looking at this again, I'm no longer sure this is a bug.
Doesn't your patch just check the same conditions in the opposite
order?

0002: I think I reviewed this before.  Boring.  Just commit it already.

0003: Mostly boring.  But the change to win32_latch.c seems to remove
an unrelated check.

0004:

+         * drain it everytime WaitLatchOrSocket() is used. Should the
+         * pipe-buffer fill up in some scenarios - widly unlikely - we're

every time
wildly

Why is it wildly (or widly) unlikely?

The rejiggering this does of what ends up in which element of pfds[]
appears to be unrelated to the ostensible purpose of the patch.

+         * Check again wether latch is set, the arrival of a signal/self-byte

whether.  Also not clearly related to the patch's main purpose.
            /* at least one event occurred, so check masks */
+            if (FD_ISSET(selfpipe_readfd, &input_mask))
+            {
+                /* There's data in the self-pipe, clear it. */
+                drainSelfPipe();
+            }

The comment just preceding this added hunk now seems to be out of
place, and maybe inaccurate as well.  I think the new code could have
a bit more detailed comment.  My understanding is something like /*
Since we didn't clear the self-pipe before attempting to wait,
select() may have returned immediately even though there has been no
recent change to the state of the latch.  To prevent busy-looping, we
must clear the pipe before attempting to wait again. */

I'll look at 0005 next, but thought I would send these comments along first.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Performance degradation in commit ac1d794

From
Robert Haas
Date:
On Thu, Mar 17, 2016 at 9:01 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> I'll look at 0005 next, but thought I would send these comments along first.

0005: This is obviously very much WIP, but I think the overall
direction of it is good.
0006: Same.

I think you should use PGINVALID_SOCKET rather than -1 in various
places in various patches in this series, especially if you are going
to try to merge the Windows code path.

I wonder if CreateEventSet should accept a MemoryContext argument.  It
seems like callers will usually want TopMemoryContext, and just being
able to pass that might be more convenient than having to switch back
and forth in the calling code.
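
Concretely, that suggestion amounts to something like this (a sketch,
not the posted patch):

    extern WaitEventSet *CreateWaitEventSet(MemoryContext context, int nevents);

    /* a long-lived set, with no MemoryContextSwitchTo() at the call site: */
    set = CreateWaitEventSet(TopMemoryContext, 3);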

I wonder if there's a way to refactor this code to avoid having so
much cut-and-paste duplication.

When iterating over the returned events, maybe check whether events is
0 at the top of the loop and skip it forthwith if so.

That's all I've got for now.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Performance degradation in commit ac1d794

From
Andres Freund
Date:
On 2016-03-17 09:40:08 -0400, Robert Haas wrote:
> On Thu, Mar 17, 2016 at 9:01 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> > I'll look at 0005 next, but thought I would send these comments along first.
> 
> 0005: This is obviously very much WIP, but I think the overall
> direction of it is good.
> 0006: Same.
> 
> I think you should use PGINVALID_SOCKET rather than -1 in various
> places in various patches in this series, especially if you are going
> to try to merge the Windows code path.

Sure.


> I wonder if CreateEventSet should accept a MemoryContext argument.  It
> seems like callers will usually want TopMemoryContext, and just being
> able to pass that might be more convenient than having to switch back
> and forth in the calling code.

Makes sense.


> I wonder if there's a way to refactor this code to avoid having so
> much cut-and-paste duplication.

I guess you mean WaitEventSetWait() and WaitEventAdjust*?  I've tried,
and my attempt ended up looking nearly unreadable, because of the number
of ifdefs; I've not found a good approach. Which is sad, because adding
back select support is going to increase the duplication further :( -
but it's also further away from poll etc. (different type of timeout,
entirely different way of returning events).


> When iterating over the returned events, maybe check whether events is
> 0 at the top of the loop and skip it forthwith if so.

You mean in WaitEventSetWait()? There's

    else if (rc == 0)
    {
        break;
    }

which is the timeout case. There should never be any other case of
returning 0 elements?

> That's all I've got for now.

Thanks for looking.

Greetings,

Andres Freund



Re: Performance degradation in commit ac1d794

From
Andres Freund
Date:
Hi,

On 2016-03-17 09:01:36 -0400, Robert Haas wrote:
> 0001: Looking at this again, I'm no longer sure this is a bug.
> Doesn't your patch just check the same conditions in the opposite
> order?

Yes, that's what's required

> 0004:
> 
> +         * drain it everytime WaitLatchOrSocket() is used. Should the
> +         * pipe-buffer fill up in some scenarios - widly unlikely - we're
> 
> every time
> wildly
> 
> Why is it wildly (or widly) unlikely?
> 
> The rejiggering this does between what is on which element of pfds[]
> appears to be unrelated to the ostensible purpose of the patch.

Well, not really. We need to know when to do drainSelfPipe(), which gets
more complicated if pfds[0] is registered optionally.

I'm actually considering dropping this entirely, given the much heavier
rework in the WaitEventSet patch, which makes these details a bit obsolete.


Greetings,

Andres Freund



Re: Performance degradation in commit ac1d794

From
Andres Freund
Date:
On 2016-03-17 09:01:36 -0400, Robert Haas wrote:
> > * Right now the caller has to allocate the WaitEvents he's waiting for
> >   locally (likely on the stack), but we also could allocate them as part
> >   of the WaitEventSet. Not sure if that'd be a benefit.
>
> I'm not seeing this.  What do you mean?

Right now, to use a WaitEventSet you'd do something like

    WaitEvent   event;

    ModifyWaitEvent(FeBeWaitSet, 0, waitfor, NULL);
    WaitEventSetWait(FeBeWaitSet, 0 /* no timeout */, &event, 1);

i.e. use a WaitEvent on the stack to receive the changes. If you wanted
to get more changes than just one, you could end up allocating a fair
bit of stack space.

We could instead allocate the returned events as part of the event set,
and return them: either by returning a NULL-terminated array, or by
continuing to return the number of events as now and additionally
returning the event data structure via a pointer.

So the above would be

    WaitEvent  *events;
    int         nevents;

    ModifyWaitEvent(FeBeWaitSet, 0, waitfor, NULL);
    nevents = WaitEventSetWait(FeBeWaitSet, 0 /* no timeout */, events, 10);

    for (int off = 0; off < nevents; off++)
        ; // stuff

Andres



Re: Performance degradation in commit ac1d794

From
Andres Freund
Date:
Hi,

On 2016-03-17 09:01:36 -0400, Robert Haas wrote:
> 0001: Looking at this again, I'm no longer sure this is a bug.
> Doesn't your patch just check the same conditions in the opposite
> order?

Which is important, because what ends up in which pfds[x] depends on
wakeEvents. Folded it into a later patch; it's not harmful as long as
we're only ever testing pfds[0].


> 0003: Mostly boring.  But the change to win32_latch.c seems to remove
> an unrelated check.

Argh.


> 0004:
>
> +         * drain it everytime WaitLatchOrSocket() is used. Should the
> +         * pipe-buffer fill up in some scenarios - widly unlikely - we're
>
> every time
> wildly
>
> Why is it wildly (or widly) unlikely?

Because SetLatch (if the owner) checks latch->is_set before adding to the
pipe, and latch_sigusr1_handler() only writes to the pipe if the current
process is in WaitLatchOrSocket's loop (via the waiting check). Expanded
comment.


>
> +         * Check again wether latch is set, the arrival of a signal/self-byte
>
> whether.  Also not clearly related to the patch's main purpose.

After the change there's no need to re-compute the current timestamp
anymore; that does seem beneficial and kinda related.


>              /* at least one event occurred, so check masks */
> +            if (FD_ISSET(selfpipe_readfd, &input_mask))
> +            {
> +                /* There's data in the self-pipe, clear it. */
> +                drainSelfPipe();
> +            }
>
> The comment just preceding this added hunk now seems to be out of
> place, and maybe inaccurate as well.

Hm. Which comment exactly are you referring to?

    /* at least one event occurred, so check masks */

seems not to fit the bill?


> I think the new code could have
> a bit more detailed comment.  My understanding is something like /*
> Since we didn't clear the self-pipe before attempting to wait,
> select() may have returned immediately even though there has been no
> recent change to the state of the latch.  To prevent busy-looping, we
> must clear the pipe before attempting to wait again. */

Isn't that explained at the top, in
        /*
         * Check if the latch is set already. If so, leave loop immediately,
         * avoid blocking again. We don't attempt to report any other events
         * that might also be satisfied.
         *
         * If someone sets the latch between this and the poll()/select()
         * below, the setter will write a byte to the pipe (or signal us and
         * the signal handler will do that), and the poll()/select() will
         * return immediately.
         *
         * If there's a pending byte in the self pipe, we'll notice whenever
         * blocking. Only clearing the pipe in that case avoids having to
         * drain it every time WaitLatchOrSocket() is used. Should the
         * pipe-buffer fill up in some scenarios - wildly unlikely - we're
         * still ok, because the pipe is in nonblocking mode.
?

I've updated the last paragraph to
         * If there's a pending byte in the self pipe, we'll notice whenever
         * blocking. Only clearing the pipe in that case avoids having to
         * drain it every time WaitLatchOrSocket() is used. Should the
         * pipe-buffer fill up we're still ok, because the pipe is in
         * nonblocking mode. It's unlikely for that to happen, because the
         * self pipe isn't filled unless we're blocking (waiting = true), or
         * from inside a signal handler in latch_sigusr1_handler().
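
For reference, the drain step that this comment relies on looks roughly
like the following (a sketch; the real latch code differs in details
such as error handling):

    static void
    drainSelfPipe(void)
    {
        /* selfpipe_readfd is the O_NONBLOCK read end of the self-pipe */
        char        buf[16];

        for (;;)
        {
            int         rc = read(selfpipe_readfd, buf, sizeof(buf));

            if (rc == 0)
                elog(ERROR, "unexpected EOF on self-pipe");
            else if (rc < 0)
            {
                if (errno == EAGAIN || errno == EWOULDBLOCK)
                    break;      /* the pipe is now empty */
                if (errno == EINTR)
                    continue;   /* interrupted, retry the read */
                elog(ERROR, "read() on self-pipe failed: %m");
            }
            /* else: consumed some bytes, loop around to drain the rest */
        }
    }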

I've also applied the same optimization to windows. Less because I found
that interesting in itself, and more because it makes the WaitEventSet
easier.


Attached is a significantly revised version of the earlier series. Most
importantly I have:
* Unified the windows/unix latch implementation into one file (0004)
* Provided a select(2) implementation for the WaitEventSet API
* Provided a windows implementation for the WaitEventSet API
* Reduced duplication between the implementations a good bit by
  splitting WaitEventSetWait into WaitEventSetWait and
  WaitEventSetWaitBlock. Only the latter is implemented separately for
  each readiness primitive
* Added a backward-compatibility implementation of WaitLatchOrSocket
  using the WaitEventSet stuff. Less because I thought that to be
  terribly important, and more because it makes the patch a *lot*
  smaller.  We collected a fair amount of latch users.

This is still not fully ready. The main remaining items are testing (the
windows stuff I've only verified using cross-compiling with mingw) and
documentation.

I'd greatly appreciate a look.

Amit, you offered testing on windows; could you check whether 3/4/5
work? It's quite likely that I've screwed up something.


Robert, you'd mentioned on IM that you've a use-case for this somewhere
around multiple FDWs. If somebody has started working on that, could you
ask that person to check whether the API makes sense?

Greetings,

Andres Freund

Attachment

Re: Performance degradation in commit ac1d794

From
Robert Haas
Date:
On Thu, Mar 17, 2016 at 10:57 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-03-17 09:01:36 -0400, Robert Haas wrote:
>> 0001: Looking at this again, I'm no longer sure this is a bug.
>> Doesn't your patch just check the same conditions in the opposite
>> order?
>
> Yes, that's what's required

I mean, they are just variables.  You can check them in either order
and get the same results, no?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Performance degradation in commit ac1d794

From
Robert Haas
Date:
On Thu, Mar 17, 2016 at 10:53 AM, Andres Freund <andres@anarazel.de> wrote:
>> I wonder if there's a way to refactor this code to avoid having so
>> much cut-and-paste duplication.
>
> I guess you mean WaitEventSetWait() and WaitEventAdjust*?  I've tried,
> and my attempt ended up look nearly unreadable, because of the number of
> ifdefs. I've not found a good attempt. Which is sad, because adding back
> select support is going to increase the duplication further :( - but
> it's also further away from poll etc. (different type of timestamp,
> entirely different way of returming events).

I was more thinking of stuff like this:

+            /*
+             * We expect an EPOLLHUP when the remote end is closed, but
+             * because we don't expect the pipe to become readable or to have
+             * any errors either, treat those cases as postmaster death, too.
+             *
+             * According to the select(2) man page on Linux, select(2) may
+             * spuriously return and report a file descriptor as readable,
+             * when it's not; and presumably so can epoll_wait(2).  It's not
+             * clear that the relevant cases would ever apply to the
+             * postmaster pipe, but since the consequences of falsely
+             * returning WL_POSTMASTER_DEATH could be pretty unpleasant, we
+             * take the trouble to positively verify EOF with
+             * PostmasterIsAlive().
+             */

>> 0 at the top of the loop and skip it forthwith if so.
>
> You mean in WaitEventSetWait()? There's
>                 else if (rc == 0)
>                 {
>                         break;
>                 }
> which is the timeout case. There should never be any other case of
>                 returning 0 elements?

No, I meant if (cur_event->events == 0) continue;

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Performance degradation in commit ac1d794

From
Robert Haas
Date:
On Thu, Mar 17, 2016 at 11:17 AM, Andres Freund <andres@anarazel.de> wrote:
> Right now, to use a WaitEventSet you'd do something like
>                 WaitEvent   event;
>
>                 ModifyWaitEvent(FeBeWaitSet, 0, waitfor, NULL);
>
>                 WaitEventSetWait(FeBeWaitSet, 0 /* no timeout */, &event, 1);
>
> i.e. use a WaitEvent on the stack to receive the changes. If you wanted
> to get more changes than just one, you could end up allocating a fair
> bit of stack space.
>
> We could instead allocate the returned events as part of the event set,
> and return them. Either by returning a NULL terminated array, or by
> continuing to return the number of events as now, and additionally
> return the event data structure via a pointer.
>
> So the above would be
>
>                 WaitEvent   *events;
>                 int nevents;
>
>                 ModifyWaitEvent(FeBeWaitSet, 0, waitfor, NULL);
>
>                 nevents = WaitEventSetWait(FeBeWaitSet, 0 /* no timeout */, events, 10);
>
>                 for (int off = 0; off < nevents; off++)
>                     ; // stuff

I don't see this as being particularly better.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Performance degradation in commit ac1d794

From
Amit Kapila
Date:
On Fri, Mar 18, 2016 at 1:34 PM, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2016-03-17 09:01:36 -0400, Robert Haas wrote:
> > 0001: Looking at this again, I'm no longer sure this is a bug.
> > Doesn't your patch just check the same conditions in the opposite
> > order?
>
> Which is important, because what's in what pfds[x] depends on
> wakeEvents. Folded it into a later patch; it's not harmful as long as
> we're only ever testing pfds[0].
>
>
> > 0003: Mostly boring.  But the change to win32_latch.c seems to remove
> > an unrelated check.
>
> Argh.
>

+ * from inside a signal handler in latch_sigusr1_handler().
  *
  * Note: we assume that the kernel calls involved in drainSelfPipe()
  * and SetLatch() will provide adequate synchronization on machines
  * with weak memory ordering, so that we cannot miss seeing is_set if
  * the signal byte is already in the pipe when we drain it.
  */

- drainSelfPipe();
-

The above part of the comment looks redundant after this patch.  I have done some tests on Windows with 0003 patch which includes running the regressions (vcregress check) and it passes.  Will look into it tomorrow once again and share if I find anything wrong with it, but feel free to proceed if you want.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Performance degradation in commit ac1d794

From
Andres Freund
Date:
On 2016-03-18 05:56:41 -0400, Robert Haas wrote:
> >> 0 at the top of the loop and skip it forthwith if so.
> >
> > You mean in WaitEventSetWait()? There's
> >                 else if (rc == 0)
> >                 {
> >                         break;
> >                 }
> > which is the timeout case. There should never be any other case of
> >                 returning 0 elements?
> 
> No, I meant if (cur_event->events == 0) continue;

I'm not following. Why would there be an event with an empty event
mask? Ok, you can disable all notifications for a socket using
ModifyWaitEvent(), but that's not particularly common, right?  At least
for epoll, it'd not play a role anyway, since epoll_wait() will actually
return pointers to the elements we're waiting on; for windows we get the
offset in ->handles.  I guess we could do so in the select/poll case,
but adding another if for something infrequent doesn't strike me as a
great benefit.

- Andres



Re: Performance degradation in commit ac1d794

From
Robert Haas
Date:
On Fri, Mar 18, 2016 at 1:53 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-03-18 05:56:41 -0400, Robert Haas wrote:
>> >> 0 at the top of the loop and skip it forthwith if so.
>> >
>> > You mean in WaitEventSetWait()? There's
>> >                 else if (rc == 0)
>> >                 {
>> >                         break;
>> >                 }
>> > which is the timeout case. There should never be any other case of
>> >                 returning 0 elements?
>>
>> No, I meant if (cur_event->events == 0) continue;
>
> I'm not following. Why would there be an event without an empty event
> mask? Ok, you can disable all notifications for a socket using
> ModifyWaitEvent(), but that's not particularly common, right?  At least
> for epoll, it'd not play a role anyway, since epoll_wait() will actually
> return pointers to the elements we're waiting on; for windows we get the
> offset in ->handles.  I guess we could do so in the select/poll case,
> but adding another if for something infrequent doesn't strike me as a
> great benefit.

No, I mean it should be quite common for a particular fd to have no
events reported.  If we're polling on 100 fds and 1 of them is active
and the other 99 are just sitting there, we want to skip over the
other 99 as quickly as possible.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Performance degradation in commit ac1d794

From
Andres Freund
Date:
On 2016-03-18 14:00:58 -0400, Robert Haas wrote:
> No, I mean it should be quite common for a particular fd to have no
> events reported.  If we're polling on 100 fds and 1 of them is active
> and the other 99 are just sitting there, we want to skip over the
> other 99 as quickly as possible.

cur_event points to the event that's registered with the set, not the
one that's "returned" or "modified" by poll/select; that's where the
confusion is coming from. I'll add a fastpath.
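
Such a fastpath in the poll() result loop could look like this (a
sketch; the variable names are assumptions, not the patch's):

    for (i = 0; i < nfds && returned < nevents; i++)
    {
        if (pfds[i].revents == 0)
            continue;           /* idle fd, skip it immediately */

        occurred_events[returned].pos = i;
        occurred_events[returned].fd = pfds[i].fd;
        occurred_events[returned].events = 0;
        /* ... translate pfds[i].revents into WL_* flags here ... */
        returned++;
    }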

Greetings,

Andres Freund



Re: Performance degradation in commit ac1d794

From
Andres Freund
Date:
On 2016-03-18 20:14:07 +0530, Amit Kapila wrote:
> + * from inside a signal handler in latch_sigusr1_handler().
>   *
>   * Note: we assume that the kernel calls involved in drainSelfPipe()
>   * and SetLatch() will provide adequate synchronization on machines
>   * with weak memory ordering, so that we cannot miss seeing is_set if
>   * the signal byte is already in the pipe when we drain it.
>   */
>
> - drainSelfPipe();
> -
>
> The above part of the comment looks redundant after this patch.

Don't think so. Moving it closer to the drainSelfPipe() call might be
neat, but since there are several callsites...

> I have done some
> tests on Windows with 0003 patch which includes running the regressions
> (vcregress check) and it passes.  Will look into it tomorrow once again and
> share if I find anything wrong with it, but feel free to proceed if you want.

Thanks for the testing thus far! Let's see what the buildfarm has to
say.

Andres



Re: Performance degradation in commit ac1d794

From
Alvaro Herrera
Date:
Andres Freund wrote:

> From 1d444b0855dbf65d66d73beb647b772fff3404c8 Mon Sep 17 00:00:00 2001
> From: Andres Freund <andres@anarazel.de>
> Date: Fri, 18 Mar 2016 00:52:07 -0700
> Subject: [PATCH 4/5] Combine win32 and unix latch implementations.
> 
> Previously latches for windows and unix had been implemented in
> different files. The next patch in this series will introduce an
> expanded wait infrastructure, keeping the implementation separate would
> introduce too much duplication.
> 
> This basically just moves the functions, without too much change. The
> reason to keep this separate is that it allows blame to continue working
> a little less badly; and to make review a tiny bit easier.

This seems a reasonable change, but I think that the use of WIN32 vs.
LATCH_USE_WIN32 is pretty confusing.  In particular, LATCH_USE_WIN32
isn't actually used for anything ... I suppose we don't care since this
is a temporary state of affairs only?

In 0005: In latch.c you typedef WaitEventSet, but the typedef already
appears in latch.h.  You need only declare the struct in latch.c,
without typedef'ing.


Haven't really reviewed anything here yet, just skimming ATM.  Having so
many #ifdefs all over the place in this file looks really bad, but I
guess there's no way around that because this is very platform-specific.
I hope pgindent doesn't choke on it.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Performance degradation in commit ac1d794

From
Andres Freund
Date:
On 2016-03-18 15:49:51 -0300, Alvaro Herrera wrote:
> This seems a reasonable change, but I think that the use of WIN32 vs.
> LATCH_USE_WIN32 is pretty confusing.  In particular, LATCH_USE_WIN32
> isn't actually used for anything ... I suppose we don't care since this
> is a temporary state of affairs only?

Hm, I guess we could make some more of those use LATCH_USE_WIN32.
There are essentially two axes here: a) latch notification method
(self-pipe vs windows events), and b) readiness notification (epoll vs
poll vs select vs WaitForMultipleObjects).
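
Spelled out as preprocessor symbols, the two axes could look like this
(the symbol names here are illustrative, not necessarily the patch's):

    /* axis a: latch notification method */
    #if defined(WIN32)
    #define LATCH_USE_WIN32_EVENTS
    #else
    #define LATCH_USE_SELF_PIPE
    #endif

    /* axis b: readiness notification primitive */
    #if defined(HAVE_SYS_EPOLL_H)
    #define WAIT_USE_EPOLL
    #elif defined(HAVE_POLL)
    #define WAIT_USE_POLL
    #elif defined(WIN32)
    #define WAIT_USE_WIN32
    #else
    #define WAIT_USE_SELECT
    #endif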


> In 0005: In latch.c you typedef WaitEventSet, but the typedef already
> appears in latch.h.  You need only declare the struct in latch.c,
> without typedef'ing.

Good catch. It's even important; some compilers choke on that.


> Haven't really reviewed anything here yet, just skimming ATM.  Having so
> many #ifdefs all over the place in this file looks really bad, but I
> guess there's no way around that because this is very platform-specific.

I think it's hard to further reduce the number of ifdefs, but if you
have ideas...


> I hope pgindent doesn't choke on it.

The patch is pgindented (I'd personally never decrease indentation just
to fit a line into 79 chars, as pgindent does...).

Thanks for looking!


Greetings,

Andres Freund



Re: Performance degradation in commit ac1d794

From
Amit Kapila
Date:
On Sat, Mar 19, 2016 at 12:00 AM, Andres Freund <andres@anarazel.de> wrote:
>
> On 2016-03-18 20:14:07 +0530, Amit Kapila wrote:
>
> > I have done some
> > tests on Windows with 0003 patch which includes running the regressions
> > (vcregress check) and it passes.  Will look into it tomorrow once again and
> > share if I find anything wrong with it, but feel free to proceed if you want.
>
> Thanks for the testing thus far! Let's see what the buildfarm has to
> say.
>

Won't the new code need to ensure that ResetEvent(latchevent) should get called in case WaitForMultipleObjects() comes out when both pgwin32_signal_event and latchevent are signalled at the same time?


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Performance degradation in commit ac1d794

From
Andres Freund
Date:

On March 18, 2016 11:32:53 PM PDT, Amit Kapila <amit.kapila16@gmail.com> wrote:
>On Sat, Mar 19, 2016 at 12:00 AM, Andres Freund <andres@anarazel.de>
>wrote:
>>
>> On 2016-03-18 20:14:07 +0530, Amit Kapila wrote:
>>
>> > I have done some
>> > tests on Windows with 0003 patch which includes running the
>regressions
>> > (vcregress check) and it passes.  Will look into it tomorrow once
>again
>and
>> > share if I find anything wrong with it, but feel free to proceed if you
>want.
>>
>> Thanks for the testing thus far! Let's see what the buildfarm has to
>> say.
>>
>
>Won't the new code need to ensure that ResetEvent(latchevent) should
>get
>called in case WaitForMultipleObjects() comes out when both
>pgwin32_signal_event and latchevent are signalled at the same time?

WaitForMultiple only reports the readiness of one event at a time, no?


-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: Performance degradation in commit ac1d794

From
Amit Kapila
Date:
On Sat, Mar 19, 2016 at 12:14 PM, Andres Freund <andres@anarazel.de> wrote:
>
>
>
> On March 18, 2016 11:32:53 PM PDT, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >On Sat, Mar 19, 2016 at 12:00 AM, Andres Freund <andres@anarazel.de>
> >wrote:
> >>
> >> On 2016-03-18 20:14:07 +0530, Amit Kapila wrote:
> >>
> >> > I have done some
> >> > tests on Windows with 0003 patch which includes running the
> >regressions
> >> > (vcregress check) and it passes.  Will look into it tomorrow once
> >again
> >and
> >> > share if I find anything wrong with it, but feel free to proceed if you
> >want.
> >>
> >> Thanks for the testing thus far! Let's see what the buildfarm has to
> >> say.
> >>
> >
> >Won't the new code need to ensure that ResetEvent(latchevent) should
> >get
> >called in case WaitForMultipleObjects() comes out when both
> >pgwin32_signal_event and latchevent are signalled at the same time?
>
> WaitForMultiple only reports the readiness of one event at a time, no?
>

I don't think so; please read link [1] with a focus on the paragraph below, which states how it reports the readiness or signaled state when multiple objects become signaled.

"When bWaitAll is FALSE, this function checks the handles in the array in order starting with index 0, until one of the objects is signaled. If multiple objects become signaled, the function returns the index of the first handle in the array whose object was signaled."





With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Performance degradation in commit ac1d794

From
Andres Freund
Date:

On March 18, 2016 11:52:08 PM PDT, Amit Kapila <amit.kapila16@gmail.com> wrote:
>On Sat, Mar 19, 2016 at 12:14 PM, Andres Freund <andres@anarazel.de>
>wrote:
>>
>>
>>
>> On March 18, 2016 11:32:53 PM PDT, Amit Kapila
><amit.kapila16@gmail.com>
>wrote:
>> >On Sat, Mar 19, 2016 at 12:00 AM, Andres Freund <andres@anarazel.de>
>> >wrote:
>> >>
>> >> On 2016-03-18 20:14:07 +0530, Amit Kapila wrote:
>> >>
>> >> > I have done some
>> >> > tests on Windows with 0003 patch which includes running the
>> >regressions
>> >> > (vcregress check) and it passes.  Will look into it tomorrow
>once
>> >again
>> >and
>> >> > share if I find anything wrong with it, but feel free to proceed if
>you
>> >want.
>> >>
>> >> Thanks for the testing thus far! Let's see what the buildfarm has
>to
>> >> say.
>> >>
>> >
>> >Won't the new code need to ensure that ResetEvent(latchevent)
>should
>> >get
>> >called in case WaitForMultipleObjects() comes out when both
>> >pgwin32_signal_event and latchevent are signalled at the same time?
>>
>> WaitForMultiple only reports the readiness of one event at a time, no?
>>
>
>I don't think so; please read link [1] with a focus on the paragraph below,
>which states how it reports the readiness or signaled state when
>multiple
>objects become signaled.
>
>"When *bWaitAll* is *FALSE*, this function checks the handles in the
>array
>in order starting with index 0, until one of the objects is signaled.
>If
>multiple objects become signaled, the function returns the index of the
>first handle in the array whose object was signaled."

I think that's OK. We'll just get the next event the next time we call waitfor*. It's also not different to the way the
routine is currently handling normal socket and postmaster events, no?  It'd be absurdly broken if it handled edge
triggered events like FD_CLOSE in a way you can't discover.
 

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: Performance degradation in commit ac1d794

From
Amit Kapila
Date:
On Sat, Mar 19, 2016 at 12:40 PM, Andres Freund <andres@anarazel.de> wrote:
>
> On March 18, 2016 11:52:08 PM PDT, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >> >Won't the new code need to ensure that ResetEvent(latchevent)
> >should
> >> >get
> >> >called in case WaitForMultipleObjects() comes out when both
> >> >pgwin32_signal_event and latchevent are signalled at the same time?
> >>
> >> WaitForMultiple only reports the readiness of one event at a time, no?
> >>
> >
> >I don't think so; please read link [1] with a focus on the paragraph below,
> >which states how it reports the readiness or signaled state when
> >multiple
> >objects become signaled.
> >
> >"When *bWaitAll* is *FALSE*, this function checks the handles in the
> >array
> >in order starting with index 0, until one of the objects is signaled.
> >If
> >multiple objects become signaled, the function returns the index of the
> >first handle in the array whose object was signaled."
>
> I think that's OK. We'll just get the next event the next time we call waitfor*. It's also not different to the way the routine is currently handling normal socket and postmaster events, no?
>

I think the primary difference between the socket and postmaster events as compared to the latch event is that it won't allow starting a wait with the wait event in a signalled state.  For the socket event, it closes the event at the end and creates it again before entering the wait loop in WaitLatchOrSocket.  Even if we don't reset the latch event, I could not see any major problem apart from possible spurious wakeups in a few cases (as we haven't reset the event to a non-signalled state for the latch event before entering the wait, it can just return immediately).


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Performance degradation in commit ac1d794

From
Amit Kapila
Date:
On Fri, Mar 18, 2016 at 1:34 PM, Andres Freund <andres@anarazel.de> wrote:
>
>
> Attached is a significantly revised version of the earlier series. Most
> importantly I have:
> * Unified the window/unix latch implementation into one file (0004)
>

After applying patch 0004* on HEAD, using the command patch -p1 < <path_of_patch>, I am getting a build failure:

c1 : fatal error C1083: Cannot open source file: 'src/backend/storage/ipc/latch.c': No such file or directory

I think it could not rename port/unix_latch.c => storage/ipc/latch.c.  I have tried with git apply, but no success.  Am I doing something wrong?

One minor suggestion about patch:

+#ifndef WIN32
 void
 latch_sigusr1_handler(void)
 {
  if (waiting)
  sendSelfPipeByte();
 }
+#endif   /* !WIN32 */
 
 /* Send one byte to the self-pipe, to wake up WaitLatch */
+#ifndef WIN32
 static void
 sendSelfPipeByte(void)


Instead of individually defining these functions under #ifndef WIN32, isn't it better to combine them all, as they are at the end of the file?


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Performance degradation in commit ac1d794

From
Andres Freund
Date:
On 2016-03-19 15:43:27 +0530, Amit Kapila wrote:
> On Sat, Mar 19, 2016 at 12:40 PM, Andres Freund <andres@anarazel.de> wrote:
> >
> > On March 18, 2016 11:52:08 PM PDT, Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> > >> >Won't the new code need to ensure that ResetEvent(latchevent)
> > >should
> > >> >get
> > >> >called in case WaitForMultipleObjects() comes out when both
> > >> >pgwin32_signal_event and latchevent are signalled at the same time?

> > >> WaitForMultiple only reports the readiness of one event at a time, no?
> > >>
> > >
> > >I don't think so; please read link [1] with a focus on the paragraph below,
> > >which states how it reports the readiness or signaled state when
> > >multiple
> > >objects become signaled.
> > >
> > >"When *bWaitAll* is *FALSE*, this function checks the handles in the
> > >array
> > >in order starting with index 0, until one of the objects is signaled.
> > >If
> > >multiple objects become signaled, the function returns the index of the
> > >first handle in the array whose object was signaled."

I think this is just incredibly bad documentation. See
https://blogs.msdn.microsoft.com/oldnewthing/20150409-00/?p=44273
(Raymond Chen can be considered an authority here imo).

Andres



Re: Performance degradation in commit ac1d794

From
Andres Freund
Date:
On 2016-03-19 16:44:49 +0530, Amit Kapila wrote:
> On Fri, Mar 18, 2016 at 1:34 PM, Andres Freund <andres@anarazel.de> wrote:
> >
> >
> > Attached is a significantly revised version of the earlier series. Most
> > importantly I have:
> > * Unified the window/unix latch implementation into one file (0004)
> >
> 
> After applying patch 0004* on HEAD, using command patch -p1 <
> <path_of_patch>, I am getting a build failure:
> 
> c1 : fatal error C1083: Cannot open source file:
> 'src/backend/storage/ipc/latch.c': No such file or directory
> 
> I think it could not rename port/unix_latch.c => storage/ipc/latch.c.  I
> have tried with git apply, but no success.  Am I doing something wrong?

You skipped applying 0003.

I'll send an updated version - with all the docs and such - in the next
hours.

> One minor suggestion about patch:
> 
> +#ifndef WIN32
>  void
>  latch_sigusr1_handler(void)
>  {
>   if (waiting)
>   sendSelfPipeByte();
>  }
> +#endif   /* !WIN32 */
> 
>  /* Send one byte to the self-pipe, to wake up WaitLatch */
> +#ifndef WIN32
>  static void
>  sendSelfPipeByte(void)
> 
> 
> Instead of individually defining these functions under #ifndef WIN32, isn't
> it better to combine them all as they are at end of file.

They're all at the end of the file. I just found it easier to reason
about if both #if and #endif are visible in one screen full of
code. Don't feel super strong about it tho.

Greetings,

Andres Freund



Re: Performance degradation in commit ac1d794

From
Andres Freund
Date:
On 2016-03-19 18:45:36 -0700, Andres Freund wrote:
> On 2016-03-19 16:44:49 +0530, Amit Kapila wrote:
> > On Fri, Mar 18, 2016 at 1:34 PM, Andres Freund <andres@anarazel.de> wrote:
> > >
> > >
> > > Attached is a significantly revised version of the earlier series. Most
> > > importantly I have:
> > > * Unified the window/unix latch implementation into one file (0004)
> > >
> >
> > After applying patch 0004* on HEAD, using command patch -p1 <
> > <path_of_patch>, I am getting a build failure:
> >
> > c1 : fatal error C1083: Cannot open source file:
> > 'src/backend/storage/ipc/latch.c': No such file or directory
> >
> > I think it could not rename port/unix_latch.c => storage/ipc/latch.c.  I
> > have tried with git apply, but no success.  Am I doing something wrong?
>
> You skipped applying 0003.
>
> I'll send an updated version - with all the docs and such - in the next
> hours.

Here we go. I think this is getting pretty close to being committable,
minus a bit of testing edge cases on unix (postmaster death,
disconnecting clients in various ways (especially with COPY)) and
windows (uh, does it even work at all?).

There are no large code changes in this revision, mainly some code
polishing and a good bit more comment improvements.

Regards,

Andres

Attachment

Re: Performance degradation in commit ac1d794

From
Thomas Munro
Date:
On Sun, Mar 20, 2016 at 5:14 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-03-19 18:45:36 -0700, Andres Freund wrote:
>> On 2016-03-19 16:44:49 +0530, Amit Kapila wrote:
>> > On Fri, Mar 18, 2016 at 1:34 PM, Andres Freund <andres@anarazel.de> wrote:
>> > >
>> > >
>> > > Attached is a significantly revised version of the earlier series. Most
>> > > importantly I have:
>> > > * Unified the window/unix latch implementation into one file (0004)
>> > >
>> >
>> > After applying patch 0004* on HEAD, using command patch -p1 <
>> > <path_of_patch>, I am getting a build failure:
>> >
>> > c1 : fatal error C1083: Cannot open source file:
>> > 'src/backend/storage/ipc/latch.c': No such file or directory
>> >
>> > I think it could not rename port/unix_latch.c => storage/ipc/latch.c.  I
>> > tried with git apply, but had no success.  Am I doing something wrong?
>>
>> You skipped applying 0003.
>>
>> I'll send an updated version - with all the docs and such - in the next
>> few hours.
>
> Here we go. I think this is getting pretty close to being committable,
> minus a bit of testing of edge cases on Unix (postmaster death,
> disconnecting clients in various ways, especially with COPY) and
> Windows (uh, does it even work at all?).
>
> There are no large code changes in this revision, mainly some code
> polishing and a good number of comment improvements.

Hi

I couldn't get the second patch to apply for some reason, but I have
been trying out your "latch" branch on some different OSs and porting
some code that does a bunch of waiting on many sockets over to this
API to try it out.

One thing that I want to do but can't with this interface is remove an
fd from the set.  I can call AddWaitEventToSet, which returns a
position, and I can call ModifyWaitEvent to supply a new event mask by
position (including a zero mask), but I can't actually remove the fd
(for example to avoid future error events that can't be masked, to
allow that fd to be closed and perhaps let that fd number
coincidentally be re-added later, or just generally to free the slot).
There is an underlying way to remove an fd from a set with poll (sort
of), epoll, and kqueue.  (Not sure about Windows.  But surely...)  I
wonder if there should be a RemoveWaitEventFromSet(set, position) which
recycles event slots, sticking them on a freelist (and setting the
corresponding pollfd structs' fd to -1).
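
To make that concrete, here is a minimal sketch of the freelist
variant, assuming the WaitEventSet layout from the posted patch; the
freelist fields (free_head, next_free) are invented for illustration
and are not in the patch:

void
RemoveWaitEventFromSet(WaitEventSet *set, int pos)
{
    WaitEvent  *event = &set->events[pos];

    Assert(pos >= 0 && pos < set->nevents);

#if defined(WAIT_USE_EPOLL)
    /* Unregister the fd from the epoll instance. */
    if (epoll_ctl(set->epoll_fd, EPOLL_CTL_DEL, event->fd, NULL) < 0)
        elog(ERROR, "epoll_ctl() failed: %m");
#elif defined(WAIT_USE_POLL)
    /* poll(2) ignores pollfd entries with a negative fd. */
    set->pollfds[pos].fd = -1;
#endif

    /* Park the slot on a freelist so AddWaitEventToSet can reuse it. */
    event->fd = PGINVALID_SOCKET;
    event->next_free = set->free_head;  /* invented freelist fields */
    set->free_head = pos;
}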

I wonder if it would be useful to reallocate the arrays as needed so
there isn't really a hard limit to the number of things you add, just
an initial size.

A couple of typos:

+ * Wait using linux' epoll_wait(2).

linux's

+       /*
+        * If neither the event mask nor the associated latch changes, return
+        * early. That's an important optimization for some sockets, were
+        * ModifyWaitEvent is frequently used to switch from waiting
for reads to
+        * waiting on writes.
+        */

s/were/where/

+       /*
+        * Even though unused, we also poss epoll_ev as the data argument if
+        * EPOLL_CTL_DELETE is passed as action.  There used to be an epoll bug
+        * requiring that, and acutally it makes the code simpler...
+        */

s/poss/pass/
s/EPOLL_CTL_DELETE/EPOLL_CTL_DEL/
s/acutally/actually/

There is no code passing EPOLL_CTL_DEL, but maybe this comment is a
clue that you have already implemented remove on some other branch...

-- 
Thomas Munro
http://www.enterprisedb.com



Re: Performance degradation in commit ac1d794

From
Andres Freund
Date:
On 2016-03-21 01:31:30 +1300, Thomas Munro wrote:
> I couldn't get the second patch to apply for some reason,

Weird? Even after applying the first one first?


> but I have been trying out your "latch" branch on some different OSs
> and porting some code that does a bunch of waiting on many sockets
> over to this API to try it out.

> One thing that I want to do but can't with this interface is remove an
> fd from the set.

Yea, I didn't see an in-core need for that, so I didn't want to add
that yet.

Other than that, how did things go?


> I can call AddWaitEventToSet, which returns a position, and I can
> call ModifyWaitEvent to supply a new event mask by position (including
> a zero mask), but I can't actually remove the fd (for example to avoid
> future error events that can't be masked, to allow that fd to be
> closed and perhaps let that fd number coincidentally be re-added
> later, or just generally to free the slot).  There is an underlying
> way to remove an fd from a set with poll (sort of), epoll, and kqueue.
> (Not sure about Windows.  But surely...)  I wonder if there should be
> a RemoveWaitEventFromSet(set, position) which recycles event slots,
> sticking them on a freelist (and setting the corresponding pollfd
> structs' fd to -1).

I'm inclined to think that if we add this, we should memmove everything
after the removed slot and adjust the offsets. As we e.g. pass the
entire pollfd array to poll() at once, not having gaps will be more
important.
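
A rough sketch of that compacting variant, reusing the patch's field
names (still hypothetical; the epoll case would additionally need an
EPOLL_CTL_DEL for the removed fd):

void
RemoveWaitEventFromSet(WaitEventSet *set, int pos)
{
    int     i;
    int     ntail = set->nevents - pos - 1;

#if defined(WAIT_USE_POLL)
    /* Keep the pollfd array dense, since we hand it to poll() whole. */
    memmove(&set->pollfds[pos], &set->pollfds[pos + 1],
            ntail * sizeof(struct pollfd));
#endif
    memmove(&set->events[pos], &set->events[pos + 1],
            ntail * sizeof(WaitEvent));
    set->nevents--;

    /* Every event after 'pos' shifted down; fix up recorded positions. */
    for (i = pos; i < set->nevents; i++)
        set->events[i].pos = i;
}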

What's the use case where you need that?


> I wonder if it would be useful to reallocate the arrays as needed so
> there isn't really a hard limit to the number of things you add, just
> an initial size.

That doubles the number of palloc()s required. As lots of these sets
are actually going to be very short-lived (a single WaitLatch call),
that's something I'd rather avoid. Although I guess you could just
allocate separate arrays at that moment.


> A couple of typos:
> 
> + * Wait using linux' epoll_wait(2).
> 
> linux's
> 
> +       /*
> +        * If neither the event mask nor the associated latch changes, return
> +        * early. That's an important optimization for some sockets, were
> +        * ModifyWaitEvent is frequently used to switch from waiting
> for reads to
> +        * waiting on writes.
> +        */
> 
> s/were/where/
> 
> +       /*
> +        * Even though unused, we also poss epoll_ev as the data argument if
> +        * EPOLL_CTL_DELETE is passed as action.  There used to be an epoll bug
> +        * requiring that, and acutally it makes the code simpler...
> +        */
> 
> s/poss/pass/
> s/EPOLL_CTL_DELETE/EPOLL_CTL_DEL/
> s/acutally/actually/
>

Thanks!


> There is no code passing EPOLL_CTL_DEL, but maybe this comment is a
> clue that you have already implemented remove on some other branch...

No, I've not, sorry...


Andres



Re: Performance degradation in commit ac1d794

From
Thomas Munro
Date:
On Mon, Mar 21, 2016 at 3:46 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-03-21 01:31:30 +1300, Thomas Munro wrote:
>> I couldn't get the second patch to apply for some reason,
>
> Weird? Even after applying the first one first?

Ah, I was using patch -p1.  I needed to use git am, which understands
how to rename stuff.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: Performance degradation in commit ac1d794

From
Thomas Munro
Date:
On Mon, Mar 21, 2016 at 3:46 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-03-21 01:31:30 +1300, Thomas Munro wrote:
>> I couldn't get the second patch to apply for some reason,
>
> Weird? Even after applying the first one first?
>
>
>> but I have been trying out your "latch" branch on some different OSs
>> and porting some code that does a bunch of waiting on many sockets
>> over to this API to try it out.
>
>> One thing that I want to do but can't with this interface is remove an
>> fd from the set.
>
> Yea, I didn't see an in-core need for that, so I didn't want to add
> that yet.
>
> Other than that, how did things go?

So far, so good.  No surprises.

>> I can call AddWaitEventToSet, which returns a position, and I can
>> call ModifyWaitEvent to supply a new event mask by position (including
>> a zero mask), but I can't actually remove the fd (for example to avoid
>> future error events that can't be masked, to allow that fd to be
>> closed and perhaps let that fd number coincidentally be re-added
>> later, or just generally to free the slot).  There is an underlying
>> way to remove an fd from a set with poll (sort of), epoll, and kqueue.
>> (Not sure about Windows.  But surely...)  I wonder if there should be
>> a RemoveWaitEventFromSet(set, position) which recycles event slots,
>> sticking them on a freelist (and setting the corresponding pollfd
>> structs' fd to -1).
>
> I'm inclined to think that if we add this, we should memmove everything
> after the removed slot and adjust the offsets. As we e.g. pass the
> entire pollfd array to poll() at once, not having gaps will be more
> important.
>
> What's the use case where you need that?

I was experimenting with a change to Append so that it could deal with
asynchronous subplans, and in particular support asynchronous foreign
scans.  See the attached patch, which applies on top of your latch
branch (or your two patches above, I assume) and does that.  It is not
a very ambitious form of asynchrony and there are opportunities to do
much more in this area (about which more later in some future thread),
but it is some relevant educational code I had to hand, so I ported it
to your new API as a way to try the API out.

The contract that I invented here is that an async-aware parent node
can ask any child node "are you ready?" and get back various answers
including an fd, which means: please don't call ExecProcNode until
this fd is ready to read.  But after ExecProcNode is called, the fd
must not be accessed again (the subplan has the right to close it,
return a different one next time, etc.), so it must not appear in any
WaitEventSet waited on after that.  As you can see, in
append_next_async_wait, it therefore needs to create a new
WaitEventSet every time it needs to wait, which makes it feel more
like select() than epoll().  Ideally it'd have just one WaitEventSet
for the lifetime of the append node, adding and removing fds as
required.

>> I wonder if it would be useful to reallocate the arrays as needed so
>> there isn't really a hard limit to the number of things you add, just
>> an initial size.
>
> That doubles the number of palloc()s required. As lots of these sets
> are actually going to be very short-lived (a single WaitLatch call),
> that's something I'd rather avoid. Although I guess you could just
> allocate separate arrays at that moment.

On the subject of short-lived and frequent calls to WaitLatchOrSocket,
I wonder if there would be some benefit in reusing a statically
allocated WaitEventSet for that.  That would become possible if you
could add and remove the latch and socket as discussed, with an
opportunity to skip the modification work completely if the reusable
WaitEventSet already happens to have the right stuff in it from the
last WaitLatchOrSocket call.  Or maybe the hot wait loops should
simply be rewritten to reuse a WaitEventSet explicitly so they can
manage that...
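
A minimal sketch of that explicit-reuse pattern for a hot loop,
assuming the API in this thread (sock and want_write are illustrative
names):

WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3);
int         sock_pos;
WaitEvent   ev;

/* Build the set once, outside the hot loop. */
AddWaitEventToSet(set, WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
AddWaitEventToSet(set, WL_POSTMASTER_DEATH, PGINVALID_SOCKET, NULL, NULL);
sock_pos = AddWaitEventToSet(set, WL_SOCKET_READABLE, sock, NULL, NULL);

for (;;)
{
    /* Cheap no-op when the mask is unchanged from the last iteration. */
    ModifyWaitEvent(set, sock_pos,
                    want_write ? WL_SOCKET_WRITEABLE : WL_SOCKET_READABLE,
                    NULL);
    WaitEventSetWait(set, -1, &ev, 1);
    if (ev.events & WL_LATCH_SET)
        ResetLatch(MyLatch);
    /* ... handle the event, possibly flipping want_write ... */
}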

Some other assorted thoughts:

* It looks like epoll (and maybe kqueue?) can associate user data with
an event and give it back to you; if you have a mapping between fds
and some other object (in the case of the attached patch, subplan
nodes that are ready to be pulled), would it be possible and useful to
expose that (and simulate it where needed) rather than the caller
having to maintain associative data structures (see the linear search
in my patch)?

* I would be interested in writing a kqueue implementation of this for
*BSD (and MacOSX?) at some point if someone doesn't beat me to it.

* It would be very cool to get some visibility into which
WaitEventSetWait call a backend is waiting in, via a system view, if
it could be done cheaply enough.  A name/ID for the call site, and an
indication of which latch and how many fds...  or something.

--
Thomas Munro
http://www.enterprisedb.com

Attachment

Re: Performance degradation in commit ac1d794

From
Amit Kapila
Date:
On Sun, Mar 20, 2016 at 7:13 AM, Andres Freund <andres@anarazel.de> wrote:
>
> On 2016-03-19 15:43:27 +0530, Amit Kapila wrote:
> > On Sat, Mar 19, 2016 at 12:40 PM, Andres Freund <andres@anarazel.de> wrote:
> > >
> > > On March 18, 2016 11:52:08 PM PDT, Amit Kapila <amit.kapila16@gmail.com>
> > wrote:
> > > >> >Won't the new code need to ensure that ResetEvent(latchevent) should
> > > >> >get called in case WaitForMultipleObjects() comes out when both
> > > >> >pgwin32_signal_event and latchevent are signalled at the same time?
>
> > > >> WaitForMultipleObjects only reports the readiness of one event at a time, no?
> > > >>
> > > >
> > > >I don't think so; please read link [1] with a focus on the paragraph
> > > >below, which states how it reports the readiness or signaled state
> > > >when multiple objects become signaled.
> > > >
> > > >"When *bWaitAll* is *FALSE*, this function checks the handles in the
> > > >array
> > > >in order starting with index 0, until one of the objects is signaled.
> > > >If
> > > >multiple objects become signaled, the function returns the index of the
> > > >first handle in the array whose object was signaled."
>
> I think this is just incredibly bad documentation. See
> https://blogs.msdn.microsoft.com/oldnewthing/20150409-00/?p=44273
> (Raymond Chen can be considered an authority here imo).
>

The article you pointed to confirms that the way the patch does ResetEvent is correct.  I am not sure, but you may want to weigh whether a comment is needed, so that if we want to enhance this part of the code (or write something similar) in the future, we don't need to rediscover this fact.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Performance degradation in commit ac1d794

From
Andres Freund
Date:

On March 21, 2016 5:12:38 AM GMT+01:00, Amit Kapila <amit.kapila16@gmail.com> wrote:

>The article you pointed to confirms that the way the patch does
>ResetEvent is correct.  I am not sure, but you may want to weigh whether
>a comment is needed, so that if we want to enhance this part of the code
>(or write something similar) in the future, we don't need to rediscover
>this fact.

I've added a reference in a comment.

Did you have a chance to run the patched versions on Windows?

I plan to push this sometime today, so I can get on to some performance patches I was planning to look into
committing.

Andres

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: Performance degradation in commit ac1d794

From
Amit Kapila
Date:
On Mon, Mar 21, 2016 at 10:21 AM, Andres Freund <andres@anarazel.de> wrote:


On March 21, 2016 5:12:38 AM GMT+01:00, Amit Kapila <amit.kapila16@gmail.com> wrote:

>The article you pointed to confirms that the way the patch does
>ResetEvent is correct.  I am not sure, but you may want to weigh whether
>a comment is needed, so that if we want to enhance this part of the code
>(or write something similar) in the future, we don't need to rediscover
>this fact.

I've added a reference in a comment.

Did you have a chance to run the patched versions on Windows?


I am planning to do it in the next few hours.
 
I plan to push this sometime today, so I can get on to some performance patches I was planning to look into committing.


Have we done testing to ensure that it actually mitigates the performance degradation due to commit ac1d794?  I wanted to do that, but unfortunately the high-end machine on which this problem is reproducible is down.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Performance degradation in commit ac1d794

From
Andres Freund
Date:
Hi,


On 2016-03-21 11:52:43 +1300, Thomas Munro wrote:
> The contract that I invented here is that an async-aware parent node
> can ask any child node "are you ready?" and get back various answers
> including an fd, which means: please don't call ExecProcNode until
> this fd is ready to read.  But after ExecProcNode is called, the fd
> must not be accessed again (the subplan has the right to close it,
> return a different one next time, etc.), so it must not appear in any
> WaitEventSet waited on after that.

Hm, why did you choose to go with that contract? Why do children need to
switch fds and such?


> As you can see, in append_next_async_wait, it therefore needs to
> create a new WaitEventSet every time it needs to wait, which makes it
> feel more like select() than epoll().  Ideally it'd have just one
> WaitEventSet for the lifetime of the append node, adding and removing
> fds as required.

Yes, I see. I think that'd be less efficient than not manipulating the
set all the time, but still a lot more efficient than adding/removing
the postmaster death latch every round.

Anyway, I want to commit this ASAP, so I can get on to some patches in
the CF. But I'd welcome you playing with adding that ability.


> On the subject of short-lived and frequent calls to WaitLatchOrSocket,
> I wonder if there would be some benefit in reusing a statically
> allocated WaitEventSet for that.

Wondered the same.  It looks like there'd need to be a number of them
(just latch; latch + postmaster death) for it to be beneficial. Most
latch waits aren't that frequent though, so I've decided not to go
there initially.


> That would become possible if you
> could add and remove the latch and socket as discussed, with an
> opportunity to skip the modification work completely if the reusable
> WaitEventSet already happens to have the right stuff in it from the
> last WaitLatchOrSocket call.  Or maybe the hot wait loops should
> simply be rewritten to reuse a WaitEventSet explicitly so they can
> manage that...

Yea, I think that's better.


> * It looks like epoll (and maybe kqueue?) can associate user data with
> an event and give it back to you; if you have a mapping between fds
> and some other object (in the case of the attached patch, subplan
> nodes that are ready to be pulled), would it be possible and useful to
> expose that (and simulate it where needed) rather than the caller
> having to maintain associative data structures (see the linear search
> in my patch)?

We already use the pointer that epoll gives you. But note that the
'WaitEvent' struct already contains the original position the event
was registered at; that's filled in when you call WaitEventSetWait().
So you could just build an array indexed by ->pos, or we could add a
void *private or such to the definition.
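
For example (a sketch; subplan_by_pos and MAX_ASYNC_SUBPLANS are
invented names):

PlanState  *subplan_by_pos[MAX_ASYNC_SUBPLANS];
WaitEvent   occurred;
int         pos;

/* Remember which subplan lives at each registered position... */
pos = AddWaitEventToSet(set, WL_SOCKET_READABLE, fd, NULL, NULL);
subplan_by_pos[pos] = subplan;

/* ...then map a fired event straight back via ->pos, with no search. */
if (WaitEventSetWait(set, -1, &occurred, 1) == 1)
{
    PlanState  *ready = subplan_by_pos[occurred.pos];

    /* pull the next tuple from 'ready' */
}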


> * I would be interested in writing a kqueue implementation of this for
> *BSD (and MacOSX?) at some point if someone doesn't beat me to it.

I hoped that somebody would do that - that'd afaics be the only major
API missing.


> * It would be very cool to get some view into which WaitEventSetWait
> call backends are waiting in from a system view if it could be done
> cheaply enough.  A name/ID for the call site, and an indication of
> which latch and how many fds...  or something.

That's not something I plan to tackle today; we don't have wait event
integration for latches atm. There are some threads somewhere about
this, but that seems mostly independent of the way latches are
implemented.


Thanks,

Andres



Re: Performance degradation in commit ac1d794

From
Amit Kapila
Date:
On Mon, Mar 21, 2016 at 10:26 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Mar 21, 2016 at 10:21 AM, Andres Freund <andres@anarazel.de> wrote:
>>
>>
>>
>> On March 21, 2016 5:12:38 AM GMT+01:00, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> >The article you pointed to confirms that the way the patch does
>> >ResetEvent is correct.  I am not sure, but you may want to weigh
>> >whether a comment is needed, so that if we want to enhance this part
>> >of the code (or write something similar) in the future, we don't need
>> >to rediscover this fact.
>>
>> I've added a reference in a comment.
>>
>> Did you have a chance to run the patched versions on Windows?
>>
>
> I am planning to do it in the next few hours.
>

With 0002-Introduce-new-WaitEventSet-API applied, initdb is successful, but server start leads to the problem below:

LOG:  database system was shut down at 2016-03-21 11:17:13 IST
LOG:  MultiXact member wraparound protections are now enabled
LOG:  database system is ready to accept connections
LOG:  autovacuum launcher started
FATAL:  failed to set up event for socket: error code 10022
LOG:  statistics collector process (PID 668) exited with exit code 1
LOG:  autovacuum launcher process (PID 3868) was terminated by exception 0xC0000005
HINT:  See C include file "ntstatus.h" for a description of the hexadecimal value.

I haven't investigated the problem yet.



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Performance degradation in commit ac1d794

From
Andres Freund
Date:
Hi,

On 2016-03-21 11:26:43 +0530, Amit Kapila wrote:
> LOG:  database system was shut down at 2016-03-21 11:17:13 IST
> LOG:  MultiXact member wraparound protections are now enabled
> LOG:  database system is ready to accept connections
> LOG:  autovacuum launcher started
> FATAL:  failed to set up event for socket: error code 10022
> LOG:  statistics collector process (PID 668) exited with exit code 1
> LOG:  autovacuum launcher process (PID 3868) was terminated by exception
> 0xC0000005
> HINT:  See C include file "ntstatus.h" for a description of the hexadecimal
> value.

We resolved this and two follow-up issues over IM; all were localized
issues. Thanks again for the help!

I've attached two refreshed patches including the relevant fixes, and
the addition of a 'user_data' pointer to events, as desired by Thomas.

I plan to push these once some MSVC buildfarm animals turn green. We
might want to iterate on parts of the API, but this will fix the
regression, and imo it looks good.

Regards,

Andres

Attachment

Re: Performance degradation in commit ac1d794

From
Andres Freund
Date:
Hi,

I've pushed the new API. We might want to use it in more places...

On 2015-12-25 20:08:15 +0300, Васильев Дмитрий wrote:
> I suddenly found commit ac1d794 gives up to 3 times performance degradation.
> 
> I tried to run pgbench -s 1000 -j 48 -c 48 -S -M prepared on 70 CPU-core
> machine:
> commit ac1d794 gives me 363,474 tps
> and previous commit a05dc4d gives me 956,146
> and master( 3d0c50f ) with revert ac1d794 gives me 969,265

Could you please verify that the performance figures are good now? I've
tested this on a larger two-socket system, where I managed to find a
configuration showing the slowdown (-c64 -j4), and I confirmed the fix.

Regards,

Andres



Re: Performance degradation in commit ac1d794

From
Thomas Munro
Date:
On Mon, Mar 21, 2016 at 6:09 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-03-21 11:52:43 +1300, Thomas Munro wrote:
>> * I would be interested in writing a kqueue implementation of this for
>> *BSD (and MacOSX?) at some point if someone doesn't beat me to it.
>
> I hoped that somebody would do that - that'd afaics be the only major
> API missing.

Here's a first swing at it, though I need to find some spare time to
test it and figure out whether details like error conditions and EOF
are handled correctly.  It could in theory minimise the number of
syscalls it makes by buffering changes (additions and, hopefully one
day, removals) and then using a single kevent syscall to apply all
modifications to the set and begin waiting, but for now it mirrors the
epoll code.  One user-visible difference compared to epoll/poll/select
is that it delivers readable and writable events separately if both
conditions are true for a single fd.  It builds and appears to work
correctly on FreeBSD 10.2 and Mac OS X 10.10.2.  I'm sharing this early
version in case any BSD users have feedback.  I hope to do some testing
this weekend.
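
For reference, the heart of the registration step is small. Here is a
minimal sketch (not the attached patch's actual code; kqueue_fd is an
invented field, created once with kqueue(2)):

#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>

static void
WaitEventAdjustKqueue(WaitEventSet *set, WaitEvent *event)
{
    struct kevent ev[2];
    int         n = 0;

    /* Read and write interest are separate kqueue filters, which is
     * why both can be reported independently for a single fd. */
    if (event->events & WL_SOCKET_READABLE)
        EV_SET(&ev[n++], event->fd, EVFILT_READ, EV_ADD, 0, 0, event);
    if (event->events & WL_SOCKET_WRITEABLE)
        EV_SET(&ev[n++], event->fd, EVFILT_WRITE, EV_ADD, 0, 0, event);

    /* Apply the changes immediately; collect no events in this call. */
    if (kevent(set->kqueue_fd, ev, n, NULL, 0, NULL) < 0)
        elog(ERROR, "kevent() failed: %m");
}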

--
Thomas Munro
http://www.enterprisedb.com

Attachment

Re: Performance degradation in commit ac1d794

From
Noah Misch
Date:
On Fri, Dec 25, 2015 at 08:08:15PM +0300, Васильев Дмитрий wrote:
> I suddenly found commit ac1d794 gives up to 3 times performance degradation.
> 
> I tried to run pgbench -s 1000 -j 48 -c 48 -S -M prepared on 70 CPU-core
> machine:
> commit ac1d794 gives me 363,474 tps
> and previous commit a05dc4d gives me 956,146
> and master( 3d0c50f ) with revert ac1d794 gives me 969,265

[This is a generic notification.]

The above-described topic is currently a PostgreSQL 9.6 open item.  Robert,
since you committed the patch believed to have created it, you own this open
item.  If some other commit is more relevant or if this does not belong as a
9.6 open item, please let us know.  Otherwise, please observe the policy on
open item ownership[1] and send a status update within 72 hours of this
message.  Include a date for your subsequent status update.  Testers may
discover new open items at any time, and I want to plan to get them all fixed
well in advance of shipping 9.6rc1.  Consequently, I will appreciate your
efforts toward speedy resolution.  Thanks.

[1] http://www.postgresql.org/message-id/20160527025039.GA447393@tornado.leadboat.com



Re: Performance degradation in commit ac1d794

From
Robert Haas
Date:
On Sun, May 29, 2016 at 1:40 AM, Noah Misch <noah@leadboat.com> wrote:
> On Fri, Dec 25, 2015 at 08:08:15PM +0300, Васильев Дмитрий wrote:
>> I suddenly found commit ac1d794 gives up to 3 times performance degradation.
>>
>> I tried to run pgbench -s 1000 -j 48 -c 48 -S -M prepared on 70 CPU-core
>> machine:
>> commit ac1d794 gives me 363,474 tps
>> and previous commit a05dc4d gives me 956,146
>> and master( 3d0c50f ) with revert ac1d794 gives me 969,265
>
> [This is a generic notification.]
>
> The above-described topic is currently a PostgreSQL 9.6 open item.  Robert,
> since you committed the patch believed to have created it, you own this open
> item.  If some other commit is more relevant or if this does not belong as a
> 9.6 open item, please let us know.  Otherwise, please observe the policy on
> open item ownership[1] and send a status update within 72 hours of this
> message.  Include a date for your subsequent status update.  Testers may
> discover new open items at any time, and I want to plan to get them all fixed
> well in advance of shipping 9.6rc1.  Consequently, I will appreciate your
> efforts toward speedy resolution.  Thanks.

So, the reason this is back on the open items list is that Mithun Cy
re-reported this problem in:

https://www.postgresql.org/message-id/CAD__OuhPmc6XH=wYRm_+Q657yQE88DakN4=Ybh2oveFasHkoeA@mail.gmail.com

When I saw that, I moved this from CLOSE_WAIT back to open.  However,
subsequently, Ashutosh Sharma posted this, which suggests (not
conclusively) that in fact the problem has been fixed:

https://www.postgresql.org/message-id/CAE9k0PkFEhVq-Zg4MH0bZ-zt_oE5PAS6dAuxRCXwX9kEVWceag@mail.gmail.com

What I *think* is going on here is:

- ac1d794 lowered performance
- backend_flush_after with a non-zero default lowered performance with
a vengeance
- 98a64d0 repaired the damage done by ac1d794, or much of it, but
Mithun couldn't see it in his benchmarks because backend_flush_after>0
is so bad

That could be wrong, but I haven't seen any evidence that it's wrong.
So I'm inclined to say we should just move this open item back to the
CLOSE_WAIT list (adding a link to this email to explain why we did
so).  Does that work for you?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Performance degradation in commit ac1d794

From
Noah Misch
Date:
On Tue, May 31, 2016 at 10:09:05PM -0400, Robert Haas wrote:
> On Sun, May 29, 2016 at 1:40 AM, Noah Misch <noah@leadboat.com> wrote:
> > On Fri, Dec 25, 2015 at 08:08:15PM +0300, Васильев Дмитрий wrote:
> >> I suddenly found commit ac1d794 gives up to 3 times performance degradation.
> >>
> >> I tried to run pgbench -s 1000 -j 48 -c 48 -S -M prepared on 70 CPU-core
> >> machine:
> >> commit ac1d794 gives me 363,474 tps
> >> and previous commit a05dc4d gives me 956,146
> >> and master( 3d0c50f ) with revert ac1d794 gives me 969,265
> >
> > [This is a generic notification.]
> >
> > The above-described topic is currently a PostgreSQL 9.6 open item.  Robert,
> > since you committed the patch believed to have created it, you own this open
> > item.  If some other commit is more relevant or if this does not belong as a
> > 9.6 open item, please let us know.  Otherwise, please observe the policy on
> > open item ownership[1] and send a status update within 72 hours of this
> > message.  Include a date for your subsequent status update.  Testers may
> > discover new open items at any time, and I want to plan to get them all fixed
> > well in advance of shipping 9.6rc1.  Consequently, I will appreciate your
> > efforts toward speedy resolution.  Thanks.
> 
> So, the reason this is back on the open items list is that Mithun Cy
> re-reported this problem in:
> 
> https://www.postgresql.org/message-id/CAD__OuhPmc6XH=wYRm_+Q657yQE88DakN4=Ybh2oveFasHkoeA@mail.gmail.com
> 
> When I saw that, I moved this from CLOSE_WAIT back to open.  However,
> subsequently, Ashutosh Sharma posted this, which suggests (not
> conclusively) that in fact the problem has been fixed:
> 
> https://www.postgresql.org/message-id/CAE9k0PkFEhVq-Zg4MH0bZ-zt_oE5PAS6dAuxRCXwX9kEVWceag@mail.gmail.com
> 
> What I *think* is going on here is:
> 
> - ac1d794 lowered performance
> - backend_flush_after with a non-zero default lowered performance with
> a vengeance
> - 98a64d0 repaired the damage done by ac1d794, or much of it, but
> Mithun couldn't see it in his benchmarks because backend_flush_after>0
> is so bad

Ashutosh Sharma's measurements do bolster that conclusion.

> That could be wrong, but I haven't seen any evidence that it's wrong.
> So I'm inclined to say we should just move this open item back to the
> CLOSE_WAIT list (adding a link to this email to explain why we did
> so).  Does that work for you?

That works for me.  



Re: Performance degradation in commit ac1d794

From
Tom Lane
Date:
Noah Misch <noah@leadboat.com> writes:
> On Tue, May 31, 2016 at 10:09:05PM -0400, Robert Haas wrote:
>> What I *think* is going on here is:
>> - ac1d794 lowered performance
>> - backend_flush_after with a non-zero default lowered performance with
>> a vengeance
>> - 98a64d0 repaired the damage done by ac1d794, or much of it, but
>> Mithun couldn't see it in his benchmarks because backend_flush_after>0
>> is so bad

> Ashutosh Sharma's measurements do bolster that conclusion.

>> That could be wrong, but I haven't seen any evidence that it's wrong.
>> So I'm inclined to say we should just move this open item back to the
>> CLOSE_WAIT list (adding a link to this email to explain why we did
>> so).  Does that work for you?

> That works for me.  

Can we make a note to re-examine this after the backend_flush_after
business is resolved?  Or at least get Mithun to redo his benchmarks
with backend_flush_after set to zero?
        regards, tom lane



Re: Performance degradation in commit ac1d794

From
Robert Haas
Date:
On Tue, May 31, 2016 at 10:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Noah Misch <noah@leadboat.com> writes:
>> On Tue, May 31, 2016 at 10:09:05PM -0400, Robert Haas wrote:
>>> What I *think* is going on here is:
>>> - ac1d794 lowered performance
>>> - backend_flush_after with a non-zero default lowered performance with
>>> a vengeance
>>> - 98a64d0 repaired the damage done by ac1d794, or much of it, but
>>> Mithun couldn't see it in his benchmarks because backend_flush_after>0
>>> is so bad
>
>> Ashutosh Sharma's measurements do bolster that conclusion.
>
>>> That could be wrong, but I haven't seen any evidence that it's wrong.
>>> So I'm inclined to say we should just move this open item back to the
>>> CLOSE_WAIT list (adding a link to this email to explain why we did
>>> so).  Does that work for you?
>
>> That works for me.
>
> Can we make a note to re-examine this after the backend_flush_after
> business is resolved?  Or at least get Mithun to redo his benchmarks
> with backend_flush_after set to zero?

Ashutosh Sharma already did pretty much that test in the email to
which I linked.

(Ashutosh Sharma and Mithun CY work in the same office and have access
to the same hardware.)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company