Thread: Performance degradation in commit ac1d794
Hello hackers!
I suddenly found commit ac1d794 gives up to 3 times performance degradation.
I tried to run pgbench -s 1000 -j 48 -c 48 -S -M prepared on 70 CPU-core machine:
commit ac1d794 gives me 363,474 tps
and previous commit a05dc4d gives me 956,146
and master( 3d0c50f ) with revert ac1d794 gives me 969,265
it's shocking
On December 25, 2015 6:08:15 PM GMT+01:00, "Васильев Дмитрий" <d.vasilyev@postgrespro.ru> wrote:
>Hello hackers!
>
>I suddenly found commit ac1d794 gives up to 3 times performance
>degradation.
>
>I tried to run pgbench -s 1000 -j 48 -c 48 -S -M prepared on 70
>CPU-core
>machine:
>commit ac1d794 gives me 363,474 tps
>and previous commit a05dc4d gives me 956,146
>and master( 3d0c50f ) with revert ac1d794 gives me 969,265
>
>it's shocking

You're talking about http://git.postgresql.org/gitweb/?p=postgresql.git;a=blobdiff;f=src/backend/libpq/be-secure.c;h=2ddcf428f89fd12c230d6f417c2f707fbd97bf39;hp=26d8faaf773a818b388b899b8d83d617bdf7af9b;hb=ac1d794;hpb=a05dc4d7fd57d4ae084c1f0801973e5c1a1aa26e

If so, could you provide a hierarchical before/after profile?
Andres
Hi,
---
Please excuse brevity and formatting - I am writing this on my mobile phone.
Yes, about this.
>If so, could you provide a hierarchical before/after profile?
Performance | Git hash commit | Date
~ 360k tps | c3e7c24a1d60dc6ad56e2a0723399f1570c54224 | Thu Nov 12 09:12:18 2015 -0500
~ 360k tps | ac1d7945f866b1928c2554c0f80fd52d7f977772 | Thu Nov 12 09:00:33 2015 -0500
~ 960k tps | a05dc4d7fd57d4ae084c1f0801973e5c1a1aa26e | Thu Nov 12 07:40:31 2015 -0500
---
Dmitry Vasilyev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
I came across it by studying the results:
vanilla 9.5 = 30020c3fc3b6de5592978313df929d370f5770ce
vanilla 9.6 = c4a8812cf64b142685e39a69694c5276601f40e4
info                   | clients | tps
-----------------------+---------+---------
vanilla 9.5 | 1 | 30321
vanilla 9.5 | 8 | 216542
vanilla 9.5 | 16 | 412526
vanilla 9.5 | 32 | 751331
vanilla 9.5 | 48 | 956146    <- this point
vanilla 9.5 | 56 | 990122
vanilla 9.5 | 64 | 842436
vanilla 9.5 | 72 | 913272
vanilla 9.5 | 82 | 659332
vanilla 9.5 | 92 | 630111
vanilla 9.5 | 96 | 616863
vanilla 9.5 | 110 | 592080
vanilla 9.5 | 120 | 575831
vanilla 9.5 | 130 | 557521
vanilla 9.5 | 140 | 537951
vanilla 9.5 | 150 | 517019
vanilla 9.5 | 160 | 502312
vanilla 9.5 | 170 | 489162
vanilla 9.5 | 180 | 477178
vanilla 9.5 | 190 | 464620
vanilla 9.6 | 1 | 31738
vanilla 9.6 | 8 | 219692
vanilla 9.6 | 16 | 422933
vanilla 9.6 | 32 | 375546
vanilla 9.6 | 48 | 363474    <- this point
vanilla 9.6 | 56 | 352943
vanilla 9.6 | 64 | 334498
vanilla 9.6 | 72 | 369802
vanilla 9.6 | 82 | 604867
vanilla 9.6 | 92 | 871048
vanilla 9.6 | 96 | 969265
vanilla 9.6 | 105 | 996794
vanilla 9.6 | 110 | 932853
vanilla 9.6 | 115 | 758485
vanilla 9.6 | 120 | 721365
vanilla 9.6 | 125 | 632265
vanilla 9.6 | 130 | 624666
vanilla 9.6 | 135 | 582120
vanilla 9.6 | 140 | 583080
vanilla 9.6 | 150 | 555608
vanilla 9.6 | 160 | 533340
vanilla 9.6 | 170 | 520308
vanilla 9.6 | 180 | 504536
vanilla 9.6 | 190 | 496967
2015-12-25 20:44 GMT+03:00 Andres Freund <andres@anarazel.de>:
On December 25, 2015 6:27:06 PM GMT+01:00, "Васильев Дмитрий" wrote:
>>If so, could you provide a hierarchical before/after profile?
>
>Performance | Git hash commit | Date
>~ 360k tps | c3e7c24a1d60dc6ad56e2a0723399f1570c54224 | Thu Nov 12
>09:12:18
>2015 -0500
>~ 360k tps | ac1d7945f866b1928c2554c0f80fd52d7f977772 | Thu Nov 12
>09:00:33 2015 -0500
>~ 960k tps | a05dc4d7fd57d4ae084c1f0801973e5c1a1aa26e | Thu Nov 12
>07:40:31 2015 -0500
Profile as in perf oprofile or something.
---
Please excuse brevity and formatting - I am writing this on my mobile phone.
ac1d794:
Samples: 1M of event 'cycles', Event count (approx.): 816922259995, UID: pgpro
Overhead Shared Object Symbol
69,72% [kernel] [k] _raw_spin_lock_irqsave
1,43% postgres [.] _bt_compare
1,19% postgres [.] LWLockAcquire
0,99% postgres [.] hash_search_with_hash_value
0,61% postgres [.] PinBuffer
0,46% postgres [.] GetSnapshotData
a05dc4d:
Samples: 1M of event 'cycles', Event count (approx.): 508150718694, UID: pgpro
Overhead Shared Object Symbol
4,77% postgres [.] GetSnapshotData
4,30% postgres [.] _bt_compare
3,13% postgres [.] hash_search_with_hash_value
3,08% postgres [.] LWLockAcquire
2,09% postgres [.] LWLockRelease
2,03% postgres [.] PinBuffer
perf record generates a lot of data:
time perf record -u pgpro -g --call-graph=dwarf
^C[ perf record: Woken up 0 times to write data ]
Warning:
Processed 1078453 events and lost 18257 chunks!
Check IO/CPU overload!
[ perf record: Captured and wrote 8507.985 MB perf.data (1055663 samples) ]
real 0m8.791s
user 0m0.678s
sys 0m8.120s
If you want, I can give you ssh access.
Васильев Дмитрий <d.vasilyev@postgrespro.ru> writes:
> Samples: 1M of event 'cycles', Event count (approx.): 816922259995, UID: pgpro
> Overhead Shared Object Symbol
> 69,72% [kernel] [k] _raw_spin_lock_irqsave
> 1,43% postgres [.] _bt_compare
> 1,19% postgres [.] LWLockAcquire
> 0,99% postgres [.] hash_search_with_hash_value
> 0,61% postgres [.] PinBuffer

Seems like what you've got here is a kernel bug.

regards, tom lane
On December 25, 2015 7:10:23 PM GMT+01:00, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>Васильев Дмитрий <d.vasilyev@postgrespro.ru> writes:
>> Samples: 1M of event 'cycles', Event count (approx.): 816922259995, UID: pgpro
>> Overhead Shared Object Symbol
>> 69,72% [kernel] [k] _raw_spin_lock_irqsave
>> 1,43% postgres [.] _bt_compare
>> 1,19% postgres [.] LWLockAcquire
>> 0,99% postgres [.] hash_search_with_hash_value
>> 0,61% postgres [.] PinBuffer
>
>Seems like what you've got here is a kernel bug.

I wouldn't go as far as calling it a kernel bug. We're still doing 300k tps. And we're triggering the performance degradation by adding another socket (IIRC) to the poll(2) call.

It'd certainly be interesting to see the expanded tree below the spinlock. I wonder if this is related to directed wakeups.

Andres

---
Please excuse brevity and formatting - I am writing this on my mobile phone.
2015-12-25 21:28 GMT+03:00 Tom Lane <tgl@sss.pgh.pa.us>:
Andres Freund <andres@anarazel.de> writes:
> On December 25, 2015 7:10:23 PM GMT+01:00, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Seems like what you've got here is a kernel bug.
> I wouldn't go as far as calling it a kernel bug. We're still doing 300k tps. And we're triggering the performance degradation by adding another socket (IIRC) to the poll(2) call.
Hmm. And all those FDs point to the same pipe. I wonder if we're looking
at contention for some pipe-related data structure inside the kernel.
regards, tom lane
I did bt on backends and found it in following state:
#0 0x00007f77b0e5bb60 in __poll_nocancel () from /lib64/libc.so.6
#1 0x00000000006a7cd0 in WaitLatchOrSocket (latch=0x7f779e2e96c4, wakeEvents=wakeEvents@entry=19, sock=9, timeout=timeout@entry=0) at pg_latch.c:333
#2 0x0000000000612c7d in secure_read (port=0x17e6af0, ptr=0xcc94a0 <PqRecvBuffer>, len=8192) at be-secure.c:147
#3 0x000000000061be36 in pq_recvbuf () at pqcomm.c:915
#4 pq_getbyte () at pqcomm.c:958
#5 0x0000000000728ad5 in SocketBackend (inBuf=0x7ffd8b6b1460) at postgres.c:345
Perf shows _raw_spin_lock_irqsave call remove_wait_queue add_wait_queue
There’s screenshots: http://i.imgur.com/pux2bGJ.png http://i.imgur.com/LJQbm2V.png
I'm sorry, I meant that the remove_wait_queue and add_wait_queue functions call _raw_spin_lock_irqsave, which is what takes up most of the CPU time.
uname -a: 3.10.0-229.20.1.el7.x86_64 #1 SMP Tue Nov 3 19:10:07 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
On 2015-12-25 13:28:55 -0500, Tom Lane wrote:
> Hmm. And all those FDs point to the same pipe. I wonder if we're looking
> at contention for some pipe-related data structure inside the kernel.

Sounds fairly likely - and not too surprising. In this scenario we've got a couple 100k registrations/unregistrations to a pretty fundamentally shared resource (the wait queue for changes to the pipe). Not that surprising that it becomes a problem.

There's a couple of solutions I can think of to that problem:

1) Use epoll()/kqueue, or other similar interfaces that don't require re-registering fds at every invocation. My guess is that that'd be desirable for performance anyway.

2) Create a pair of fds between postmaster/backend for each backend. While obviously increasing the number of FDs noticeably, it's interesting for other features as well: If we ever want to do FD passing from postmaster to existing backends, we're going to need that anyway.

3) Replace the postmaster_alive_fds socketpair by some other signalling mechanism. E.g. sending a procsignal to each backend, which sets the latch and a special flag in the latch structure.

Andres
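[Editorial note: option 1 is what the later patches in this thread pursue. As a rough illustration of why it helps, and not code from any posted patch (all names below are invented), an epoll-based wait registers the postmaster-death pipe once instead of re-adding it to the pipe's wait queue on every wait:]

    #include <sys/epoll.h>

    static int latch_epoll_fd = -1;

    /* one-time setup, e.g. at backend start; the fds here are placeholders */
    static void
    setup_wait_set(int selfpipe_read_fd, int postmaster_death_fd)
    {
        struct epoll_event ev;

        latch_epoll_fd = epoll_create1(0);

        ev.events = EPOLLIN;
        ev.data.fd = selfpipe_read_fd;
        epoll_ctl(latch_epoll_fd, EPOLL_CTL_ADD, selfpipe_read_fd, &ev);

        /* postmaster death shows up as readability/EPOLLHUP on the pipe */
        ev.events = EPOLLIN;
        ev.data.fd = postmaster_death_fd;
        epoll_ctl(latch_epoll_fd, EPOLL_CTL_ADD, postmaster_death_fd, &ev);
    }

    /* per-wait: no re-registration, unlike poll(), which rebuilds its fd set each call */
    static int
    wait_once(int timeout_ms)
    {
        struct epoll_event events[4];

        return epoll_wait(latch_epoll_fd, events, 4, timeout_ms);
    }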
Andres Freund <andres@anarazel.de> writes:
> There's a couple solutions I can think of to that problem:
> 1) Use epoll()/kqueue, or other similar interfaces that don't require
> re-registering fds at every invocation. My guess is that that'd be
> desirable for performance anyway.

Portability, on the other hand, would be problematic.

> 2) Create a pair of fds between postmaster/backend for each
> backend. While obviously increasing the number of FDs noticeably,
> it's interesting for other features as well: If we ever want to do FD
> passing from postmaster to existing backends, we're going to need
> that anyway.

Maybe; it'd provide another limit on how many backends we could run.

> 3) Replace the postmaster_alive_fds socketpair by some other signalling
> mechanism. E.g. sending a procsignal to each backend, which sets the
> latch and a special flag in the latch structure.

And what would send the signal? The entire point here is to notice the situation where the postmaster has crashed. It can *not* depend on the postmaster taking some action.

regards, tom lane
On 2015-12-25 16:29:53 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > There's a couple solutions I can think of to that problem:
> > 1) Use epoll()/kqueue, or other similar interfaces that don't require
> > re-registering fds at every invocation. My guess is that that'd be
> > desirable for performance anyway.
>
> Portability, on the other hand, would be problematic.

Indeed. But we might be able to get away with it because there's realistically just one platform on which people run four socket servers. Obviously we'd leave poll and select support in place. It'd be a genuine improvement for less extreme loads on linux, too.

> > 3) Replace the postmaster_alive_fds socketpair by some other signalling
> > mechanism. E.g. sending a procsignal to each backend, which sets the
> > latch and a special flag in the latch structure.
>
> And what would send the signal? The entire point here is to notice the
> situation where the postmaster has crashed. It can *not* depend on the
> postmaster taking some action.

Ahem. Um. Look, over there --->

I blame it on all the food.

Andres
On 2015-12-26 12:22:48 +0100, Andres Freund wrote:
> > > 3) Replace the postmaster_alive_fds socketpair by some other signalling
> > > mechanism. E.g. sending a procsignal to each backend, which sets the
> > > latch and a special flag in the latch structure.
> >
> > And what would send the signal? The entire point here is to notice the
> > situation where the postmaster has crashed. It can *not* depend on the
> > postmaster taking some action.
>
> Ahem. Um. Look, over there --->
>
> I blame it on all the food.

A unportable and easy version of this, actually making sense this time, would be to use prctl(PR_SET_PDEATHSIG, SIGQUIT). That'd send SIGQUIT to backends whenever postmaster dies. Obviously that's not portable either - doing this for linux only wouldn't be all that kludgey tho.
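[Editorial note: for reference, the prctl() idea would look roughly like the following in a freshly forked child. This is only a sketch, not PostgreSQL code; note the usual race that the parent may already have exited before prctl() runs, which has to be checked separately.]

    #include <sys/prctl.h>
    #include <signal.h>
    #include <sys/types.h>
    #include <unistd.h>

    /*
     * Sketch: after fork(), ask the kernel to deliver SIGQUIT to this child
     * when its parent (the postmaster) exits.
     */
    static void
    arm_parent_death_signal(pid_t postmaster_pid)
    {
        prctl(PR_SET_PDEATHSIG, SIGQUIT);

        /* parent might have died between fork() and prctl() */
        if (getppid() != postmaster_pid)
            raise(SIGQUIT);
    }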
Andres Freund <andres@anarazel.de> writes:
> A unportable and easy version of this, actually making sense this time,
> would be to use prctl(PR_SET_PDEATHSIG, SIGQUIT). That'd send SIGQUIT to
> backends whenever postmaster dies. Obviously that's not portable
> either - doing this for linux only wouldn't be all that kludgey tho.

Hmm. That would have semantics rather substantially different from the way that the WL_POSTMASTER_DEATH code behaves. But I don't know how much we care about that, since the whole scenario is something that should not happen under normal circumstances. Maybe cross-platform variation is OK as long as it doesn't make the code too hairy.

regards, tom lane
On Sat, Dec 26, 2015 at 5:41 PM, Andres Freund <andres@anarazel.de> wrote:
On 2015-12-26 12:22:48 +0100, Andres Freund wrote:
> > > 3) Replace the postmaster_alive_fds socketpair by some other signalling
> > > mechanism. E.g. sending a procsignal to each backend, which sets the
> > > latch and a special flag in the latch structure.
> >
> > And what would send the signal? The entire point here is to notice the
> > situation where the postmaster has crashed. It can *not* depend on the
> > postmaster taking some action.
>
> Ahem. Um. Look, over there --->
>
> I blame it on all the food.
A unportable and easy version of this, actually making sense this time,
would be to use prctl(PR_SET_PDEATHSIG, SIGQUIT). That'd send SIGQUIT to
backends whenever postmaster dies. Obviously that's not portable
either - doing this for linux only wouldn't be all that kludgey tho.
There is a way to make backends exit on Windows as well, by using Job Objects and setting LimitFlags to JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE in JOBOBJECT_BASIC_LIMIT_INFORMATION.
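[Editorial note: a sketch of that Windows approach, not code from PostgreSQL and with error handling omitted: the postmaster would create a job object with the kill-on-close limit and assign each spawned backend to it, so the children are terminated when the postmaster's last handle to the job goes away.]

    #include <windows.h>

    static HANDLE
    create_kill_on_close_job(void)
    {
        HANDLE  job = CreateJobObject(NULL, NULL);
        JOBOBJECT_BASIC_LIMIT_INFORMATION info;

        ZeroMemory(&info, sizeof(info));
        info.LimitFlags = JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE;
        SetInformationJobObject(job, JobObjectBasicLimitInformation,
                                &info, sizeof(info));
        return job;
    }

    /* after CreateProcess() for a backend, the child would be attached with: */
    /*     AssignProcessToJobObject(job, process_info.hProcess); */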
On 2015-12-26 12:22:48 +0100, Andres Freund wrote:
> On 2015-12-25 16:29:53 -0500, Tom Lane wrote:
> > Andres Freund <andres@anarazel.de> writes:
> > > There's a couple solutions I can think of to that problem:
> > > 1) Use epoll()/kqueue, or other similar interfaces that don't require
> > > re-registering fds at every invocation. My guess is that that'd be
> > > desirable for performance anyway.
> >
> > Portability, on the other hand, would be problematic.
>
> Indeed. But we might be able to get away with it because there's
> realistically just one platform on which people run four socket
> servers. Obviously we'd leave poll and select support in place. It'd be
> a genuine improvement for less extreme loads on linux, too.

I finally got back to working on this. Attached is a WIP patch series implementing:

0001: Allow to easily choose between the readiness primitives in unix_latch.c
      Pretty helpful for testing, not useful for anything else.
0002: Error out if waiting on socket readiness without a specified socket.
0003: Only clear unix_latch.c's self-pipe if it actually contains data.
      ~2% on high qps workloads
0004: Support using epoll as the polling primitive in unix_latch.c.
      ~3% on high qps workloads, massive scalability improvements (x3)
      on very large machines.

With 0004 obviously being the relevant bit for this thread. I verified that using epoll addresses the performance problem, using the hardware the OP noticed the performance problem on.

The reason I went with using epoll over the PR_SET_PDEATHSIG approach is that it provides semantics that are more similar to the other platforms, while being just as platform dependant as PR_SET_PDEATHSIG. It also is actually measurably faster, at least here.

0004 currently contains one debatable optimization, which I'd like to discuss: Currently the 'sock' passed to WaitLatchOrSocket is not removed/added to the epoll fd, if it's the numerically same as in the last call. That's good for performance, but would be wrong if the socket were closed and a new one with the same value would be waited on. I think a big warning sign somewhere is sufficient to deal with that problem - it's not something we're likely to start doing. And even if it's done at some point, we can just offer an API to reset the last used socket fd.

Unless somebody comes up with a platform independent way of addressing this, I'm inclined to press forward using epoll(). Opinions?

Andres
Andres Freund <andres@anarazel.de> writes:
> 0004 currently contains one debatable optimization, which I'd like to
> discuss: Currently the 'sock' passed to WaitLatchOrSocket is not
> removed/added to the epoll fd, if it's the numerically same as in the
> last call. That's good for performance, but would be wrong if the socket
> were closed and a new one with the same value would be waited on. I
> think a big warning sign somewhere is sufficient to deal with that
> problem - it's not something we're likely to start doing. And even if
> it's done at some point, we can just offer an API to reset the last used
> socket fd.

Perhaps a cleaner API solution would be to remove the socket argument per se from the function altogether, instead providing a separate SetSocketToWaitOn() call. (Also, if there is a need for it, we could provide a function that still takes a socket argument, with the understanding that it's to be used for short-lived sockets where you don't want to change the process's main epoll state.)

regards, tom lane
On Thu, Jan 14, 2016 at 9:39 AM, Andres Freund <andres@anarazel.de> wrote:
> I finally got back to working on this. Attached is a WIP patch series
> implementing:
> 0001: Allow to easily choose between the readiness primitives in unix_latch.c
> Pretty helpful for testing, not useful for anything else.

Looks good.

> 0002: Error out if waiting on socket readiness without a specified socket.

Looks good.

> 0003: Only clear unix_latch.c's self-pipe if it actually contains data.
> ~2% on high qps workloads

everytime -> every time

+ if ((wakeEvents & WL_LATCH_SET) && latch->is_set)
+ {
+     result |= WL_LATCH_SET;
+ }

Excess braces.

Doesn't this code make it possible for the self-pipe to fill up, self-deadlocking the process? Suppose we repeatedly enter WaitLatchOrSocket(). Each time we do, just after waiting = true is set, somebody sets the latch. We handle the signal and put a byte into the pipe. Returning from the signal handler, we then notice that is_set is true and return at once, without draining the pipe. Repeat until something bad happens.

> 0004: Support using epoll as the polling primitive in unix_latch.c.
> ~3% on high qps workloads, massive scalability improvements (x3)
> on very large machines.
>
> With 0004 obviously being the relevant bit for this thread. I verified
> that using epoll addresses the performance problem, using the hardware
> the OP noticed the performance problem on.

+ /*
+  * Unnecessarily pass data for delete due to bug errorneously
+  * requiring it in the past.
+  */

This is pretty vague. And it has a spelling mistake. Further down, nodified -> notified.

+ if (wakeEvents != latch->lastmask || latch->lastwatchfd != sock)

I don't like this very much. I think it's a bad idea to test latch->lastwatchfd != sock. That has an excellent chance of letting people write code that appears to work but then doesn't. I think it would be better, if we're going to change the API contract, to make it a hard break, as I see Tom has also suggested while I've been writing this.

Incidentally, if we're going to whack around the latch API, it would be nice to pick a design which wouldn't be too hard to extend to waiting on multiple sockets. The application I have in mind is to send off queries to several foreign servers at once and then wait until bytes come back from any of them. It's mostly pie in the sky at this point, but it seems highly likely to me that we'd want to do such a thing by waiting for bytes from any of the sockets involved OR a latch event.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> Incidentally, if we're going to whack around the latch API, it would
> be nice to pick a design which wouldn't be too hard to extend to
> waiting on multiple sockets. The application I have in mind is to
> send off queries to several foreign servers at once and then wait until
> bytes come back from any of them. It's mostly pie in the sky at this
> point, but it seems highly likely to me that we'd want to do such a
> thing by waiting for bytes from any of the sockets involved OR a latch
> event.

Instead of SetSocketToWaitOn, maybe AddSocketToWaitSet and RemoveSocketFromWaitSet? And you'd need some way of identifying which socket came ready after a wait call...

regards, tom lane
On 2016-01-14 10:39:55 -0500, Robert Haas wrote:
> Doesn't this code make it possible for the self-pipe to fill up,
> self-deadlocking the process? Suppose we repeatedly enter
> WaitLatchOrSocket(). Each time we do, just after waiting = true is
> set, somebody sets the latch. We handle the signal and put a byte
> into the pipe. Returning from the signal handler, we then notice that
> is_set is true and return at once, without draining the pipe. Repeat
> until something bad happens.

Should be fine because the self-pipe is marked as non-blocking

    if (fcntl(pipefd[1], F_SETFL, O_NONBLOCK) < 0)
        elog(FATAL, "fcntl() failed on write-end of self-pipe: %m");

and sendSelfPipeByte accepts the blocking case as success

    /*
     * If the pipe is full, we don't need to retry, the data that's there
     * already is enough to wake up WaitLatch.
     */
    if (errno == EAGAIN || errno == EWOULDBLOCK)
        return;

> > 0004: Support using epoll as the polling primitive in unix_latch.c.
> > ~3% on high qps workloads, massive scalability improvements (x3)
> > on very large machines.
> >
> > With 0004 obviously being the relevant bit for this thread. I verified
> > that using epoll addresses the performance problem, using the hardware
> > the OP noticed the performance problem on.
>
> + /*
> +  * Unnecessarily pass data for delete due to bug errorneously
> +  * requiring it in the past.
> +  */
>
> This is pretty vague. And it has a spelling mistake.

Will add a reference to the manpage (where that requirement is coming from).

> Further down, nodified -> notified.
>
> + if (wakeEvents != latch->lastmask || latch->lastwatchfd != sock)
>
> I don't like this very much.

Yea, me neither, which is why I called it out... I think it's not too likely to cause problems in practice though. But I think changing the API makes sense, so the likelihood shouldn't be a relevant issue.

> Incidentally, if we're going to whack around the latch API, it would
> be nice to pick a design which wouldn't be too hard to extend to
> waiting on multiple sockets.

Hm. That seems likely to make usage harder for users of the API. So it seems like it'd make sense to provide a simpler version anyway, for the majority of users.

So, I'm wondering how we'd exactly use a hypothetical SetSocketToWaitOn, or SetSocketsToWaitOn (or whatever). I mean it can make a fair bit of sense to sometimes wait on MyLatch/port->sock and sometimes on MyLatch/fdw connections. The simple proposed code would change the epoll set whenever switching between both, but with SetSocketsToWaitOn you'd probably end up switching this much more often?

One way to address that would be to create a 'latch wait' datastructure, that'd then contain the epoll fd/win32 wait events/... That way you could have one 'LatchWait' for latch + client socket and one for latch + fdw sockets.

Greetings,

Andres Freund
On Thu, Jan 14, 2016 at 10:46 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> Incidentally, if we're going to whack around the latch API, it would
>> be nice to pick a design which wouldn't be too hard to extend to
>> waiting on multiple sockets. The application I have in mind is to
>> send off queries to several foreign servers at once and then wait until
>> bytes come back from any of them. It's mostly pie in the sky at this
>> point, but it seems highly likely to me that we'd want to do such a
>> thing by waiting for bytes from any of the sockets involved OR a latch
>> event.
>
> Instead of SetSocketToWaitOn, maybe AddSocketToWaitSet and
> RemoveSocketFromWaitSet? And you'd need some way of identifying
> which socket came ready after a wait call...

Yeah. Although I think for now it would be fine to just error out if somebody tries to add a socket and there already is one. Then we could lift that limitation in a later commit. Of course if Andres wants to do the whole thing now I'm not going to get in the way, but since that will require Windows tinkering and so on it may be more than he wants to dive into.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On January 14, 2016 5:16:59 PM GMT+01:00, Robert Haas <robertmhaas@gmail.com> wrote:
>Yeah. Although I think for now it would be fine to just error out if
>somebody tries to add a socket and there already is one. Then we
>could lift that limitation in a later commit. Of course if Andres
>wants to do the whole thing now I'm not going to get in the way, but
>since that will require Windows tinkering and so on it may be more
>than he wants to dive into.

Yea, I don't want to do anything really large at the moment. My primary interest is fixing the major performance regression.

Andres

---
Please excuse brevity and formatting - I am writing this on my mobile phone.
On Thu, Jan 14, 2016 at 10:56 AM, Andres Freund <andres@anarazel.de> wrote:
> So, I'm wondering how we'd exactly use a hypothetical
> SetSocketToWaitOn, or SetSocketsToWaitOn (or whatever). I mean it can
> make a fair bit of sense to sometimes wait on MyLatch/port->sock and
> sometimes on MyLatch/fdw connections. The simple proposed code would
> change the epoll set whenever switching between both, but with
> SetSocketsToWaitOn you'd probably end up switching this much more often?
>
> One way to address that would be to create a 'latch wait' datastructure,
> that'd then contain the epoll fd/win32 wait events/... That way you
> could have one 'LatchWait' for latch + client socket and one for latch +
> fdw sockets.

I see your point. As far as I can see, it's currently true that, right now, the only places where we wait for a socket are places where the socket will live for the lifetime of the backend, but I think we should regard it as likely that, in the future, we'll want to use it anywhere we want to wait for a socket to become ready. There are getting to be a lot of places where we need to unstick some loop whenever the process latch gets set, and it seems likely to me that needs will only continue to grow. So the API should probably contemplate that sort of need.

I think your idea of a data structure that encapsulates a set of events for which to wait is probably a good one. WaitLatch doesn't seem like a great name. Maybe WaitEventSet, and then we can have WaitLatch(&latch) and WaitEvents(&eventset).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2016-01-14 11:31:03 -0500, Robert Haas wrote:
> On Thu, Jan 14, 2016 at 10:56 AM, Andres Freund <andres@anarazel.de> wrote:
> I think your idea of a data structure that encapsulates a set of events
> for which to wait is probably a good one. WaitLatch doesn't seem like
> a great name. Maybe WaitEventSet, and then we can have
> WaitLatch(&latch) and WaitEvents(&eventset).

Hm, I'd like to have latch in the name. It seems far from improbable to have another wait data structure. LatchEventSet maybe? The wait would be implied by WaitLatch.

So effectively we'd create a LatchEventSet feLatchSet; somewhere global (and update it from a backend local to the proc latch in SwitchToSharedLatch/SwitchBackToLocalLatch()). Then change all WaitLatch calls to refer to those.

Do we want to provide a backward compatible API for all this? I'm fine either way.

Andres
On Thu, Jan 14, 2016 at 12:06 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-01-14 11:31:03 -0500, Robert Haas wrote:
>> On Thu, Jan 14, 2016 at 10:56 AM, Andres Freund <andres@anarazel.de> wrote:
>> I think your idea of a data structure that encapsulates a set of events
>> for which to wait is probably a good one. WaitLatch doesn't seem like
>> a great name. Maybe WaitEventSet, and then we can have
>> WaitLatch(&latch) and WaitEvents(&eventset).
>
> Hm, I'd like to have latch in the name. It seems far from improbable to
> have another wait data structure. LatchEventSet maybe? The wait would be
> implied by WaitLatch.

I can live with that.

> So effectively we'd create a LatchEventSet feLatchSet; somewhere global
> (and update it from a backend local to the proc latch in
> SwitchToSharedLatch/SwitchBackToLocalLatch()). Then change all WaitLatch
> calls to refer to those.

Sure.

> Do we want to provide a backward compatible API for all this? I'm fine
> either way.

How would that work?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Jan 14, 2016 at 12:06 PM, Andres Freund <andres@anarazel.de> wrote:
>> Do we want to provide a backward compatible API for all this? I'm fine
>> either way.

> How would that work?

I see no great need to be backwards-compatible on this, especially if it would complicate matters at all. I doubt there's a lot of third-party code using WaitLatch right now. Just make sure there's an obvious compile failure for anyone who is.

regards, tom lane
On 2016-01-14 12:07:23 -0500, Robert Haas wrote:
> > Do we want to provide a backward compatible API for all this? I'm fine
> > either way.
>
> How would that work?

I'm thinking of something like:

int WaitOnLatchSet(LatchEventSet *set, int wakeEvents, long timeout);

int
WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock, long timeout)
{
    LatchEventSet set;

    LatchEventSetInit(&set, latch);

    if (sock != PGINVALID_SOCKET)
        LatchEventSetAddSock(&set, sock);

    return WaitOnLatchSet(&set, wakeEvents, timeout);
}

I think we'll need to continue having wakeEvents and timeout parameters for WaitOnLatchSet, we quite frequently want to wait for socket readability/writability, not wait on the socket, or have/not have timeouts.
On 2016-01-14 18:14:21 +0100, Andres Freund wrote:
> I'm thinking of something like:
>
> int WaitOnLatchSet(LatchEventSet *set, int wakeEvents, long timeout);
>
> int
> WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock, long timeout)
> {
>     LatchEventSet set;
>
>     LatchEventSetInit(&set, latch);
>
>     if (sock != PGINVALID_SOCKET)
>         LatchEventSetAddSock(&set, sock);
>
>     return WaitOnLatchSet(&set, wakeEvents, timeout);
> }
>
> I think we'll need to continue having wakeEvents and timeout parameters
> for WaitOnLatchSet, we quite frequently want to wait for socket
> readability/writability, not wait on the socket, or have/not have
> timeouts.

This brings me to something related: I'm wondering if we shouldn't merge unix/win32_latch.c. If we go this route it seems like the amount of shared infrastructure will further increase. The difference between win32 and, say, the select code isn't much bigger than the difference between select/poll. epoll/win32 are probably more similar than that actually.

Andres
Andres Freund <andres@anarazel.de> writes:
> This brings me to something related: I'm wondering if we shouldn't merge
> unix/win32_latch.c.

Well, it's duplicated code on the one hand versus maze-of-ifdefs on the other. Feel free to try it and see, but I'm unsure it'd be an improvement.

regards, tom lane
On 2016-01-14 18:14:21 +0100, Andres Freund wrote:
> On 2016-01-14 12:07:23 -0500, Robert Haas wrote:
> > > Do we want to provide a backward compatible API for all this? I'm fine
> > > either way.
> >
> > How would that work?
>
> I'm thinking of something like:
>
> int WaitOnLatchSet(LatchEventSet *set, int wakeEvents, long timeout);
>
> int
> WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock, long timeout)
> {
>     LatchEventSet set;
>
>     LatchEventSetInit(&set, latch);
>
>     if (sock != PGINVALID_SOCKET)
>         LatchEventSetAddSock(&set, sock);
>
>     return WaitOnLatchSet(&set, wakeEvents, timeout);
> }
>
> I think we'll need to continue having wakeEvents and timeout parameters
> for WaitOnLatchSet, we quite frequently want to wait for socket
> readability/writability, not wait on the socket, or have/not have
> timeouts.

Hm. If we really want to support multiple sockets at some point the above WaitOnLatchSet signature isn't going to fly, because it won't support figuring out which fd the event triggered on.

So it seems we'd need to return something like

struct LatchEvent
{
    enum LatchEventType {WL_LATCH_EVENT, WL_TIMEOUT, WL_SOCKET_EVENT, WL_POSTMASTER_EVENT, ...} event_type;
    int mask;
    pgsocket event_sock;
};

that'd also allow us to extend this to return multiple events if we want that at some point. Alternatively we could add a pgsocket* argument, but that doesn't really seem much better.

Not super happy about the above proposal.

Andres
On Thu, Jan 14, 2016 at 12:14 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-01-14 12:07:23 -0500, Robert Haas wrote:
>> > Do we want to provide a backward compatible API for all this? I'm fine
>> > either way.
>>
>> How would that work?
>
> I'm thinking of something like:
>
> int WaitOnLatchSet(LatchEventSet *set, int wakeEvents, long timeout);
>
> int
> WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock, long timeout)
> {
>     LatchEventSet set;
>
>     LatchEventSetInit(&set, latch);
>
>     if (sock != PGINVALID_SOCKET)
>         LatchEventSetAddSock(&set, sock);
>
>     return WaitOnLatchSet(&set, wakeEvents, timeout);
> }
>
> I think we'll need to continue having wakeEvents and timeout parameters
> for WaitOnLatchSet, we quite frequently want to wait for socket
> readability/writability, not wait on the socket, or have/not have
> timeouts.

Well, if we ever wanted to support multiple FDs, we'd need the readability/writeability thing to be per-fd, not per-set.

Overall, if this is what you have in mind for backward compatibility, I rate it M for Meh. Let's just break compatibility and people will have to update their code. That shouldn't be hard, and if we don't make people do it when we make the change, then we'll be stuck with the backward-compatibility interface for a decade. I doubt it's worth it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hello, I am one who wants waiting on many sockets at once.

At Thu, 14 Jan 2016 18:55:51 +0100, Andres Freund <andres@anarazel.de> wrote in <20160114175551.GM10941@awork2.anarazel.de>
> On 2016-01-14 18:14:21 +0100, Andres Freund wrote:
> > On 2016-01-14 12:07:23 -0500, Robert Haas wrote:
> > > > Do we want to provide a backward compatible API for all this? I'm fine
> > > > either way.
> > >
> > > How would that work?
> >
> > I'm thinking of something like:
> >
> > int WaitOnLatchSet(LatchEventSet *set, int wakeEvents, long timeout);
> >
> > int
> > WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock, long timeout)
> > {
> >     LatchEventSet set;
> >
> >     LatchEventSetInit(&set, latch);
> >
> >     if (sock != PGINVALID_SOCKET)
> >         LatchEventSetAddSock(&set, sock);
> >
> >     return WaitOnLatchSet(&set, wakeEvents, timeout);
> > }
> >
> > I think we'll need to continue having wakeEvents and timeout parameters
> > for WaitOnLatchSet, we quite frequently want to wait for socket
> > readability/writability, not wait on the socket, or have/not have
> > timeouts.
>
> Hm. If we really want to support multiple sockets at some point the
> above WaitOnLatchSet signature isn't going to fly, because it won't
> support figuring out which fd the event triggered on.
>
> So it seems we'd need to return something like
>
> struct LatchEvent
> {
>     enum LatchEventType {WL_LATCH_EVENT, WL_TIMEOUT, WL_SOCKET_EVENT, WL_POSTMASTER_EVENT, ...} event_type;
>     int mask;
>     pgsocket event_sock;
> };
>
> that'd also allow us to extend this to return multiple events if we want
> that at some point. Alternatively we could add a pgsocket* argument, but
> that doesn't really seem much better.
>
> Not super happy about the above proposal.

How about allowing registration of a callback for every waiting socket? The signature of the callback function would be something like:

enum LATCH_CALLBACK_STATE LatchWaitCallback(pgsocket event_sock, enum LatchEventType type, int mask, void *arg);

It could return, for instance, LCB_CONTINUE, LCB_BREAK or LCB_IMMEDBREAK, and if any one of them returns LCB_BREAK, the wait will break after, maybe, calling all callbacks for fired events.

We could have predefined callbacks for every event which only set a corresponding flag and return LCB_BREAK:

/* waiting set has been constructed so far */
if (!WaitOnLatchSet(&set))
    error();
if (is_sock_readable[sockid]) {}   /* is_sock_readable would be global */
if (is_sock_writable[sockid]) {}   /* is_sock_writable would be global */
/* any other types of trigger would be processed elsewhere */

Although it might be slow if we have an enormous number of sockets fired at once, I suppose it returns only for a few sockets in most cases.

I don't see the big picture of the whole latch/signalling mechanism with this, but callbacks might be usable for signalling on many sockets.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center
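[Editorial note: to make the shape of that proposal a little more concrete, a stripped-down poll()-based driver might look like the following. Every name here is invented for illustration; none of this is from Kyotaro's actual PoC patch.]

    #include <poll.h>

    typedef enum { LCB_CONTINUE, LCB_BREAK, LCB_IMMEDBREAK } LatchCallbackState;

    typedef LatchCallbackState (*LatchWaitCallback) (int fd, short revents, void *arg);

    typedef struct WaitObject
    {
        int                 fd;
        short               events;     /* requested events: POLLIN and/or POLLOUT */
        LatchWaitCallback   cb;
        void               *cb_arg;
    } WaitObject;

    /*
     * Wait until at least one registered object fires, then run the callback
     * of every fired object.  Returns the number of callbacks that asked to
     * break, or -1 on poll() error.
     */
    static int
    WaitOnObjects(WaitObject *objs, int nobjs, int timeout_ms)
    {
        struct pollfd pfds[32];     /* small fixed maximum, for the sketch only */
        int         nbreaks = 0;
        int         rc, i;

        for (i = 0; i < nobjs; i++)
        {
            pfds[i].fd = objs[i].fd;
            pfds[i].events = objs[i].events;
            pfds[i].revents = 0;
        }

        rc = poll(pfds, nobjs, timeout_ms);
        if (rc < 0)
            return -1;

        for (i = 0; i < nobjs; i++)
        {
            if (pfds[i].revents != 0 && objs[i].cb != NULL)
            {
                LatchCallbackState s = objs[i].cb(objs[i].fd, pfds[i].revents, objs[i].cb_arg);

                if (s != LCB_CONTINUE)
                    nbreaks++;
                if (s == LCB_IMMEDBREAK)
                    break;
            }
        }
        return nbreaks;
    }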
On Thu, Jan 14, 2016 at 12:28 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andres Freund <andres@anarazel.de> writes:
>> This brings me to something related: I'm wondering if we shouldn't merge
>> unix/win32_latch.c.
>
> Well, it's duplicated code on the one hand versus maze-of-ifdefs on the
> other. Feel free to try it and see, but I'm unsure it'd be an improvement.

I think we should either get this fixed RSN or revert the problematic commit until we get it fixed. I'd be rather disappointed about the latter because I think this was a very good thing on the merits, but probably not good enough to justify taking the performance hit over the long term.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> I think we should either get this fixed RSN or revert the problematic
> commit until we get it fixed. I'd be rather disappointed about the
> latter because I think this was a very good thing on the merits, but
> probably not good enough to justify taking the performance hit over
> the long term.

Since it's only in HEAD, I'm not seeing the urgency of reverting it. However, it'd be a good idea to put this on the 9.6 open items list (have we got such a page yet?) to make sure it gets addressed before beta.

regards, tom lane
On Thu, Feb 11, 2016 at 12:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> I think we should either get this fixed RSN or revert the problematic
>> commit until we get it fixed. I'd be rather disappointed about the
>> latter because I think this was a very good thing on the merits, but
>> probably not good enough to justify taking the performance hit over
>> the long term.
>
> Since it's only in HEAD, I'm not seeing the urgency of reverting it.
> However, it'd be a good idea to put this on the 9.6 open items list
> (have we got such a page yet?) to make sure it gets addressed before
> beta.

One problem is that it makes for misleading results if you try to benchmark 9.5 against 9.6.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2016-02-11 12:50:58 -0500, Robert Haas wrote:
> On Thu, Feb 11, 2016 at 12:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Since it's only in HEAD, I'm not seeing the urgency of reverting it.
> > However, it'd be a good idea to put this on the 9.6 open items list
> > (have we got such a page yet?) to make sure it gets addressed before
> > beta.
>
> One problem is that it makes for misleading results if you try to
> benchmark 9.5 against 9.6.

You need a really beefy box to show the problem. On a large/new 2 socket machine the performance regression is in the 1-3% range for a pgbench of SELECT 1. So it's not like it's immediately showing up for everyone.

Putting it on the open items list sounds good to me.

Regards,

Andres
On Thu, Feb 11, 2016 at 12:53 PM, Andres Freund <andres@anarazel.de> wrote:
>> One problem is that it makes for misleading results if you try to
>> benchmark 9.5 against 9.6.
>
> You need a really beefy box to show the problem. On a large/new 2 socket
> machine the performance regression is in the 1-3% range for a pgbench of
> SELECT 1. So it's not like it's immediately showing up for everyone.
>
> Putting it on the open items list sounds good to me.

Well, OK, I've done that then. I don't really agree that it's not a problem; the OP said he saw a 3x regression, and some of my colleagues doing benchmarking are complaining about this commit, too. It doesn't seem like much of a stretch to think that it might be affecting other people as well.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2016-02-11 13:09:27 -0500, Robert Haas wrote:
> On Thu, Feb 11, 2016 at 12:53 PM, Andres Freund <andres@anarazel.de> wrote:
> >> One problem is that it makes for misleading results if you try to
> >> benchmark 9.5 against 9.6.
> >
> > You need a really beefy box to show the problem. On a large/new 2 socket
> > machine the performance regression is in the 1-3% range for a pgbench of
> > SELECT 1. So it's not like it's immediately showing up for everyone.
> >
> > Putting it on the open items list sounds good to me.
>
> Well, OK, I've done that then. I don't really agree that it's not a
> problem; the OP said he saw a 3x regression, and some of my colleagues
> doing benchmarking are complaining about this commit, too. It doesn't
> seem like much of a stretch to think that it might be affecting other
> people as well.

Well, I can't do anything about that right now. I won't have the time to whip up the new/more complex API we discussed upthread in the next few days. So either we go with a simpler API (e.g. pretty much a cleaned up version of my earlier patch), revert the postmaster death check, or somebody else has to take the lead in renovating, or we wait...

Andres
On Thu, Feb 11, 2016 at 1:19 PM, Andres Freund <andres@anarazel.de> wrote:
>> > Putting it on the open items list sounds good to me.
>>
>> Well, OK, I've done that then. I don't really agree that it's not a
>> problem; the OP said he saw a 3x regression, and some of my colleagues
>> doing benchmarking are complaining about this commit, too. It doesn't
>> seem like much of a stretch to think that it might be affecting other
>> people as well.
>
> Well, I can't do anything about that right now. I won't have the time to
> whip up the new/more complex API we discussed upthread in the next few
> days. So either we go with a simpler API (e.g. pretty much a cleaned up
> version of my earlier patch), revert the postmaster death check, or
> somebody else has to take the lead in renovating, or we wait...

Well, I thought we could just revert the patch until you had time to deal with it, and then put it back in. That seemed like a simple and practical option from here, and I don't think I quite understand why you and Tom don't like it. I don't have a problem with deferring to the majority will here, but I would sort of like to understand the reason for the majority will.

BTW, if need be, I can look for an EnterpriseDB resource to work on this. It won't likely be me, though.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Feb 11, 2016 at 1:19 PM, Andres Freund <andres@anarazel.de> wrote:
>> Well, I can't do anything about that right now. I won't have the time to
>> whip up the new/more complex API we discussed upthread in the next few
>> days. So either we go with a simpler API (e.g. pretty much a cleaned up
>> version of my earlier patch), revert the postmaster death check, or
>> somebody else has to take the lead in renovating, or we wait...

> Well, I thought we could just revert the patch until you had time to
> deal with it, and then put it back in. That seemed like a simple and
> practical option from here, and I don't think I quite understand why
> you and Tom don't like it.

Don't particularly want the git history churn, if we expect that the patch will ship as-committed in 9.6. If it becomes clear that the performance fix is unlikely to happen, we can revert then.

If the performance change were an issue for a lot of testing, I'd agree with a temporary revert, but I concur with Andres that it's not blocking much. Anybody who does have an issue there can revert locally, no?

regards, tom lane
On Thu, Feb 11, 2016 at 1:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> Well, I thought we could just revert the patch until you had time to
>> deal with it, and then put it back in. That seemed like a simple and
>> practical option from here, and I don't think I quite understand why
>> you and Tom don't like it.
>
> Don't particularly want the git history churn, if we expect that the
> patch will ship as-committed in 9.6. If it becomes clear that the
> performance fix is unlikely to happen, we can revert then.
>
> If the performance change were an issue for a lot of testing, I'd agree
> with a temporary revert, but I concur with Andres that it's not blocking
> much. Anybody who does have an issue there can revert locally, no?

True. Maybe we'll just have to start doing that for EnterpriseDB benchmarking as standard practice. Not sure everybody who is benchmarking will realize the issue though.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
[PoC] WaitLatchOrSocketMulti (Re: Performance degradation in commit ac1d794)
From: Kyotaro HORIGUCHI
Hello, I don't see how ac1d794 will be dealt with, but I tried an example implementation of a multi-socket version of WaitLatchOrSocket using callbacks, on top of the current master where ac1d794 has not been reverted yet.

At Thu, 14 Jan 2016 13:46:44 -0500, Robert Haas <robertmhaas@gmail.com> wrote in <CA+TgmoYBa8TJRGS07JCSLKpqGkrRd5hLpirvwp36s=83ChmQDA@mail.gmail.com>
> On Thu, Jan 14, 2016 at 12:14 PM, Andres Freund <andres@anarazel.de> wrote:
> > On 2016-01-14 12:07:23 -0500, Robert Haas wrote:
> >> > Do we want to provide a backward compatible API for all this? I'm fine
> >> > either way.
> >>
> >> How would that work?
> >
> > I'm thinking of something like:
> >
> > int WaitOnLatchSet(LatchEventSet *set, int wakeEvents, long timeout);
> >
> > int
> > WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock, long timeout)
> > {
> >     LatchEventSet set;
> >
> >     LatchEventSetInit(&set, latch);
> >
> >     if (sock != PGINVALID_SOCKET)
> >         LatchEventSetAddSock(&set, sock);
> >
> >     return WaitOnLatchSet(&set, wakeEvents, timeout);
> > }
> >
> > I think we'll need to continue having wakeEvents and timeout parameters
> > for WaitOnLatchSet, we quite frequently want to wait for socket
> > readability/writability, not wait on the socket, or have/not have
> > timeouts.
>
> Well, if we ever wanted to support multiple FDs, we'd need the
> readability/writeability thing to be per-fd, not per-set.
>
> Overall, if this is what you have in mind for backward compatibility,
> I rate it M for Meh. Let's just break compatibility and people will
> have to update their code. That shouldn't be hard, and if we don't
> make people do it when we make the change, then we'll be stuck with
> the backward-compatibility interface for a decade. I doubt it's worth
> it.

The API is similar to what Robert suggested, but different, because that would be a bit too complicated for the most common cases. So this example implementation uses a style intermediate between the current API and Robert's suggestion, and uses callbacks as I proposed:

int WaitLatchOrSocketMulti(pgwaitobject *wobjs, int nobjs, long timeout);

This is implemented only for poll, not for select. A sample usage is seen in secure_read():

> pgwaitobject objs[3];
...
> InitWaitLatch(objs[0], MyLatch);
> InitWaitPostmasterDeath(objs[1]);
> InitWaitSocket(objs[2], port->sock, waitfor);
>
> w = WaitLatchOrSocketMulti(objs, 3, 0);
> // w = WaitLatchOrSocket(MyLatch,
> //                       WL_LATCH_SET | WL_POSTMASTER_DEATH | waitfor,
> //                       port->sock, 0);

The core of the function looks like the following. It runs callbacks for every fired event:

> rc = poll(pfds, nfds, (int) cur_timeout);
...
> if (rc < 0)
...
> else
> {
>     for (i = 0; i < nfds; i++)
>     {
>         wobjs[i].retEvents = 0;
>         if (pfds[i].revents && wobjs[i].cb)
>             result |= wobjs[i].cb(&wobjs[i], pfds[i].revents);
>
>         if (result & WL_IMMEDIATELY_BREAK)
>             break;
>     }
> }

In the above part, poll()'s events are passed to the callbacks, so the callbacks may have a different implementation for select(). To have a callback for sockets, the initializer could be as follows:

> InitWaitSocketCB(wait_obj, sock, event, your_callback);

If we want the waiting-object array to be independent from specific functions, to achieve asynchronous handling of socket events, that could be realised by providing a set of wrapper functions, exactly as Robert said above.

Does this make sense? Does anyone have any opinions or thoughts?

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center
On 2016-02-08 17:49:18 +0900, Kyotaro HORIGUCHI wrote:
> How about allowing registration of a callback for every waiting
> socket? The signature of the callback function would be something like:

I don't think a callback based API is going to serve us well. Most of the current latch callers would get noticeably more complex that way. And a number of them will benefit from latches using epoll internally.

Greetings,

Andres Freund
On 2016-01-14 12:07:23 -0500, Robert Haas wrote:
> On Thu, Jan 14, 2016 at 12:06 PM, Andres Freund <andres@anarazel.de> wrote:
> > On 2016-01-14 11:31:03 -0500, Robert Haas wrote:
> >> I think your idea of a data structure that encapsulates a set of events
> >> for which to wait is probably a good one. WaitLatch doesn't seem like
> >> a great name. Maybe WaitEventSet, and then we can have
> >> WaitLatch(&latch) and WaitEvents(&eventset).
> >
> > Hm, I'd like to have latch in the name. It seems far from improbable to
> > have another wait data structure. LatchEventSet maybe? The wait would be
> > implied by WaitLatch.
>
> I can live with that.

How about the following sketch of an API

typedef struct LatchEvent
{
    uint32 events;  /* interesting events */
    uint32 revents; /* returned events */
    int    fd;      /* fd associated with event */
} LatchEvent;

typedef struct LatchEventSet
{
    int        nevents;
    LatchEvent events[FLEXIBLE_ARRAY_MEMBER];
} LatchEventSet;

/*
 * Create a LatchEventSet with space for nevents different events to wait for.
 *
 * latch may be NULL.
 */
extern LatchEventSet *CreateLatchEventSet(int nevents, Latch *latch);

/* ---
 * Add an event to the set. Possible events are:
 * - WL_LATCH_SET: Wait for the latch to be set
 * - WL_SOCKET_READABLE: Wait for socket to become readable
 *   can be combined in one event with WL_SOCKET_WRITEABLE
 * - WL_SOCKET_WRITABLE: Wait for socket to become writable
 *   can be combined with WL_SOCKET_READABLE
 * - WL_POSTMASTER_DEATH: Wait for postmaster to die
 */
extern void AddLatchEventToSet(LatchEventSet *set, uint32 events, int fd);

/*
 * Wait for any events added to the set to happen, or until the timeout is
 * reached.
 *
 * The return value is the union of all the events that were detected. This
 * allows to avoid having to look into the associated events[i].revents
 * fields.
 */
extern uint32 WaitLatchEventSet(LatchEventSet *set, long timeout);

I've two questions:

- Is there any benefit of being able to wait for more than one latch? I'm inclined to not allow that for now, that'd make the patch bigger, and I don't see a use-case right now.

- Given current users we don't need a large amount of events, so having to iterate through the registered events doesn't seem bothersome. We could however change the api to be something like

  int WaitLatchEventSet(LatchEventSet *set, OccurredEvents *, int nevents, long timeout);

  which would return the number of events that happened, and would basically "fill" one of the (usually stack allocated) OccurredEvents structures with what happened.

Comments?
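[Editorial note: for illustration, a caller such as secure_read() might use the sketched API roughly as below. This is hypothetical code written against the sketch above, not committed PostgreSQL code; it additionally assumes that passing the latch to CreateLatchEventSet implicitly registers WL_LATCH_SET and that WL_POSTMASTER_DEATH needs no fd, neither of which is spelled out in the sketch.]

    /* wait for client data, a latch wakeup, or postmaster death */
    static void
    wait_for_client_data(Latch *my_latch, int client_sock)
    {
        LatchEventSet *set;
        uint32         events;

        set = CreateLatchEventSet(3, my_latch);             /* latch assumed to imply WL_LATCH_SET */
        AddLatchEventToSet(set, WL_SOCKET_READABLE, client_sock);
        AddLatchEventToSet(set, WL_POSTMASTER_DEATH, -1);   /* assumed: no fd needed for this event */

        for (;;)
        {
            events = WaitLatchEventSet(set, -1L);           /* block with no timeout */

            if (events & WL_POSTMASTER_DEATH)
                proc_exit(1);
            if (events & WL_LATCH_SET)
                ResetLatch(my_latch);                       /* handle interrupts, then loop */
            if (events & WL_SOCKET_READABLE)
                break;                                      /* data is ready to be read */
        }
    }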
On Wed, Mar 16, 2016 at 2:52 PM, Andres Freund <andres@anarazel.de> wrote: > How about the following sketch of an API > > typedef struct LatchEvent > { > uint32 events; /* interesting events */ > uint32 revents; /* returned events */ > int fd; /* fd associated with event */ > } LatchEvent; > > typedef struct LatchEventSet > { > int nevents; > LatchEvent events[FLEXIBLE_ARRAY_MEMBER]; > } LatchEventSet; > > /* > * Create a LatchEventSet with space for nevents different events to wait for. > * > * latch may be NULL. > */ > extern LatchEventSet *CreateLatchEventSet(int nevents, Latch *latch); We might be able to rejigger this so that it didn't require palloc, if we got rid of FLEXIBLE_ARRAY_MEMBER and passed int nevents and LatchEvent * separately to WaitLatchThingy(). But I guess maybe this will be infrequent enough not to matter. > I've two questions: > - Is there any benefit of being able to wait for more than one latch? > I'm inclined to not allow that for now, that'd make the patch bigger, > and I don't see a use-case right now. I don't see a use case, either. > - Given current users we don't need a large amount of events, so having > to iterate through the registered events doesn't seem bothersome. We > could however change the api to be something like > > int WaitLatchEventSet(LatchEventSet *set, OccurredEvents *, int nevents, long timeout); > > which would return the number of events that happened, and would > basically "fill" one of the (usually stack allocated) OccurredEvent > structures with what happened. I definitely think something along these lines is useful. I want to be able to have an Append node with 100 ForeignScans under it and kick off all the scans asynchronously and wait for all of the FDs at once. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2016-03-16 15:08:07 -0400, Robert Haas wrote: > On Wed, Mar 16, 2016 at 2:52 PM, Andres Freund <andres@anarazel.de> wrote: > > How about the following sketch of an API > > > > typedef struct LatchEvent > > { > > uint32 events; /* interesting events */ > > uint32 revents; /* returned events */ > > int fd; /* fd associated with event */ > > } LatchEvent; > > > > typedef struct LatchEventSet > > { > > int nevents; > > LatchEvent events[FLEXIBLE_ARRAY_MEMBER]; > > } LatchEventSet; > > > > /* > > * Create a LatchEventSet with space for nevents different events to wait for. > > * > > * latch may be NULL. > > */ > > extern LatchEventSet *CreateLatchEventSet(int nevents, Latch *latch); > > We might be able to rejigger this so that it didn't require palloc, if > we got rid of FLEXIBLE_ARRAY_MEMBER and passed int nevents and > LatchEvent * separately to WaitLatchThingy(). But I guess maybe this > will be infrequent enough not to matter. I think we'll basically end up allocating them once for the frequent callsites. > > - Given current users we don't need a large amount of events, so having > > to iterate through the registered events doesn't seem bothersome. We > > could however change the api to be something like > > > > int WaitLatchEventSet(LatchEventSet *set, OccurredEvents *, int nevents, long timeout); > > > > which would return the number of events that happened, and would > > basically "fill" one of the (usually stack allocated) OccurredEvent > > structures with what happened. > > I definitely think something along these lines is useful. I want to > be able to have an Append node with 100 ForeignScans under it and kick > off all the scans asynchronously and wait for all of the FDs at once. So you'd like to get only an event for the FD with data back? Or are you ok with iterating through hundred elements in an array, to see which are ready? Andres
On Wed, Mar 16, 2016 at 3:25 PM, Andres Freund <andres@anarazel.de> wrote: >> > - Given current users we don't need a large amount of events, so having >> > to iterate through the registered events doesn't seem bothersome. We >> > could however change the api to be something like >> > >> > int WaitLatchEventSet(LatchEventSet *set, OccurredEvents *, int nevents, long timeout); >> > >> > which would return the number of events that happened, and would >> > basically "fill" one of the (usually stack allocated) OccurredEvent >> > structures with what happened. >> >> I definitely think something along these lines is useful. I want to >> be able to have an Append node with 100 ForeignScans under it and kick >> off all the scans asynchronously and wait for all of the FDs at once. > > So you'd like to get only an event for the FD with data back? Or are you > ok with iterating through hundred elements in an array, to see which are > ready? I'd like to get an event back for the FD with data. Iterating sounds like it could be really slow. Say you get lots of little packets back from the same connection, while the others are idle. Now you've got to keep iterating through them all over and over again. Blech. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2016-03-16 15:41:28 -0400, Robert Haas wrote: > On Wed, Mar 16, 2016 at 3:25 PM, Andres Freund <andres@anarazel.de> wrote: > >> > - Given current users we don't need a large amount of events, so having > >> > to iterate through the registered events doesn't seem bothersome. We > >> > could however change the api to be something like > >> > > >> > int WaitLatchEventSet(LatchEventSet *set, OccurredEvents *, int nevents, long timeout); > >> > > >> > which would return the number of events that happened, and would > >> > basically "fill" one of the (usually stack allocated) OccurredEvent > >> > structures with what happened. > >> > >> I definitely think something along these lines is useful. I want to > >> be able to have an Append node with 100 ForeignScans under it and kick > >> off all the scans asynchronously and wait for all of the FDs at once. > > > > So you'd like to get only an event for the FD with data back? Or are you > > ok with iterating through hundred elements in an array, to see which are > > ready? > > I'd like to get an event back for the FD with data. Iterating sounds > like it could be really slow. Say you get lots of little packets back > from the same connection, while the others are idle. Now you've got > to keep iterating through them all over and over again. Blech. Well, that's what poll() and select() require you to do internally anyway, even if we abstract it away. But most platforms have better implementations (epoll, kqueue, ...), so it seems fair to design for those. Andres
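[Editorial note: as a concrete, entirely hypothetical illustration of the difference being discussed, the OccurredEvents-style return floated earlier lets the caller touch only descriptors that actually fired, instead of rescanning all registered ones. The OccurredEvents type, its fields, and handle_ready_fd() are assumptions, not anything from the posted patches.]

/* Hypothetical consumption loop for the "return only what fired" variant. */
static void
drain_ready_subplans(LatchEventSet *set)
{
    OccurredEvents occurred[64];    /* proposed type; does not exist yet */
    int            nevents;
    int            i;

    nevents = WaitLatchEventSet(set, occurred, lengthof(occurred), -1);

    for (i = 0; i < nevents; i++)
    {
        /*
         * Only descriptors that actually became ready are returned, so an
         * Append node over 100 mostly-idle ForeignScans pays nothing for
         * the idle ones.
         */
        if (occurred[i].events & WL_SOCKET_READABLE)
            handle_ready_fd(occurred[i].fd);    /* hypothetical handler */
    }
}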
Hi,

On 2016-03-16 15:41:28 -0400, Robert Haas wrote:
> On Wed, Mar 16, 2016 at 3:25 PM, Andres Freund <andres@anarazel.de> wrote:
> >> > - Given current users we don't need a large amount of events, so having
> >> > to iterate through the registered events doesn't seem bothersome. We
> >> > could however change the api to be something like
> >> >
> >> > int WaitLatchEventSet(LatchEventSet *set, OccurredEvents *, int nevents, long timeout);
> >> >
> >> > which would return the number of events that happened, and would
> >> > basically "fill" one of the (usually stack allocated) OccurredEvent
> >> > structures with what happened.
> >>
> >> I definitely think something along these lines is useful. I want to
> >> be able to have an Append node with 100 ForeignScans under it and kick
> >> off all the scans asynchronously and wait for all of the FDs at once.
> >
> > So you'd like to get only an event for the FD with data back? Or are you
> > ok with iterating through hundred elements in an array, to see which are
> > ready?
>
> I'd like to get an event back for the FD with data. Iterating sounds
> like it could be really slow. Say you get lots of little packets back
> from the same connection, while the others are idle. Now you've got
> to keep iterating through them all over and over again. Blech.

I've a (working) WIP version that works like I think you want. It's implemented in patch 05 (rework with just poll() support) and 06 (add epoll support). It's based on patches posted here earlier, but these aren't interesting for the discussion.

The API is now:

typedef struct WaitEventSet WaitEventSet;

typedef struct WaitEvent
{
    int    pos;     /* position in the event data structure */
    uint32 events;  /* tripped events */
    int    fd;      /* fd associated with event */
} WaitEvent;

/*
 * Create a WaitEventSet with space for nevents different events to wait for.
 *
 * latch may be NULL.
 */
extern WaitEventSet *CreateWaitEventSet(int nevents);

/* ---
 * Add an event to the set. Possible events are:
 * - WL_LATCH_SET: Wait for the latch to be set
 * - WL_POSTMASTER_DEATH: Wait for postmaster to die
 * - WL_SOCKET_READABLE: Wait for socket to become readable;
 *   can be combined in one event with WL_SOCKET_WRITEABLE
 * - WL_SOCKET_WRITEABLE: Wait for socket to become writable;
 *   can be combined with WL_SOCKET_READABLE
 *
 * Returns the offset in WaitEventSet->events (starting from 0), which can be
 * used to modify previously added wait events.
 */
extern int AddWaitEventToSet(WaitEventSet *set, uint32 events, int fd, Latch *latch);

/*
 * Change the event mask and, if applicable, the associated latch of a
 * WaitEvent.
 */
extern void ModifyWaitEvent(WaitEventSet *set, int pos, uint32 events, Latch *latch);

/*
 * Wait for events added to the set to happen, or until the timeout is
 * reached. At most nevents occurred events are returned.
 *
 * Returns the number of events that occurred, or 0 if the timeout was reached.
 */
extern int WaitEventSetWait(WaitEventSet *set, long timeout, WaitEvent *occurred_events, int nevents);

I've for now left the old latch API in place, and only converted be-secure.c to the new style. I'd appreciate some feedback before I go around and convert and polish everything.

Questions:

* I'm kinda inclined to merge the win32 and unix latch implementations. There's already a fair bit in common, and this is just going to increase that amount.

* Right now the caller has to allocate the WaitEvents he's waiting for locally (likely on the stack), but we also could allocate them as part of the WaitEventSet. Not sure if that'd be a benefit.

* I can do a blind rewrite of the windows implementation, but I'm obviously not going to get that entirely right. So I need some help from a windows person to test this.

* This approach, with a 'stateful' wait event data structure, will actually allow fixing a couple of lingering bugs we have on the windows port. C.f. http://www.postgresql.org/message-id/4351.1336927207@sss.pgh.pa.us

- Andres
Attachment
- 0001-Access-the-correct-pollfd-when-checking-for-socket-e.patch
- 0002-Make-it-easier-to-choose-the-used-waiting-primitive-.patch
- 0003-Error-out-if-waiting-on-socket-readiness-without-a-s.patch
- 0004-Only-clear-unix_latch.c-s-self-pipe-if-it-actually-c.patch
- 0005-WIP-WaitEvent-API.patch
- 0006-WIP-Use-epoll-for-Wait-Event-API-if-available.patch
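[Editorial note: purely as an illustration of how a caller might use the API declared in the mail above (roughly the shape of the be-secure.c conversion it mentions), a hedged sketch follows. The blocking behaviour of -1, the way the latch is registered, and the return handling are assumptions based on the declarations, not the final committed code.]

/* Hypothetical sketch of a be-secure.c-style wait on the new WaitEventSet API. */
static void
wait_for_socket_ready(Port *port, uint32 waitfor)
{
    WaitEventSet *set;
    WaitEvent     event;
    int           sockpos;

    set = CreateWaitEventSet(3);
    AddWaitEventToSet(set, WL_LATCH_SET, PGINVALID_SOCKET, MyLatch);
    AddWaitEventToSet(set, WL_POSTMASTER_DEATH, PGINVALID_SOCKET, NULL);
    sockpos = AddWaitEventToSet(set, waitfor, port->sock, NULL);

    for (;;)
    {
        /* flip between read/write interest without re-registering the fd */
        ModifyWaitEvent(set, sockpos, waitfor, NULL);

        if (WaitEventSetWait(set, -1 /* assumed: block */, &event, 1) == 0)
            continue;           /* timeout; not expected with -1 */

        if (event.events & WL_POSTMASTER_DEATH)
            proc_exit(1);
        if (event.events & WL_LATCH_SET)
            ResetLatch(MyLatch);
        if (event.events & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE))
            break;              /* the socket is ready; retry the I/O */
    }
}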
On Thu, Mar 17, 2016 at 7:34 AM, Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> * I can do a blind rewrite of the windows implementation, but I'm
> obviously not going to get that entirely right. So I need some help
> from a windows person to test this.
>
I can help you verify the windows implementation.
On Wed, Mar 16, 2016 at 10:04 PM, Andres Freund <andres@anarazel.de> wrote: > Questions: > * I'm kinda inclined to merge the win32 and unix latch > implementations. There's already a fair bit in common, and this is > just going to increase that amount. Don't care either way. > * Right now the caller has to allocate the WaitEvents he's waiting for > locally (likely on the stack), but we also could allocate them as part > of the WaitEventSet. Not sure if that'd be a benefit. I'm not seeing this. What do you mean? 0001: Looking at this again, I'm no longer sure this is a bug. Doesn't your patch just check the same conditions in the opposite order? 0002: I think I reviewed this before. Boring. Just commit it already. 0003: Mostly boring. But the change to win32_latch.c seems to remove an unrelated check. 0004: + * drain it everytime WaitLatchOrSocket() is used. Should the + * pipe-buffer fill up in some scenarios - widly unlikely - we're every time wildly Why is it wildly (or widly) unlikely? The rejiggering this does between what is on which element of pfds[] appears to be unrelated to the ostensible purpose of the patch. + * Check again wether latch is set, the arrival of a signal/self-byte whether. Also not clearly related to the patch's main purpose. /* at least one event occurred, so check masks */ + if (FD_ISSET(selfpipe_readfd, &input_mask)) + { + /* There's data in the self-pipe, clear it. */ + drainSelfPipe(); + } The comment just preceding this added hunk now seems to be out of place, and maybe inaccurate as well. I think the new code could have a bit more detailed comment. My understanding is something like /* Since we didn't clear the self-pipe before attempting to wait, select() may have returned immediately even though there has been no recent change to the state of the latch. To prevent busy-looping, we must clear the pipe before attempting to wait again. */ I'll look at 0005 next, but thought I would send these comments along first. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Mar 17, 2016 at 9:01 AM, Robert Haas <robertmhaas@gmail.com> wrote: > I'll look at 0005 next, but thought I would send these comments along first. 0005: This is obviously very much WIP, but I think the overall direction of it is good. 0006: Same. I think you should use PGINVALID_SOCKET rather than -1 in various places in various patches in this series, especially if you are going to try to merge the Windows code path. I wonder if CreateEventSet should accept a MemoryContext argument. It seems like callers will usually want TopMemoryContext, and just being able to pass that might be more convenient than having to switch back and forth in the calling code. I wonder if there's a way to refactor this code to avoid having so much cut-and-paste duplication. When iterating over the returned events, maybe check whether events is 0 at the top of the loop and skip it forthwith if so. That's all I've got for now. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2016-03-17 09:40:08 -0400, Robert Haas wrote:
> On Thu, Mar 17, 2016 at 9:01 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> > I'll look at 0005 next, but thought I would send these comments along first.
>
> 0005: This is obviously very much WIP, but I think the overall
> direction of it is good.
> 0006: Same.
>
> I think you should use PGINVALID_SOCKET rather than -1 in various
> places in various patches in this series, especially if you are going
> to try to merge the Windows code path.

Sure.

> I wonder if CreateEventSet should accept a MemoryContext argument. It
> seems like callers will usually want TopMemoryContext, and just being
> able to pass that might be more convenient than having to switch back
> and forth in the calling code.

Makes sense.

> I wonder if there's a way to refactor this code to avoid having so
> much cut-and-paste duplication.

I guess you mean WaitEventSetWait() and WaitEventAdjust*? I've tried, and my attempt ended up looking nearly unreadable, because of the number of ifdefs. I've not found a good approach. Which is sad, because adding back select support is going to increase the duplication further :( - but it's also further away from poll etc. (different type of timestamp, entirely different way of returning events).

> When iterating over the returned events, maybe check whether events is
> 0 at the top of the loop and skip it forthwith if so.

You mean in WaitEventSetWait()? There's

else if (rc == 0)
{
    break;
}

which is the timeout case. There should never be any other case of returning 0 elements?

> That's all I've got for now.

Thanks for looking.

Greetings,

Andres Freund
Hi, On 2016-03-17 09:01:36 -0400, Robert Haas wrote: > 0001: Looking at this again, I'm no longer sure this is a bug. > Doesn't your patch just check the same conditions in the opposite > order? Yes, that's what's required > 0004: > > + * drain it everytime WaitLatchOrSocket() is used. Should the > + * pipe-buffer fill up in some scenarios - widly unlikely - we're > > every time > wildly > > Why is it wildly (or widly) unlikely? > > The rejiggering this does between what is on which element of pfds[] > appears to be unrelated to the ostensible purpose of the patch. Well, not really. We need to know when to do drainSelfPipe(); Which gets more complicated if pfds[0] is registered optionally. I'm actually considering to drop this entirely, given the much heavier rework in the WaitEvent set patch; making these details a bit obsolete. Greetings, Andres Freund
On 2016-03-17 09:01:36 -0400, Robert Haas wrote:
> > * Right now the caller has to allocate the WaitEvents he's waiting for
> > locally (likely on the stack), but we also could allocate them as part
> > of the WaitEventSet. Not sure if that'd be a benefit.
>
> I'm not seeing this. What do you mean?

Right now, to use a WaitEventSet you'd do something like

WaitEvent event;

ModifyWaitEvent(FeBeWaitSet, 0, waitfor, NULL);

WaitEventSetWait(FeBeWaitSet, 0 /* no timeout */, &event, 1);

i.e. use a WaitEvent on the stack to receive the changes. If you wanted to get more changes than just one, you could end up allocating a fair bit of stack space.

We could instead allocate the returned events as part of the event set, and return them. Either by returning a NULL terminated array, or by continuing to return the number of events as now, and additionally return the event data structure via a pointer. So the above would be

WaitEvent *events;
int nevents;

ModifyWaitEvent(FeBeWaitSet, 0, waitfor, NULL);

nevents = WaitEventSetWait(FeBeWaitSet, 0 /* no timeout */, events, 10);

for (int off = 0; off < nevents; off++)
    ; // stuff

Andres
Hi, On 2016-03-17 09:01:36 -0400, Robert Haas wrote: > 0001: Looking at this again, I'm no longer sure this is a bug. > Doesn't your patch just check the same conditions in the opposite > order? Which is important, because what's in what pfds[x] depends on wakeEvents. Folded it into a later patch; it's not harmful as long as we're only ever testing pfds[0]. > 0003: Mostly boring. But the change to win32_latch.c seems to remove > an unrelated check. Argh. > 0004: > > + * drain it everytime WaitLatchOrSocket() is used. Should the > + * pipe-buffer fill up in some scenarios - widly unlikely - we're > > every time > wildly > > Why is it wildly (or widly) unlikely? Because SetLatch (if the owner) check latch->is_set before adding to the pipe, and latch_sigusr1_handler() only writes to the pipe if the current process is in WaitLatchOrSocket's loop (via the waiting check). Expanded comment. > > + * Check again wether latch is set, the arrival of a signal/self-byte > > whether. Also not clearly related to the patch's main purpose. After the change there's no need to re-compute the current timestamp anymore, that does seem beneficial and kinda related. > /* at least one event occurred, so check masks */ > + if (FD_ISSET(selfpipe_readfd, &input_mask)) > + { > + /* There's data in the self-pipe, clear it. */ > + drainSelfPipe(); > + } > > The comment just preceding this added hunk now seems to be out of > place, and maybe inaccurate as well. Hm. Which comment are you exactly referring to? /* at least one event occurred, so check masks */ seems not to fit the bill? > I think the new code could have > a bit more detailed comment. My understanding is something like /* > Since we didn't clear the self-pipe before attempting to wait, > select() may have returned immediately even though there has been no > recent change to the state of the latch. To prevent busy-looping, we > must clear the pipe before attempting to wait again. */ Isn't that explained at the top, in /* * Check if the latch is set already. If so, leave loop immediately, * avoid blocking again. We don't attempt to report any other events * that might also be satisfied. * * If someone sets the latch between this and the poll()/select() * below, the setter will write a byte to the pipe (or signal us and * the signal handler will do that), and the poll()/select() will * return immediately. * * If there's a pending byte in the self pipe, we'll notice whenever * blocking. Only clearing the pipe in that case avoids having to * drain it every time WaitLatchOrSocket() is used. Should the * pipe-buffer fill up in some scenarios - wildly unlikely - we're * still ok, because the pipe is in nonblocking mode. ? I've updated the last paragraph to * If there's a pending byte in the self pipe, we'll notice whenever * blocking. Only clearing the pipe in that case avoids having to * drain it every time WaitLatchOrSocket() is used. Should the * pipe-buffer fill up we're still ok, because the pipe is in * nonblocking mode. It's unlikely for that to happen, because the * self pipe isn't filled unless we're blocking (waiting = true), or * from inside a signal handler in latch_sigusr1_handler(). I've also applied the same optimization to windows. Less because I found that interesting in itself, and more because it makes the WaitEventSet easier. Attached is a significantly revised version of the earlier series. 
Most importantly I have:

* Unified the window/unix latch implementation into one file (0004)
* Provided a select(2) implementation for the WaitEventSet API
* Provided a windows implementation for the WaitEventSet API
* Reduced duplication between the implementations a good bit by splitting WaitEventSetWait into WaitEventSetWait and WaitEventSetWaitBlock. Only the latter is implemented separately for each readiness primitive
* Added a backward-compatibility implementation of WaitLatchOrSocket using the WaitEventSet stuff. Less because I thought that to be terribly important, and more because it makes the patch a *lot* smaller. We collected a fair amount of latch users.

This is still not fully ready. The main remaining items are testing (the windows stuff I've only verified using cross-compiling with mingw) and documentation. I'd greatly appreciate a look.

Amit, you offered testing on windows; could you check whether 3/4/5 work? It's quite likely that I've screwed up something.

Robert, you'd mentioned on IM that you've a use-case for this somewhere around multiple FDWs. If somebody has started working on that, could you ask that person to check whether the API makes sense?

Greetings,

Andres Freund
Attachment
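[Editorial note: a rough sketch of what the backward-compatibility WaitLatchOrSocket mentioned above could look like on top of the WaitEventSet API. The FreeWaitEventSet() call, the WL_TIMEOUT handling, and the exact flag plumbing are assumptions; the posted patch's actual implementation may differ.]

/* Hedged sketch only - not the patch's code. */
int
WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
                  long timeout)
{
    WaitEventSet *set = CreateWaitEventSet(3);
    WaitEvent     event;
    int           ret;

    if (wakeEvents & WL_LATCH_SET)
        AddWaitEventToSet(set, WL_LATCH_SET, PGINVALID_SOCKET,
                          (Latch *) latch);
    if (wakeEvents & WL_POSTMASTER_DEATH)
        AddWaitEventToSet(set, WL_POSTMASTER_DEATH, PGINVALID_SOCKET, NULL);
    if (sock != PGINVALID_SOCKET &&
        (wakeEvents & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE)))
        AddWaitEventToSet(set,
                          wakeEvents & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE),
                          sock, NULL);

    if (WaitEventSetWait(set,
                         (wakeEvents & WL_TIMEOUT) ? timeout : -1,
                         &event, 1) == 0)
        ret = WL_TIMEOUT;
    else
        ret = event.events;

    FreeWaitEventSet(set);      /* assumed counterpart to CreateWaitEventSet */
    return ret;
}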
On Thu, Mar 17, 2016 at 10:57 AM, Andres Freund <andres@anarazel.de> wrote: > On 2016-03-17 09:01:36 -0400, Robert Haas wrote: >> 0001: Looking at this again, I'm no longer sure this is a bug. >> Doesn't your patch just check the same conditions in the opposite >> order? > > Yes, that's what's required I mean, they are just variables. You can check them in either order and get the same results, no? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Mar 17, 2016 at 10:53 AM, Andres Freund <andres@anarazel.de> wrote: >> I wonder if there's a way to refactor this code to avoid having so >> much cut-and-paste duplication. > > I guess you mean WaitEventSetWait() and WaitEventAdjust*? I've tried, > and my attempt ended up look nearly unreadable, because of the number of > ifdefs. I've not found a good attempt. Which is sad, because adding back > select support is going to increase the duplication further :( - but > it's also further away from poll etc. (different type of timestamp, > entirely different way of returming events). I was more thinking of stuff like this: + /* + * We expect an EPOLLHUP when the remote end is closed, but + * because we don't expect the pipe to become readable or to have + * any errors either, treat those cases as postmaster death, too. + * + * According to the select(2) man page on Linux, select(2) may + * spuriously return and report a file descriptor as readable, + * when it's not; and presumably so can epoll_wait(2). It's not + * clear that the relevant cases would ever apply to the + * postmaster pipe, but since the consequences of falsely + * returning WL_POSTMASTER_DEATH could be pretty unpleasant, we + * take the trouble to positively verify EOF with + * PostmasterIsAlive(). + */ >> 0 at the top of the loop and skip it forthwith if so. > > You mean in WaitEventSetWait()? There's > else if (rc == 0) > { > break; > } > which is the timeout case. There should never be any other case of > returning 0 elements? No, I meant if (cur_event->events == 0) continue; -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Mar 17, 2016 at 11:17 AM, Andres Freund <andres@anarazel.de> wrote: > Right now, do use a WaitEventSet you'd do something like > WaitEvent event; > > ModifyWaitEvent(FeBeWaitSet, 0, waitfor, NULL); > > WaitEventSetWait(FeBeWaitSet, 0 /* no timeout */, &event, 1); > > i.e. use a WaitEvent on the stack to receive the changes. If you wanted > to get more changes than just one, you could end up allocating a fair > bit of stack space. > > We could instead allocate the returned events as part of the event set, > and return them. Either by returning a NULL terminated array, or by > continuing to return the number of events as now, and additionally > return the event data structure via a pointer. > > So the above would be > > WaitEvent *events; > int nevents; > > ModifyWaitEvent(FeBeWaitSet, 0, waitfor, NULL); > > nevents = WaitEventSetWait(FeBeWaitSet, 0 /* no timeout */, events, 10); > > for (int off = 0; off <= nevents; nevents++) > ; // stuff I don't see this as being particularly better. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Mar 18, 2016 at 1:34 PM, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2016-03-17 09:01:36 -0400, Robert Haas wrote:
> > 0001: Looking at this again, I'm no longer sure this is a bug.
> > Doesn't your patch just check the same conditions in the opposite
> > order?
>
> Which is important, because what's in what pfds[x] depends on
> wakeEvents. Folded it into a later patch; it's not harmful as long as
> we're only ever testing pfds[0].
>
>
> > 0003: Mostly boring. But the change to win32_latch.c seems to remove
> > an unrelated check.
>
> Argh.
>
+ * from inside a signal handler in latch_sigusr1_handler().
*
* Note: we assume that the kernel calls involved in drainSelfPipe()
* and SetLatch() will provide adequate synchronization on machines
* with weak memory ordering, so that we cannot miss seeing is_set if
* the signal byte is already in the pipe when we drain it.
*/
- drainSelfPipe();
-
The above part of the comment looks redundant after this patch. I have done some tests on Windows with the 0003 patch, which includes running the regressions (vcregress check), and it passes. I will look into it once again tomorrow and share if I find anything wrong with it, but feel free to proceed if you want.
On 2016-03-18 05:56:41 -0400, Robert Haas wrote:
> >> 0 at the top of the loop and skip it forthwith if so.
> >
> > You mean in WaitEventSetWait()? There's
> > else if (rc == 0)
> > {
> > break;
> > }
> > which is the timeout case. There should never be any other case of
> > returning 0 elements?
>
> No, I meant if (cur_event->events == 0) continue;

I'm not following. Why would there be an event with an empty event mask? Ok, you can disable all notifications for a socket using ModifyWaitEvent(), but that's not particularly common, right? At least for epoll, it'd not play a role anyway, since epoll_wait() will actually return pointers to the elements we're waiting on; for windows we get the offset in ->handles. I guess we could do so in the select/poll case, but adding another if for something infrequent doesn't strike me as a great benefit.

- Andres
On Fri, Mar 18, 2016 at 1:53 PM, Andres Freund <andres@anarazel.de> wrote: > On 2016-03-18 05:56:41 -0400, Robert Haas wrote: >> >> 0 at the top of the loop and skip it forthwith if so. >> > >> > You mean in WaitEventSetWait()? There's >> > else if (rc == 0) >> > { >> > break; >> > } >> > which is the timeout case. There should never be any other case of >> > returning 0 elements? >> >> No, I meant if (cur_event->events == 0) continue; > > I'm not following. Why would there be an event without an empty event > mask? Ok, you can disable all notifications for a socket using > ModifyWaitEvent(), but that's not particularly common, right? At least > for epoll, it'd not play a role anyway, since epoll_wait() will actually > return pointers to the elements we're waiting on; for windows we get the > offset in ->handles. I guess we could do so in the select/poll case, > but adding another if for something infrequent doesn't strike me as a > great benefit. No, I mean it should be quite common for a particular fd to have no events reported. If we're polling on 100 fds and 1 of them is active and the other 99 are just sitting there, we want to skip over the other 99 as quickly as possible. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2016-03-18 14:00:58 -0400, Robert Haas wrote: > No, I mean it should be quite common for a particular fd to have no > events reported. If we're polling on 100 fds and 1 of them is active > and the other 99 are just sitting there, we want to skip over the > other 99 as quickly as possible. cur_events points to the event that's registered with the set, not the one that's "returned" or "modified" by poll/select, that's where the confusion is originating from. I'll add a fastpath. Greetings, Andres Freund
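[Editorial note: for concreteness, the kind of fast path being agreed on here might look roughly like the fragment below inside the poll()-backed result scan of a hypothetical WaitEventSetWait. The set->pollfds / set->events layout and the variable names are guesses at the WIP patch's internals, not its actual code.]

/* Hypothetical result scan with an early skip for quiescent descriptors. */
for (i = 0; i < set->nevents && returned < nevents; i++)
{
    WaitEvent *cur_event = &set->events[i];
    short      revents = set->pollfds[i].revents;

    /* Fast path: this registered event didn't fire, skip it immediately. */
    if (revents == 0)
        continue;

    occurred_events[returned].pos = cur_event->pos;
    occurred_events[returned].fd = cur_event->fd;
    occurred_events[returned].events = 0;

    if ((cur_event->events & WL_SOCKET_READABLE) &&
        (revents & (POLLIN | POLLHUP | POLLERR)))
        occurred_events[returned].events |= WL_SOCKET_READABLE;
    if ((cur_event->events & WL_SOCKET_WRITEABLE) && (revents & POLLOUT))
        occurred_events[returned].events |= WL_SOCKET_WRITEABLE;

    if (occurred_events[returned].events != 0)
        returned++;
}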
On 2016-03-18 20:14:07 +0530, Amit Kapila wrote: > + * from inside a signal handler in latch_sigusr1_handler(). > > * > > * Note: we assume that the kernel calls involved in drainSelfPipe() > > * and SetLatch() will provide adequate synchronization on machines > > * with weak memory ordering, so that we cannot miss seeing is_set if > > * the signal byte is already in the pipe when we drain it. > > */ > > - drainSelfPipe(); > > - > > > Above part of comment looks redundant after this patch. Don't think so. Moving it closer to the drainSelfPipe() call might be neat, but since there's several callsites... > I have done some > tests on Windows with 0003 patch which includes running the regressions > (vcregress check) and it passes. Will look into it tomorrow once again and > share if I find anything wrong with it, but feel to proceed if you want. Thanks for the testing thus far! Let's see what the buildfarm has to say. Andres
Andres Freund wrote: > From 1d444b0855dbf65d66d73beb647b772fff3404c8 Mon Sep 17 00:00:00 2001 > From: Andres Freund <andres@anarazel.de> > Date: Fri, 18 Mar 2016 00:52:07 -0700 > Subject: [PATCH 4/5] Combine win32 and unix latch implementations. > > Previously latches for windows and unix had been implemented in > different files. The next patch in this series will introduce an > expanded wait infrastructure, keeping the implementation separate would > introduce too much duplication. > > This basically just moves the functions, without too much change. The > reason to keep this separate is that it allows blame to continue working > a little less badly; and to make review a tiny bit easier. This seems a reasonable change, but I think that the use of WIN32 vs. LATCH_USE_WIN32 is pretty confusing. In particular, LATCH_USE_WIN32 isn't actually used for anything ... I suppose we don't care since this is a temporary state of affairs only? In 0005: In latch.c you typedef WaitEventSet, but the typedef already appears in latch.h. You need only declare the struct in latch.c, without typedef'ing. Haven't really reviewed anything here yet, just skimming ATM. Having so many #ifdefs all over the place in this file looks really bad, but I guess there's no way around that because this is very platform-specific. I hope pgindent doesn't choke on it. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2016-03-18 15:49:51 -0300, Alvaro Herrera wrote: > This seems a reasonable change, but I think that the use of WIN32 vs. > LATCH_USE_WIN32 is pretty confusing. In particular, LATCH_USE_WIN32 > isn't actually used for anything ... I suppose we don't care since this > is a temporary state of affairs only? Hm, I guess we could make some more of those use LATCH_USE_WIN32. There's essentially two axes here a) latch notification method (self-pipe vs windows events) b) readiness notification (epoll vs poll vs select vs WaitForMultipleObjects). > In 0005: In latch.c you typedef WaitEventSet, but the typedef already > appears in latch.h. You need only declare the struct in latch.c, > without typedef'ing. Good catch. It's even important, some compilers choke on that. > Haven't really reviewed anything here yet, just skimming ATM. Having so > many #ifdefs all over the place in this file looks really bad, but I > guess there's no way around that because this is very platform-specific. I think it's hard to further reduce the number of ifdefs, but if you have ideas... > I hope pgindent doesn't choke on it. The patch is pgindented (I'd personally never decrease indentation just to fit a line into 79 chars, as pgindent does...). Thanks for looking! Greetings, Andres Freund
On Sat, Mar 19, 2016 at 12:00 AM, Andres Freund <andres@anarazel.de> wrote:
>
> On 2016-03-18 20:14:07 +0530, Amit Kapila wrote:
>
> > I have done some
> > tests on Windows with 0003 patch which includes running the regressions
> > (vcregress check) and it passes. Will look into it tomorrow once again and
> > share if I find anything wrong with it, but feel to proceed if you want.
>
> Thanks for the testing thus far! Let's see what the buildfarm has to
> say.
>
Won't the new code need to ensure that ResetEvent(latchevent) gets called in case WaitForMultipleObjects() comes out when both pgwin32_signal_event and latchevent are signalled at the same time?
On March 18, 2016 11:32:53 PM PDT, Amit Kapila <amit.kapila16@gmail.com> wrote:
>On Sat, Mar 19, 2016 at 12:00 AM, Andres Freund <andres@anarazel.de> wrote:
>> On 2016-03-18 20:14:07 +0530, Amit Kapila wrote:
>> > I have done some tests on Windows with 0003 patch which includes
>> > running the regressions (vcregress check) and it passes. Will look
>> > into it tomorrow once again and share if I find anything wrong with
>> > it, but feel to proceed if you want.
>>
>> Thanks for the testing thus far! Let's see what the buildfarm has to
>> say.
>
>Won't the new code needs to ensure that ResetEvent(latchevent) should get
>called in case WaitForMultipleObjects() comes out when both
>pgwin32_signal_event and latchevent are signalled at the same time?

WaitForMultiple only reports the readiness of one event at a time, no?

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Sat, Mar 19, 2016 at 12:14 PM, Andres Freund <andres@anarazel.de> wrote:
>
>
>
> On March 18, 2016 11:32:53 PM PDT, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >On Sat, Mar 19, 2016 at 12:00 AM, Andres Freund <andres@anarazel.de>
> >wrote:
> >>
> >> On 2016-03-18 20:14:07 +0530, Amit Kapila wrote:
> >>
> >> > I have done some
> >> > tests on Windows with 0003 patch which includes running the
> >regressions
> >> > (vcregress check) and it passes. Will look into it tomorrow once
> >again
> >and
> >> > share if I find anything wrong with it, but feel to proceed if you
> >want.
> >>
> >> Thanks for the testing thus far! Let's see what the buildfarm has to
> >> say.
> >>
> >
> >Won't the new code needs to ensure that ResetEvent(latchevent) should
> >get
> >called in case WaitForMultipleObjects() comes out when both
> >pgwin32_signal_event and latchevent are signalled at the same time?
>
> WaitForMultiple only reports the readiness of on event at a time, no?
>
I don't think so; please read link [1] with a focus on the paragraph below, which states how it reports the readiness or signaled state when multiple objects become signaled.
"When bWaitAll is FALSE, this function checks the handles in the array in order starting with index 0, until one of the objects is signaled. If multiple objects become signaled, the function returns the index of the first handle in the array whose object was signaled."
On March 18, 2016 11:52:08 PM PDT, Amit Kapila <amit.kapila16@gmail.com> wrote:
>On Sat, Mar 19, 2016 at 12:14 PM, Andres Freund <andres@anarazel.de> wrote:
>>
>> On March 18, 2016 11:32:53 PM PDT, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> >On Sat, Mar 19, 2016 at 12:00 AM, Andres Freund <andres@anarazel.de>
>> >wrote:
>> >> On 2016-03-18 20:14:07 +0530, Amit Kapila wrote:
>> >> > I have done some tests on Windows with 0003 patch which includes
>> >> > running the regressions (vcregress check) and it passes. Will look
>> >> > into it tomorrow once again and share if I find anything wrong with
>> >> > it, but feel to proceed if you want.
>> >>
>> >> Thanks for the testing thus far! Let's see what the buildfarm has to
>> >> say.
>> >
>> >Won't the new code needs to ensure that ResetEvent(latchevent) should get
>> >called in case WaitForMultipleObjects() comes out when both
>> >pgwin32_signal_event and latchevent are signalled at the same time?
>>
>> WaitForMultiple only reports the readiness of on event at a time, no?
>
>I don't think so, please read link [1] with a focus on below paragraph
>which states how it reports the readiness or signaled state when multiple
>objects become signaled.
>
>"When *bWaitAll* is *FALSE*, this function checks the handles in the array
>in order starting with index 0, until one of the objects is signaled. If
>multiple objects become signaled, the function returns the index of the
>first handle in the array whose object was signaled."

I think that's OK. We'll just get the next event the next time we call waitfor*. It's also not different to the way the routine is currently handling normal socket and postmaster events, no? It'd be absurdly broken if it handled edge triggered events like FD_CLOSE in a way you can't discover.

Andres

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Sat, Mar 19, 2016 at 12:40 PM, Andres Freund <andres@anarazel.de> wrote:
>
> On March 18, 2016 11:52:08 PM PDT, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >> >Won't the new code needs to ensure that ResetEvent(latchevent)
> >should
> >> >get
> >> >called in case WaitForMultipleObjects() comes out when both
> >> >pgwin32_signal_event and latchevent are signalled at the same time?
> >>
> >> WaitForMultiple only reports the readiness of on event at a time, no?
> >>
> >
> >I don't think so, please read link [1] with a focus on below paragraph
> >which states how it reports the readiness or signaled state when
> >multiple
> >objects become signaled.
> >
> >"When *bWaitAll* is *FALSE*, this function checks the handles in the
> >array
> >in order starting with index 0, until one of the objects is signaled.
> >If
> >multiple objects become signaled, the function returns the index of the
> >first handle in the array whose object was signaled."
>
> I think that's OK. We'll just get the next event the next time we call waitfor*. It's also not different to the way the routine is currently handling normal socket and postmaster events, no?
>
I think the primary difference for the socket and postmaster events, as compared to the latch event, is that with them we never start waiting with the wait event already in a signalled state. For the socket event, WaitLatchOrSocket closes the event at the end and creates it again before entering the wait loop. Even if we don't Reset the latch event, I could not see any major problem apart from maybe spurious wake-ups in a few cases (as we haven't reset the event to the non-signalled state for the latch event before entering the wait, it can just return immediately).
On Fri, Mar 18, 2016 at 1:34 PM, Andres Freund <andres@anarazel.de> wrote:
>
>
> Attached is a significantly revised version of the earlier series. Most
> importantly I have:
> * Unified the window/unix latch implementation into one file (0004)
>
After applying patch 0004* on HEAD, using command patch -p1 < <path_of_patch>, I am getting a build failure:
c1 : fatal error C1083: Cannot open source file: 'src/backend/storage/ipc/latch.c': No such file or directory
I think it could not rename port/unix_latch.c => storage/ipc/latch.c. I have tried with git apply, but no success. Am I doing something wrong?
One minor suggestion about patch:
+#ifndef WIN32
void
latch_sigusr1_handler(void)
{
if (waiting)
sendSelfPipeByte();
}
+#endif /* !WIN32 */
/* Send one byte to the self-pipe, to wake up WaitLatch */
+#ifndef WIN32
static void
sendSelfPipeByte(void)
Instead of individually defining these functions under #ifndef WIN32, isn't it better to combine them all, as they are at the end of the file?
On 2016-03-19 15:43:27 +0530, Amit Kapila wrote: > On Sat, Mar 19, 2016 at 12:40 PM, Andres Freund <andres@anarazel.de> wrote: > > > > On March 18, 2016 11:52:08 PM PDT, Amit Kapila <amit.kapila16@gmail.com> > wrote: > > >> >Won't the new code needs to ensure that ResetEvent(latchevent) > > >should > > >> >get > > >> >called in case WaitForMultipleObjects() comes out when both > > >> >pgwin32_signal_event and latchevent are signalled at the same time? > > >> WaitForMultiple only reports the readiness of on event at a time, no? > > >> > > > > > >I don't think so, please read link [1] with a focus on below paragraph > > >which states how it reports the readiness or signaled state when > > >multiple > > >objects become signaled. > > > > > >"When *bWaitAll* is *FALSE*, this function checks the handles in the > > >array > > >in order starting with index 0, until one of the objects is signaled. > > >If > > >multiple objects become signaled, the function returns the index of the > > >first handle in the array whose object was signaled." I think this is just incredibly bad documentation. See https://blogs.msdn.microsoft.com/oldnewthing/20150409-00/?p=44273 (Raymond Chen can be considered an authority here imo). Andres
On 2016-03-19 16:44:49 +0530, Amit Kapila wrote: > On Fri, Mar 18, 2016 at 1:34 PM, Andres Freund <andres@anarazel.de> wrote: > > > > > > Attached is a significantly revised version of the earlier series. Most > > importantly I have: > > * Unified the window/unix latch implementation into one file (0004) > > > > After applying patch 0004* on HEAD, using command patch -p1 < > <path_of_patch>, I am getting build failure: > > c1 : fatal error C1083: Cannot open source file: > 'src/backend/storage/ipc/latch.c': No such file or directory > > I think it could not rename port/unix_latch.c => storage/ipc/latch.c. I > have tried with git apply, but no success. Am I doing something wrong? You skipped applying 0003. I'll send an updated version - with all the docs and such - in the next hours. > One minor suggestion about patch: > > +#ifndef WIN32 > void > latch_sigusr1_handler(void) > { > if (waiting) > sendSelfPipeByte(); > } > +#endif /* !WIN32 */ > > /* Send one byte to the self-pipe, to wake up WaitLatch */ > +#ifndef WIN32 > static void > sendSelfPipeByte(void) > > > Instead of individually defining these functions under #ifndef WIN32, isn't > it better to combine them all as they are at end of file. They're all at the end of the file. I just found it easier to reason about if both #if and #endif are visible in one screen full of code. Don't feel super strong about it tho. Greetings, Andres Freund
On 2016-03-19 18:45:36 -0700, Andres Freund wrote:
> On 2016-03-19 16:44:49 +0530, Amit Kapila wrote:
> > On Fri, Mar 18, 2016 at 1:34 PM, Andres Freund <andres@anarazel.de> wrote:
> > >
> > > Attached is a significantly revised version of the earlier series. Most
> > > importantly I have:
> > > * Unified the window/unix latch implementation into one file (0004)
> >
> > After applying patch 0004* on HEAD, using command patch -p1 <
> > <path_of_patch>, I am getting build failure:
> >
> > c1 : fatal error C1083: Cannot open source file:
> > 'src/backend/storage/ipc/latch.c': No such file or directory
> >
> > I think it could not rename port/unix_latch.c => storage/ipc/latch.c. I
> > have tried with git apply, but no success. Am I doing something wrong?
>
> You skipped applying 0003.
>
> I'll send an updated version - with all the docs and such - in the next
> hours.

Here we go. I think this is getting pretty close to being committable, minus a bit of testing of edge cases on unix (postmaster death, disconnecting clients in various ways (especially with COPY)) and windows (uh, does it even work at all?).

There are no large code changes in this revision, mainly some code polishing and a good bit more comment improvements.

Regards,

Andres
Attachment
On Sun, Mar 20, 2016 at 5:14 PM, Andres Freund <andres@anarazel.de> wrote: > On 2016-03-19 18:45:36 -0700, Andres Freund wrote: >> On 2016-03-19 16:44:49 +0530, Amit Kapila wrote: >> > On Fri, Mar 18, 2016 at 1:34 PM, Andres Freund <andres@anarazel.de> wrote: >> > > >> > > >> > > Attached is a significantly revised version of the earlier series. Most >> > > importantly I have: >> > > * Unified the window/unix latch implementation into one file (0004) >> > > >> > >> > After applying patch 0004* on HEAD, using command patch -p1 < >> > <path_of_patch>, I am getting build failure: >> > >> > c1 : fatal error C1083: Cannot open source file: >> > 'src/backend/storage/ipc/latch.c': No such file or directory >> > >> > I think it could not rename port/unix_latch.c => storage/ipc/latch.c. I >> > have tried with git apply, but no success. Am I doing something wrong? >> >> You skipped applying 0003. >> >> I'll send an updated version - with all the docs and such - in the next >> hours. > > Here we go. I think this is getting pretty clos eto being committable, > minus a bit of testing edge cases on unix (postmaster death, > disconnecting clients in various ways (especially with COPY)) and > windows (uh, does it even work at all?). > > There's no large code changes in this revision, mainly some code > polishing and a good bit more comment improvements. Hi I couldn't get the second patch to apply for some reason, but I have been trying out your "latch" branch on some different OSs and porting some code that does a bunch of waiting on many sockets over to this API to try it out. One thing that I want to do but can't with this interface is remove an fd from the set. I can AddWaitEventToSet returning a position, and I can ModifyWait to provide new event mask by position including zero mask, I can't actually remove the fd (for example to avoid future error events that can't be masked, or to allow that fd to be closed and perhaps allow that fd number to coincidentally be readded later, and just generally to free the slot). There is an underlying way to remove an fd from a set with poll (sort of), epoll, kqueue. (Not sure about Windows. But surely...). I wonder if there should be RemoveWaitEventFromSet(set, position) which recycles event slots, sticking them on a freelist (and setting corresponding pollfd structs' fd to -1). I wonder if it would be useful to reallocate the arrays as needed so there isn't really a hard limit to the number of things you add, just an initial size. A couple of typos: + * Wait using linux' epoll_wait(2). linux's + /* + * If neither the event mask nor the associated latch changes, return + * early. That's an important optimization for some sockets, were + * ModifyWaitEvent is frequently used to switch from waiting for reads to + * waiting on writes. + */ s/were/where/ + /* + * Even though unused, we also poss epoll_ev as the data argument if + * EPOLL_CTL_DELETE is passed as action. There used to be an epoll bug + * requiring that, and acutally it makes the code simpler... + */ s/poss/pass/ s/EPOLL_CTL_DELETE/EPOLL_CTL_DEL/ s/acutally/actually/ There is no code passing EPOLL_CTL_DEL, but maybe this comment is a clue that you have already implemented remove on some other branch... -- Thomas Munro http://www.enterprisedb.com
On 2016-03-21 01:31:30 +1300, Thomas Munro wrote:
> I couldn't get the second patch to apply for some reason,

Weird? Even after applying the first one first?

> but I have been trying out your "latch" branch on some different OSs
> and porting some code that does a bunch of waiting on many sockets
> over to this API to try it out.

> One thing that I want to do but can't with this interface is remove an
> fd from the set.

Yea, I didn't see an in-core need for that, so I didn't want to add that yet.

Other than that, how did things go?

> I can AddWaitEventToSet returning a position, and I
> can ModifyWait to provide new event mask by position including zero
> mask, I can't actually remove the fd (for example to avoid future
> error events that can't be masked, or to allow that fd to be closed
> and perhaps allow that fd number to coincidentally be readded later,
> and just generally to free the slot). There is an underlying way to
> remove an fd from a set with poll (sort of), epoll, kqueue. (Not sure
> about Windows. But surely...). I wonder if there should be
> RemoveWaitEventFromSet(set, position) which recycles event slots,
> sticking them on a freelist (and setting corresponding pollfd structs'
> fd to -1).

I'm inclined to think that if we add this, we should memmove everything after the removed element and adjust the offsets. As we e.g. pass the entire pollfd array to poll() at once, not having gaps will be more important.

What's the use case where you need that?

> I wonder if it would be useful to reallocate the arrays as needed so
> there isn't really a hard limit to the number of things you add, just
> an initial size.

Doubles the amount of palloc()s required. As lots of these sets are actually going to be very short-lived (a single WaitLatch call), that's something I'd rather avoid. Although I guess you could just allocate separate arrays at that moment.

> A couple of typos:
>
> + * Wait using linux' epoll_wait(2).
>
> linux's
>
> + /*
> + * If neither the event mask nor the associated latch changes, return
> + * early. That's an important optimization for some sockets, were
> + * ModifyWaitEvent is frequently used to switch from waiting for reads to
> + * waiting on writes.
> + */
>
> s/were/where/
>
> + /*
> + * Even though unused, we also poss epoll_ev as the data argument if
> + * EPOLL_CTL_DELETE is passed as action. There used to be an epoll bug
> + * requiring that, and acutally it makes the code simpler...
> + */
>
> s/poss/pass/
> s/EPOLL_CTL_DELETE/EPOLL_CTL_DEL/
> s/acutally/actually/

Thanks!

> There is no code passing EPOLL_CTL_DEL, but maybe this comment is a
> clue that you have already implemented remove on some other branch...

No, I've not, sorry...

Andres
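[Editorial note: to sketch what that compaction-based approach might look like, purely illustratively: the set->events, set->pollfds and nevents members are guesses at the WIP patch's internals, and no RemoveWaitEventFromSet exists in the posted patches.]

/* Hypothetical remove-by-compaction, along the lines suggested above. */
void
RemoveWaitEventFromSet(WaitEventSet *set, int pos)
{
    int ntail;
    int i;

    Assert(pos >= 0 && pos < set->nevents);

    ntail = set->nevents - pos - 1;
    if (ntail > 0)
    {
        /* close the gap in both the event array and the pollfd array */
        memmove(&set->events[pos], &set->events[pos + 1],
                ntail * sizeof(WaitEvent));
        memmove(&set->pollfds[pos], &set->pollfds[pos + 1],
                ntail * sizeof(struct pollfd));

        /* keep each event's stored position in sync with its new slot */
        for (i = pos; i < set->nevents - 1; i++)
            set->events[i].pos = i;
    }

    set->nevents--;
}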
On Mon, Mar 21, 2016 at 3:46 AM, Andres Freund <andres@anarazel.de> wrote: > On 2016-03-21 01:31:30 +1300, Thomas Munro wrote: >> I couldn't get the second patch to apply for some reason, > > Weird? Even efter appying the first one first? Ah, I was using patch -p1. I needed to use git am, which understands how to rename stuff. -- Thomas Munro http://www.enterprisedb.com
On Mon, Mar 21, 2016 at 3:46 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-03-21 01:31:30 +1300, Thomas Munro wrote:
>> I couldn't get the second patch to apply for some reason,
>
> Weird? Even after applying the first one first?
>
>> but I have been trying out your "latch" branch on some different OSs
>> and porting some code that does a bunch of waiting on many sockets
>> over to this API to try it out.
>
>> One thing that I want to do but can't with this interface is remove an
>> fd from the set.
>
> Yea, I didn't see an in-core need for that, so I didn't want to add
> that yet.
>
> Other than that, how did things go?

So far, so good. No surprises.

>> AddWaitEventToSet returns a position, and ModifyWaitEvent lets me
>> supply a new event mask by position (including a zero mask), but I
>> can't actually remove the fd (for example to avoid future error events
>> that can't be masked, to allow that fd to be closed and perhaps allow
>> that fd number to coincidentally be re-added later, and generally to
>> free the slot). There is an underlying way to remove an fd from a set
>> with poll (sort of), epoll and kqueue. (Not sure about Windows. But
>> surely...) I wonder if there should be RemoveWaitEventFromSet(set,
>> position), which recycles event slots, sticking them on a freelist
>> (and setting the corresponding pollfd structs' fd to -1).
>
> I'm inclined to think that if we add this, we memmove everything later
> and adjust the offsets. As we e.g. pass the entire pollfd array to
> poll() at once, not having gaps will be more important.
>
> What's the use case where you need that?

I was experimenting with a change to Append so that it could deal with asynchronous subplans, and in particular support asynchronous foreign scans. See the attached patch, which applies on top of your latch branch (or your two patches above, I assume) and does that. It is not a very ambitious form of asynchrony and there are opportunities to do much more in this area (about which more later in some future thread), but it is some relevant educational code I had to hand, so I ported it to your new API as a way to try the API out.

The contract that I invented here is that an async-aware parent node can ask any child node "are you ready?" and get back various answers, including an fd which means "please don't call ExecProcNode until this fd is ready to read". But after ExecProcNode is called, the fd must not be accessed again (the subplan has the right to close it, return a different one next time, etc.), so it must not appear in any WaitEventSet waited on after that.

As you can see, in append_next_async_wait, it therefore needs to create a new WaitEventSet every time it needs to wait, which makes it feel more like select() than epoll(). Ideally it would have just one WaitEventSet for the lifetime of the Append node, and just add and remove fds as required.

>> I wonder if it would be useful to reallocate the arrays as needed so
>> there isn't really a hard limit to the number of things you add, just
>> an initial size.
>
> Doubles the amount of palloc()s required. As lots of these sets are
> actually going to be very short-lived (a single WaitLatch call), that's
> something I'd rather avoid. Although I guess you could just allocate
> separate arrays at that moment.

On the subject of short-lived and frequent calls to WaitLatchOrSocket, I wonder if there would be some benefit in reusing a statically allocated WaitEventSet for that. That would become possible if you could add and remove the latch and socket as discussed, with an opportunity to skip the modification work completely if the reusable WaitEventSet already happens to have the right stuff in it from the last WaitLatchOrSocket call. Or maybe the hot wait loops should simply be rewritten to reuse a WaitEventSet explicitly so they can manage that...

Some other assorted thoughts:

* It looks like epoll (and maybe kqueue?) can associate user data with an event and give it back to you; if you have a mapping between fds and some other object (in the case of the attached patch, subplan nodes that are ready to be pulled), would it be possible and useful to expose that (and simulate it where needed) rather than the caller having to maintain associative data structures (see the linear search in my patch)?

* I would be interested in writing a kqueue implementation of this for *BSD (and MacOSX?) at some point if someone doesn't beat me to it.

* It would be very cool to get some view into which WaitEventSetWait call backends are waiting in from a system view, if it could be done cheaply enough. A name/ID for the call site, and an indication of which latch and how many fds... or something.

--
Thomas Munro
http://www.enterprisedb.com
Attachment
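[Editorial aside] For readers trying to follow the API being reviewed in this thread, the following is a minimal sketch of one WaitEventSet round trip. It is not taken from the patch; the calls and flags (CreateWaitEventSet, AddWaitEventToSet, WaitEventSetWait, FreeWaitEventSet, WL_LATCH_SET, WL_POSTMASTER_DEATH, WL_SOCKET_READABLE) follow the API as it was eventually committed for 9.6, so any detail that differs from the patch version under discussion should be read as an assumption.

```c
/*
 * Hedged sketch of the WaitEventSet API discussed in this thread; the
 * signatures approximate the 9.6-era API and may differ from the patch
 * version being reviewed above.
 */
#include "postgres.h"

#include "miscadmin.h"
#include "storage/latch.h"

static void
wait_for_socket_or_latch(pgsocket sock)
{
	WaitEventSet *set;
	WaitEvent	event;
	int			nevents;

	/* Room for three registrations: latch, postmaster death, one socket. */
	set = CreateWaitEventSet(CurrentMemoryContext, 3);

	AddWaitEventToSet(set, WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
	AddWaitEventToSet(set, WL_POSTMASTER_DEATH, PGINVALID_SOCKET, NULL, NULL);
	AddWaitEventToSet(set, WL_SOCKET_READABLE, sock, NULL, NULL);

	/* Block until one event fires; a timeout of -1 means wait forever. */
	nevents = WaitEventSetWait(set, -1, &event, 1);

	if (nevents > 0 && (event.events & WL_LATCH_SET))
		ResetLatch(MyLatch);
	else if (nevents > 0 && (event.events & WL_SOCKET_READABLE))
		elog(LOG, "socket is readable (registered at slot %d)", event.pos);

	FreeWaitEventSet(set);
}
```

The RemoveWaitEventFromSet idea discussed above targets exactly the step this sketch sidesteps: once a socket is registered, the only options shown here are changing its mask with ModifyWaitEvent or throwing the whole set away.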
On Sun, Mar 20, 2016 at 7:13 AM, Andres Freund <andres@anarazel.de> wrote:
>
> On 2016-03-19 15:43:27 +0530, Amit Kapila wrote:
> > On Sat, Mar 19, 2016 at 12:40 PM, Andres Freund <andres@anarazel.de> wrote:
> > >
> > > On March 18, 2016 11:52:08 PM PDT, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >> >Won't the new code need to ensure that ResetEvent(latchevent) should
> > > >> >get called in case WaitForMultipleObjects() comes out when both
> > > >> >pgwin32_signal_event and latchevent are signalled at the same time?
>
> > > >> WaitForMultiple only reports the readiness of one event at a time, no?
> > > >>
> > > >
> > > >I don't think so; please read link [1] with a focus on the below
> > > >paragraph, which states how it reports the readiness or signaled state
> > > >when multiple objects become signaled.
> > > >
> > > >"When *bWaitAll* is *FALSE*, this function checks the handles in the
> > > >array in order starting with index 0, until one of the objects is
> > > >signaled. If multiple objects become signaled, the function returns the
> > > >index of the first handle in the array whose object was signaled."
>
> I think this is just incredibly bad documentation. See
> https://blogs.msdn.microsoft.com/oldnewthing/20150409-00/?p=44273
> (Raymond Chen can be considered an authority here imo).
>
The article you pointed to justifies that the way ResetEvent is done by the patch is correct. I am not sure, but you can weigh whether there is a need for a comment, so that if we want to enhance this part of the code (or write something similar) in future, we don't need to rediscover this fact.
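[Editorial aside] To make the Windows point above concrete, here is a hedged, standalone sketch (not the PostgreSQL latch code) of the behaviour the quoted MSDN paragraph and the linked article describe: with bWaitAll = FALSE, WaitForMultipleObjects() reports only the lowest-index signalled handle, an event signalled at the same time simply remains signalled and is returned by the next call, so resetting only the event actually consumed is sufficient, as I read the discussion.

```c
/*
 * Standalone Win32 illustration; not PostgreSQL source.  Shows that with
 * bWaitAll = FALSE only one signalled handle is reported per call, and an
 * event signalled simultaneously is picked up on the next iteration.
 */
#include <windows.h>
#include <stdio.h>

static void
wait_loop(HANDLE signal_event, HANDLE latch_event)
{
    HANDLE  handles[2];

    handles[0] = signal_event;   /* stand-in for pgwin32_signal_event */
    handles[1] = latch_event;    /* stand-in for latchevent */

    for (;;)                     /* illustrative wait loop, runs forever */
    {
        DWORD   rc = WaitForMultipleObjects(2, handles, FALSE, INFINITE);

        if (rc == WAIT_OBJECT_0)
        {
            /* Signal event reported; the latch event, if also signalled,
             * stays signalled and is returned by the next call. */
            printf("signal event fired\n");
        }
        else if (rc == WAIT_OBJECT_0 + 1)
        {
            /* Reset only the event we are consuming here. */
            ResetEvent(latch_event);
            printf("latch event fired\n");
        }
        else if (rc == WAIT_FAILED)
        {
            fprintf(stderr, "wait failed: %lu\n",
                    (unsigned long) GetLastError());
            break;
        }
    }
}
```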
On March 21, 2016 5:12:38 AM GMT+01:00, Amit Kapila <amit.kapila16@gmail.com> wrote:
>The article you pointed to justifies that the way ResetEvent is done by
>the patch is correct. I am not sure, but you can weigh whether there is
>a need for a comment, so that if we want to enhance this part of the
>code (or write something similar) in future, we don't need to
>rediscover this fact.

I've added a reference in a comment.

Did you have a chance of running the patched versions on windows?

I plan to push this sometime today, so I can get on to some performance patches I was planning to look into committing.

Andres

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Mon, Mar 21, 2016 at 10:21 AM, Andres Freund <andres@anarazel.de> wrote:
>
> On March 21, 2016 5:12:38 AM GMT+01:00, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> >The article you pointed to justifies that the way ResetEvent is done by
> >the patch is correct. I am not sure, but you can weigh whether there is
> >a need for a comment, so that if we want to enhance this part of the
> >code (or write something similar) in future, we don't need to
> >rediscover this fact.
>
> I've added a reference in a comment.
>
> Did you have a chance of running the patched versions on windows?
>

I am planning to do it in the next few hours.

> I plan to push this sometime today, so I can get on to some performance
> patches I was planning to look into committing.

Have we done any testing to ensure that it actually mitigates the performance degradation caused by commit ac1d794? I wanted to do that, but unfortunately the high-end machine on which this problem is reproducible is down.
Hi,

On 2016-03-21 11:52:43 +1300, Thomas Munro wrote:
> The contract that I invented here is that an async-aware parent node
> can ask any child node "are you ready?" and get back various answers,
> including an fd which means "please don't call ExecProcNode until this
> fd is ready to read". But after ExecProcNode is called, the fd must
> not be accessed again (the subplan has the right to close it, return a
> different one next time, etc.), so it must not appear in any
> WaitEventSet waited on after that.

Hm, why did you choose to go with that contract? Why do children need to switch fds and such?

> As you can see, in append_next_async_wait, it therefore needs to
> create a new WaitEventSet every time it needs to wait, which makes it
> feel more like select() than epoll(). Ideally it would have just one
> WaitEventSet for the lifetime of the Append node, and just add and
> remove fds as required.

Yes, I see. I think that'd be less efficient than not manipulating the set all the time, but still a lot more efficient than adding/removing the postmaster death latch every round.

Anyway, I want to commit this ASAP, so I can get on to some patches in the CF. But I'd welcome you playing with adding that ability.

> On the subject of short-lived and frequent calls to WaitLatchOrSocket,
> I wonder if there would be some benefit in reusing a statically
> allocated WaitEventSet for that.

Wondered the same. It looks like there'd need to be a number of them (just latch, latch + postmaster death) for it to be beneficial. Most latch waits aren't that frequent though, so I've decided not to go there initially.

> That would become possible if you could add and remove the latch and
> socket as discussed, with an opportunity to skip the modification work
> completely if the reusable WaitEventSet already happens to have the
> right stuff in it from the last WaitLatchOrSocket call. Or maybe the
> hot wait loops should simply be rewritten to reuse a WaitEventSet
> explicitly so they can manage that...

Yea, I think that's better.

> * It looks like epoll (and maybe kqueue?) can associate user data with
>   an event and give it back to you; if you have a mapping between fds
>   and some other object (in the case of the attached patch, subplan
>   nodes that are ready to be pulled), would it be possible and useful
>   to expose that (and simulate it where needed) rather than the caller
>   having to maintain associative data structures (see the linear
>   search in my patch)?

We already use the pointer that epoll gives you. But note that the 'WaitEvent' struct already contains the original position the event was registered at. That's filled in when you call WaitEventSetWait(). So you could either just build an array based on ->pos, or we could also add a "void *private" or such to the definition.

> * I would be interested in writing a kqueue implementation of this for
>   *BSD (and MacOSX?) at some point if someone doesn't beat me to it.

I hoped that somebody would do that - that'd afaics be the only major API missing.

> * It would be very cool to get some view into which WaitEventSetWait
>   call backends are waiting in from a system view, if it could be done
>   cheaply enough. A name/ID for the call site, and an indication of
>   which latch and how many fds... or something.

That's not something I plan to tackle today; we don't have wait event integration for latches atm. There are some threads somewhere about this, but that seems mostly independent of the way latches are implemented.

Thanks,

Andres
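[Editorial aside] As background for the epoll point above, here is a hedged, standalone sketch (plain epoll, not PostgreSQL code) of the mechanism being referred to: each registered fd carries an epoll_data_t union whose ptr member is handed back verbatim with the event, so the waiter can jump straight to its own bookkeeping object instead of searching by fd. The my_subplan type is invented purely for illustration.

```c
/* Standalone illustration of epoll's per-event user data; not PostgreSQL code. */
#include <sys/epoll.h>
#include <stdio.h>
#include <unistd.h>

struct my_subplan               /* invented stand-in for the object an fd belongs to */
{
    int         fd;
    const char *name;
};

static void
watch_and_dispatch(struct my_subplan *plans, int nplans)
{
    int                 epfd = epoll_create1(0);
    struct epoll_event  ev, events[8];
    int                 i, nready;

    if (epfd < 0)
    {
        perror("epoll_create1");
        return;
    }

    for (i = 0; i < nplans; i++)
    {
        ev.events = EPOLLIN;
        ev.data.ptr = &plans[i];        /* user data returned with the event */
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, plans[i].fd, &ev) < 0)
            perror("epoll_ctl");
    }

    nready = epoll_wait(epfd, events, 8, -1);
    for (i = 0; i < nready; i++)
    {
        /* No linear search needed: the kernel hands our pointer back. */
        struct my_subplan *plan = events[i].data.ptr;

        printf("subplan %s (fd %d) is ready\n", plan->name, plan->fd);
    }

    close(epfd);
}
```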
On Mon, Mar 21, 2016 at 10:26 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Mar 21, 2016 at 10:21 AM, Andres Freund <andres@anarazel.de> wrote:
>>
>>
>>
>> On March 21, 2016 5:12:38 AM GMT+01:00, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> >The article you pointed to justifies that the way ResetEvent is done by
>> >the patch is correct. I am not sure, but you can weigh whether there is
>> >a need for a comment, so that if we want to enhance this part of the
>> >code (or write something similar) in future, we don't need to
>> >rediscover this fact.
>>
>> I've added a reference in a comment.
>>
>> Did you have a chance of running the patched versions on windows?
>>
>
> I am planning to do it in the next few hours.
>
With 0002-Introduce-new-WaitEventSet-API, initdb is successful, but starting the server leads to the problem below:
LOG: database system was shut down at 2016-03-21 11:17:13 IST
LOG: MultiXact member wraparound protections are now enabled
LOG: database system is ready to accept connections
LOG: autovacuum launcher started
FATAL: failed to set up event for socket: error code 10022
LOG: statistics collector process (PID 668) exited with exit code 1
LOG: autovacuum launcher process (PID 3868) was terminated by exception 0xC0000005
HINT: See C include file "ntstatus.h" for a description of the hexadecimal value.
I haven't investigated the problem yet.
Hi,

On 2016-03-21 11:26:43 +0530, Amit Kapila wrote:
> LOG: database system was shut down at 2016-03-21 11:17:13 IST
> LOG: MultiXact member wraparound protections are now enabled
> LOG: database system is ready to accept connections
> LOG: autovacuum launcher started
> FATAL: failed to set up event for socket: error code 10022
> LOG: statistics collector process (PID 668) exited with exit code 1
> LOG: autovacuum launcher process (PID 3868) was terminated by exception 0xC0000005
> HINT: See C include file "ntstatus.h" for a description of the hexadecimal value.

We resolved this and two followup issues on IM; all localized issues. Thanks again for the help!

I've attached two refreshed patches including the relevant fixes, and the addition of a 'user_data' pointer to events, as desired by Thomas. I plan to push these after some msvc animals have turned green.

We might want to iterate on the API in some parts, but this will fix the regression, and imo looks good.

Regards,

Andres
Attachment
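[Editorial aside] Since the 'user_data' pointer mentioned above is the piece that lets a caller avoid the linear search described earlier in the thread, here is a hedged sketch of how a caller might use it. AsyncSource and wait_for_any_source are invented names for illustration; only the WaitEventSet calls follow the committed API, and their signatures are an approximation.

```c
/*
 * Hedged sketch: dispatching on WaitEvent.user_data.  AsyncSource is an
 * invented illustration type; only the WaitEventSet calls are real.
 */
#include "postgres.h"

#include "storage/latch.h"

typedef struct AsyncSource
{
	pgsocket	fd;				/* fd the producer asked us to wait on */
	void	   *producer;		/* e.g. a subplan node, opaque here */
} AsyncSource;

static AsyncSource *
wait_for_any_source(AsyncSource *sources, int nsources)
{
	WaitEventSet *set;
	WaitEvent	event;
	AsyncSource *ready = NULL;
	int			i;

	set = CreateWaitEventSet(CurrentMemoryContext, nsources);

	/* The pointer passed here comes back verbatim in event.user_data. */
	for (i = 0; i < nsources; i++)
		AddWaitEventToSet(set, WL_SOCKET_READABLE, sources[i].fd,
						  NULL, &sources[i]);

	if (WaitEventSetWait(set, -1, &event, 1) == 1)
		ready = (AsyncSource *) event.user_data;	/* no linear search */

	FreeWaitEventSet(set);
	return ready;
}
```

Note that, like the Append experiment upthread, this sketch still rebuilds the set for every wait; the RemoveWaitEventFromSet idea discussed earlier is what would allow it to be built once and reused.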
Hi,

I've pushed the new API. We might want to use it in more places...

On 2015-12-25 20:08:15 +0300, Васильев Дмитрий wrote:
> I suddenly found commit ac1d794 gives up to 3 times performance degradation.
>
> I tried to run pgbench -s 1000 -j 48 -c 48 -S -M prepared on 70 CPU-core
> machine:
> commit ac1d794 gives me 363,474 tps
> and previous commit a05dc4d gives me 956,146
> and master( 3d0c50f ) with revert ac1d794 gives me 969,265

Could you please verify that the performance figures are good now? I've tested this on a larger two-socket system, and there I managed to find a configuration showing the slowdown (-c64 -j4), and I confirmed the fix.

Regards,

Andres
On Mon, Mar 21, 2016 at 6:09 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-03-21 11:52:43 +1300, Thomas Munro wrote:
>> * I would be interested in writing a kqueue implementation of this for
>> *BSD (and MacOSX?) at some point if someone doesn't beat me to it.
>
> I hoped that somebody would do that - that'd afaics be the only major
> API missing.

Here's a first swing at it, though I need to find some spare time to test it and figure out whether details like error conditions and EOF are handled correctly. It could in theory minimise the number of syscalls it makes by buffering changes (additions, and hopefully one day removals) and then using a single kevent syscall to apply all modifications to the set and begin waiting, but for now it mirrors the epoll code. One user-visible difference compared to epoll/poll/select is that it delivers readable and writable events separately if both conditions are true for a single fd.

It builds and appears to work correctly on FreeBSD 10.2 and MacOSX 10.10.2. Sharing this early version in case any BSD users have any feedback. I hope to do some testing this weekend.

--
Thomas Munro
http://www.enterprisedb.com
Attachment
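[Editorial aside] For readers unfamiliar with kqueue, here is a hedged, standalone sketch (plain kqueue, not the attached patch) of the registration-and-wait pattern being described, including why readable and writable show up as two separate events: each (fd, filter) pair is its own kevent. It also shows the single-kevent-call pattern mentioned above, where pending changes are applied and the wait begins in one syscall.

```c
/* Standalone kqueue illustration; not the patch attached above. */
#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>
#include <stdio.h>
#include <unistd.h>

static void
wait_readable_or_writable(int fd)
{
    int             kq = kqueue();
    struct kevent   changes[2];
    struct kevent   events[2];
    int             i, nready;

    if (kq < 0)
    {
        perror("kqueue");
        return;
    }

    /*
     * Unlike epoll, where one registration carries a mask of conditions,
     * kqueue registers one filter per kevent, so readiness for reading and
     * for writing come back as two separate events for the same fd.
     */
    EV_SET(&changes[0], fd, EVFILT_READ, EV_ADD, 0, 0, NULL);
    EV_SET(&changes[1], fd, EVFILT_WRITE, EV_ADD, 0, 0, NULL);

    /* Apply both changes and block for events in a single kevent() call. */
    nready = kevent(kq, changes, 2, events, 2, NULL);
    if (nready < 0)
        perror("kevent");

    for (i = 0; i < nready; i++)
    {
        if (events[i].filter == EVFILT_READ)
            printf("fd %d readable (%lld bytes buffered)\n",
                   fd, (long long) events[i].data);
        else if (events[i].filter == EVFILT_WRITE)
            printf("fd %d writable\n", fd);
    }

    close(kq);
}
```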
On Fri, Dec 25, 2015 at 08:08:15PM +0300, Васильев Дмитрий wrote:
> I suddenly found commit ac1d794 gives up to 3 times performance degradation.
>
> I tried to run pgbench -s 1000 -j 48 -c 48 -S -M prepared on 70 CPU-core
> machine:
> commit ac1d794 gives me 363,474 tps
> and previous commit a05dc4d gives me 956,146
> and master( 3d0c50f ) with revert ac1d794 gives me 969,265

[This is a generic notification.]

The above-described topic is currently a PostgreSQL 9.6 open item. Robert, since you committed the patch believed to have created it, you own this open item. If some other commit is more relevant or if this does not belong as a 9.6 open item, please let us know. Otherwise, please observe the policy on open item ownership[1] and send a status update within 72 hours of this message. Include a date for your subsequent status update. Testers may discover new open items at any time, and I want to plan to get them all fixed well in advance of shipping 9.6rc1. Consequently, I will appreciate your efforts toward speedy resolution. Thanks.

[1] http://www.postgresql.org/message-id/20160527025039.GA447393@tornado.leadboat.com
On Sun, May 29, 2016 at 1:40 AM, Noah Misch <noah@leadboat.com> wrote:
> On Fri, Dec 25, 2015 at 08:08:15PM +0300, Васильев Дмитрий wrote:
>> I suddenly found commit ac1d794 gives up to 3 times performance degradation.
>>
>> I tried to run pgbench -s 1000 -j 48 -c 48 -S -M prepared on 70 CPU-core
>> machine:
>> commit ac1d794 gives me 363,474 tps
>> and previous commit a05dc4d gives me 956,146
>> and master( 3d0c50f ) with revert ac1d794 gives me 969,265
>
> [This is a generic notification.]
>
> The above-described topic is currently a PostgreSQL 9.6 open item. Robert,
> since you committed the patch believed to have created it, you own this open
> item. If some other commit is more relevant or if this does not belong as a
> 9.6 open item, please let us know. Otherwise, please observe the policy on
> open item ownership[1] and send a status update within 72 hours of this
> message. Include a date for your subsequent status update. Testers may
> discover new open items at any time, and I want to plan to get them all fixed
> well in advance of shipping 9.6rc1. Consequently, I will appreciate your
> efforts toward speedy resolution. Thanks.

So, the reason this is back on the open items list is that Mithun Cy re-reported this problem in:

https://www.postgresql.org/message-id/CAD__OuhPmc6XH=wYRm_+Q657yQE88DakN4=Ybh2oveFasHkoeA@mail.gmail.com

When I saw that, I moved this from CLOSE_WAIT back to open. However, subsequently, Ashutosh Sharma posted this, which suggests (not conclusively) that in fact the problem has been fixed:

https://www.postgresql.org/message-id/CAE9k0PkFEhVq-Zg4MH0bZ-zt_oE5PAS6dAuxRCXwX9kEVWceag@mail.gmail.com

What I *think* is going on here is:

- ac1d794 lowered performance
- backend_flush_after with a non-zero default lowered performance with a vengeance
- 98a64d0 repaired the damage done by ac1d794, or much of it, but Mithun couldn't see it in his benchmarks because backend_flush_after>0 is so bad

That could be wrong, but I haven't seen any evidence that it's wrong. So I'm inclined to say we should just move this open item back to the CLOSE_WAIT list (adding a link to this email to explain why we did so). Does that work for you?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, May 31, 2016 at 10:09:05PM -0400, Robert Haas wrote:
> On Sun, May 29, 2016 at 1:40 AM, Noah Misch <noah@leadboat.com> wrote:
> > On Fri, Dec 25, 2015 at 08:08:15PM +0300, Васильев Дмитрий wrote:
> >> I suddenly found commit ac1d794 gives up to 3 times performance degradation.
> >>
> >> I tried to run pgbench -s 1000 -j 48 -c 48 -S -M prepared on 70 CPU-core
> >> machine:
> >> commit ac1d794 gives me 363,474 tps
> >> and previous commit a05dc4d gives me 956,146
> >> and master( 3d0c50f ) with revert ac1d794 gives me 969,265
> >
> > [This is a generic notification.]
> >
> > The above-described topic is currently a PostgreSQL 9.6 open item. Robert,
> > since you committed the patch believed to have created it, you own this open
> > item. If some other commit is more relevant or if this does not belong as a
> > 9.6 open item, please let us know. Otherwise, please observe the policy on
> > open item ownership[1] and send a status update within 72 hours of this
> > message. Include a date for your subsequent status update. Testers may
> > discover new open items at any time, and I want to plan to get them all fixed
> > well in advance of shipping 9.6rc1. Consequently, I will appreciate your
> > efforts toward speedy resolution. Thanks.
>
> So, the reason this is back on the open items list is that Mithun Cy
> re-reported this problem in:
>
> https://www.postgresql.org/message-id/CAD__OuhPmc6XH=wYRm_+Q657yQE88DakN4=Ybh2oveFasHkoeA@mail.gmail.com
>
> When I saw that, I moved this from CLOSE_WAIT back to open. However,
> subsequently, Ashutosh Sharma posted this, which suggests (not
> conclusively) that in fact the problem has been fixed:
>
> https://www.postgresql.org/message-id/CAE9k0PkFEhVq-Zg4MH0bZ-zt_oE5PAS6dAuxRCXwX9kEVWceag@mail.gmail.com
>
> What I *think* is going on here is:
>
> - ac1d794 lowered performance
> - backend_flush_after with a non-zero default lowered performance with
>   a vengeance
> - 98a64d0 repaired the damage done by ac1d794, or much of it, but
>   Mithun couldn't see it in his benchmarks because backend_flush_after>0
>   is so bad

Ashutosh Sharma's measurements do bolster that conclusion.

> That could be wrong, but I haven't seen any evidence that it's wrong.
> So I'm inclined to say we should just move this open item back to the
> CLOSE_WAIT list (adding a link to this email to explain why we did
> so). Does that work for you?

That works for me.
Noah Misch <noah@leadboat.com> writes:
> On Tue, May 31, 2016 at 10:09:05PM -0400, Robert Haas wrote:
>> What I *think* is going on here is:
>> - ac1d794 lowered performance
>> - backend_flush_after with a non-zero default lowered performance with
>> a vengeance
>> - 98a64d0 repaired the damage done by ac1d794, or much of it, but
>> Mithun couldn't see it in his benchmarks because backend_flush_after>0
>> is so bad

> Ashutosh Sharma's measurements do bolster that conclusion.

>> That could be wrong, but I haven't seen any evidence that it's wrong.
>> So I'm inclined to say we should just move this open item back to the
>> CLOSE_WAIT list (adding a link to this email to explain why we did
>> so). Does that work for you?

> That works for me.

Can we make a note to re-examine this after the backend_flush_after business is resolved? Or at least get Mithun to redo his benchmarks with backend_flush_after set to zero?

regards, tom lane
On Tue, May 31, 2016 at 10:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Noah Misch <noah@leadboat.com> writes:
>> On Tue, May 31, 2016 at 10:09:05PM -0400, Robert Haas wrote:
>>> What I *think* is going on here is:
>>> - ac1d794 lowered performance
>>> - backend_flush_after with a non-zero default lowered performance with
>>> a vengeance
>>> - 98a64d0 repaired the damage done by ac1d794, or much of it, but
>>> Mithun couldn't see it in his benchmarks because backend_flush_after>0
>>> is so bad
>
>> Ashutosh Sharma's measurements do bolster that conclusion.
>
>>> That could be wrong, but I haven't seen any evidence that it's wrong.
>>> So I'm inclined to say we should just move this open item back to the
>>> CLOSE_WAIT list (adding a link to this email to explain why we did
>>> so). Does that work for you?
>
>> That works for me.
>
> Can we make a note to re-examine this after the backend_flush_after
> business is resolved? Or at least get Mithun to redo his benchmarks
> with backend_flush_after set to zero?

Ashutosh Sharma already did pretty much that test in the email to which I linked. (Ashutosh Sharma and Mithun CY work in the same office and have access to the same hardware.)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company