Thread: WaitLatchOrSocket optimization

WaitLatchOrSocket optimization

From: Konstantin Knizhnik
Date: 2018-03-15 19:01:40 +0300
Hi hackers,

Right now the function WaitLatchOrSocket is implemented in a very inefficient
way: for each invocation it creates an epoll instance, registers the events,
and then closes the instance.
It is certainly possible to create a wait event set once with
CreateWaitEventSet and then reuse it with WaitEventSetWait, and that is what
the most performance-critical places already do (a minimal sketch of that
pattern follows below).
But there are still a lot of places in Postgres where WaitLatchOrSocket or
WaitLatch is used.
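For reference, here is a minimal sketch of the create-once / wait-many
pattern.  The latch.c calls (CreateWaitEventSet, AddWaitEventToSet,
WaitEventSetWait, FreeWaitEventSet) are the real APIs; the helper function
itself is only illustrative, not code from any existing caller.

/*
 * Sketch: build the wait event set once and reuse it for every wait,
 * instead of letting WaitLatchOrSocket create and close an epoll
 * instance on each call.
 */
#include "postgres.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "storage/latch.h"

static void
wait_until_readable(pgsocket sock)
{
    WaitEventSet *wes = CreateWaitEventSet(CurrentMemoryContext, 2);
    WaitEvent     ev;

    AddWaitEventToSet(wes, WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
    AddWaitEventToSet(wes, WL_SOCKET_READABLE, sock, NULL, NULL);

    for (;;)
    {
        /* One epoll_wait() per iteration; no epoll_create()/close(). */
        WaitEventSetWait(wes, -1 /* no timeout */, &ev, 1, PG_WAIT_EXTENSION);

        if (ev.events & WL_LATCH_SET)
        {
            ResetLatch(MyLatch);
            CHECK_FOR_INTERRUPTS();
        }
        if (ev.events & WL_SOCKET_READABLE)
            break;              /* data is ready, let the caller read it */
    }

    FreeWaitEventSet(wes);
}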

One of them is postgres_fdw.
If we run pgbench through postgres_fdw (just redirect the pgbench tables to
localhost using postgres_fdw), then on a system with a large number of CPU
cores (72), "pgbench -S -M prepared -c 100 -j 32" shows about 38k TPS with
the following profile:

-   73.83%     0.12%  postgres  postgres            [.] PostgresMain
    - 73.82% PostgresMain
       + 28.18% PortalRun
       + 26.48% finish_xact_command
       + 13.66% PortalStart
       + 1.83% exec_simple_query
       + 1.22% pq_getbyte
       + 0.89% ReadyForQuery
-   66.04%     0.03%  postgres  [kernel.kallsyms]   [k] entry_SYSCALL_64_fastpath
    - 66.01% entry_SYSCALL_64_fastpath
       + 61.39% syscall_return_slowpath
       + 1.52% sys_epoll_create1
       + 1.30% SYSC_sendto
       + 0.94% sys_epoll_pwait
         0.57% SYSC_recvfrom
-   65.61%     0.03%  postgres  postgres_fdw.so     [.] pgfdw_get_result
    - 65.58% pgfdw_get_result
       - 65.00% WaitLatchOrSocket
          + 62.60% __close
          + 1.62% CreateWaitEventSet
-   65.09%     0.02%  postgres  postgres            [.] WaitLatchOrSocket
    - 65.08% WaitLatchOrSocket
       + 62.68% __close
       + 1.62% CreateWaitEventSet
+   62.69%     0.02%  postgres  libpthread-2.26.so  [.] __close


So more than 60% of the CPU time is spent in close().

If we cache the wait event sets, performance increases to 225k TPS: five
times higher!
On systems with a smaller number of cores the effect of this patch is not as
large: on my desktop with 4 cores I get only about a 10% improvement on the
same test.

There are two possible ways of fixing this issue:
1. Patch postgres_fdw to store the WaitEventSet in the connection.
2. Patch WaitLatchOrSocket to cache the created wait event sets.

The second approach is more generic and covers all usages of WaitLatch.
The attached patch implements this approach.
The most challenging part of this approach is that the socket descriptor is
part of the hash key: the socket may already have been closed and its
descriptor number reused, so the cached epoll set would no longer watch the
right socket.
To solve this I always try to add the socket to the epoll set, ignoring the
EEXIST error (a minimal illustration of this re-add trick is below).
Certainly this adds some extra overhead, but it looks negligible compared
with the overhead of close() (if I comment out this branch, pgbench
performance is almost the same: 227k TPS).
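The re-add trick itself is plain epoll usage, nothing Postgres-specific;
roughly this (illustrative helper, not the actual patch code):

#include <errno.h>
#include <string.h>
#include <sys/epoll.h>

/*
 * Unconditionally try to register the socket in the (cached) epoll set
 * and treat EEXIST as success.
 */
static int
epoll_add_ignore_exist(int epfd, int sock)
{
    struct epoll_event ev;

    memset(&ev, 0, sizeof(ev));
    ev.events = EPOLLIN | EPOLLERR | EPOLLHUP;
    ev.data.fd = sock;

    if (epoll_ctl(epfd, EPOLL_CTL_ADD, sock, &ev) < 0 && errno != EEXIST)
        return -1;              /* a real error */
    return 0;                   /* added, or already present in the set */
}

If the descriptor number was reused, the stale registration has normally
already disappeared when the old socket was fully closed, so the ADD simply
succeeds and registers the new socket; if the same socket is still
registered, the kernel returns EEXIST and we carry on.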
But if there are other arguments against using a cache in WaitLatchOrSocket,
we also have a patch specifically for postgres_fdw (a rough sketch of that
approach is below).
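Keeping a WaitEventSet per postgres_fdw connection (approach 1) could look
roughly like this.  The PgFdwConnState struct and its wes field are
hypothetical names, not the actual patch; only the latch calls and libpq's
PQsocket() are real APIs.

#include "postgres.h"
#include "libpq-fe.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "storage/latch.h"
#include "utils/memutils.h"

typedef struct PgFdwConnState
{
    PGconn       *conn;         /* libpq connection to the remote server */
    WaitEventSet *wes;          /* lazily built, lives as long as conn */
} PgFdwConnState;

static void
pgfdw_wait_for_data(PgFdwConnState *st)
{
    WaitEvent   ev;

    if (st->wes == NULL)
    {
        /* Built once per connection; reused by every subsequent wait. */
        st->wes = CreateWaitEventSet(TopMemoryContext, 2);
        AddWaitEventToSet(st->wes, WL_LATCH_SET, PGINVALID_SOCKET,
                          MyLatch, NULL);
        AddWaitEventToSet(st->wes, WL_SOCKET_READABLE,
                          PQsocket(st->conn), NULL, NULL);
    }

    WaitEventSetWait(st->wes, -1, &ev, 1, PG_WAIT_EXTENSION);

    if (ev.events & WL_LATCH_SET)
    {
        ResetLatch(MyLatch);
        CHECK_FOR_INTERRUPTS();
    }
}

The set would have to be freed with FreeWaitEventSet when the connection is
dropped or its socket changes, which ties the set's lifetime to the
connection's and sidesteps the descriptor-reuse problem described above.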



-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Attachment

Re: WaitLatchOrSocket optimization

From: Andres Freund
Date: 2018-03-15 20:25
Hi,

On 2018-03-15 19:01:40 +0300, Konstantin Knizhnik wrote:
> Right now the function WaitLatchOrSocket is implemented in a very inefficient
> way: for each invocation it creates an epoll instance, registers the events,
> and then closes the instance.

Right, everything performance critical should be migrated to using a
persistent wait event set.


> But there are still a lot of places in Postgres where WaitLatchOrSocket or
> WaitLatch is used.

Most don't matter performance-wise, though.



> There are two possible ways of fixing this issue:
> 1. Patch postgres_fdw to store the WaitEventSet in the connection.
> 2. Patch WaitLatchOrSocket to cache the created wait event sets.

I'm strongly opposed to 2).  The lifetime of sockets will make this an
absolute mess; it'll potentially explode the number of open file
handles, and some OSs don't handle a large number of sets at the same
time.

Greetings,

Andres Freund


Re: WaitLatchOrSocket optimization

From: Konstantin Knizhnik
Date:
Hi,

On 15.03.2018 20:25, Andres Freund wrote:
> Hi,
>
> On 2018-03-15 19:01:40 +0300, Konstantin Knizhnik wrote:
>> Right now the function WaitLatchOrSocket is implemented in a very inefficient
>> way: for each invocation it creates an epoll instance, registers the events,
>> and then closes the instance.
> Right, everything performance critical should be migrated to using a
> persistent wait event set.
>
>
>> But there are still a lot of places in Postgres where WaitLatchOrSocket or
>> WaitLatch is used.
> Most don't matter performance-wise, though.
Yes, I didn't see a significant difference in performance on workloads that
do not involve postgres_fdw.
But grepping for all occurrences of WaitLatch, I found some places that may
be called quite frequently, for example: waiting on the latch in ProcSleep,
waiting for more data in the shared memory message queue, waiting for
parallel workers to attach, waiting for a synchronous replica, ...
There are about a hundred usages of WaitLatch, and I am not sure that none of
them has an impact on performance.

>> There are two possible ways of fixing this issue:
>> 1. Patch postgres_fdw to store the WaitEventSet in the connection.
>> 2. Patch WaitLatchOrSocket to cache the created wait event sets.
> I'm strongly opposed to 2).  The lifetime of sockets will make this an
> absolute mess; it'll potentially explode the number of open file
> handles, and some OSs don't handle a large number of sets at the same
> time.
I understand your concern, but you are not completely right here.
First of all, the lifetime of the socket doesn't actually matter: if a socket
is closed, it is automatically removed from the epoll set.
Quoting the epoll manual:

       Q6  Will closing a file descriptor cause it to be removed from all
           epoll sets automatically?

       A6  Yes, but be aware of the following point.  A file descriptor is a
           reference to an open file description (see open(2)).  Whenever a
           file descriptor is duplicated via dup(2), dup2(2), fcntl(2)
           F_DUPFD, or fork(2), a new file descriptor referring to the same
           open file description is created.  An open file description
           continues to exist until all file descriptors referring to it
           have been closed.  A file descriptor is removed from an epoll set
           only after all the file descriptors referring to the underlying
           open file description have been closed (or before if the file
           descriptor is explicitly removed using epoll_ctl(2)
           EPOLL_CTL_DEL).  This means that even after a file descriptor
           that is part of an epoll set has been closed, events may be
           reported for that file descriptor if other file descriptors
           referring to the same underlying file description remain open.

So only the file descriptors created by epoll_create add to the number of
descriptors opened by a process, and that number is limited by the size of
the cache.  Right now the cache size is hardcoded to 113, but in most cases
even a much smaller cache would be enough.
I do not see how an extra, let's say, 13 open file descriptors could exhaust
the open file descriptor limit.
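To make the quoted behaviour concrete, here is a tiny standalone Linux
program (plain epoll, nothing Postgres-specific) showing that once the only
descriptor for a socket is closed, a cached epoll set simply stops reporting
it; no explicit EPOLL_CTL_DEL is needed:

#include <stdio.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

int
main(void)
{
    int                 sv[2];
    int                 epfd = epoll_create1(0);
    struct epoll_event  ev, out;

    socketpair(AF_UNIX, SOCK_STREAM, 0, sv);

    memset(&ev, 0, sizeof(ev));
    ev.events = EPOLLIN;
    ev.data.fd = sv[0];
    epoll_ctl(epfd, EPOLL_CTL_ADD, sv[0], &ev);

    write(sv[1], "x", 1);                   /* make sv[0] readable */
    printf("before close: %d event(s)\n",
           epoll_wait(epfd, &out, 1, 0));   /* reports 1 */

    close(sv[0]);                           /* only reference: auto-removed */
    printf("after close:  %d event(s)\n",
           epoll_wait(epfd, &out, 1, 0));   /* reports 0 */
    return 0;
}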

From my point of view we should either rewrite WaitLatchOrSocket to not use
epoll at all (use poll instead), or cache the created event sets.
The pgbench result with poll instead of epoll is 140k TPS, which is certainly
much better than the 38k TPS with current master, but much worse than the
225k TPS with cached epoll sets.
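For comparison, latch.c already contains a poll(2)-based implementation (the
WAIT_USE_POLL code path), and the essence of one wait on that path is roughly
the following: a single poll() call, with no kernel object to create or close
per wait.  The helper below is only an illustration, not latch.c code, and
latch_wakeup_fd is a stand-in for whatever descriptor the latch
implementation uses to wake the process (a self-pipe on that path).

#include <poll.h>

static int
wait_readable(int latch_wakeup_fd, int sock, int timeout_ms)
{
    struct pollfd fds[2];

    fds[0].fd = latch_wakeup_fd;
    fds[0].events = POLLIN;
    fds[1].fd = sock;
    fds[1].events = POLLIN;

    /* Returns the number of ready descriptors, 0 on timeout, -1 on error. */
    return poll(fds, 2, timeout_ms);
}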


> Greetings,
>
> Andres Freund

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company