MPTCP - multiplexing many TCP connections through one socket to get better bandwidth - Mailing list pgsql-hackers
From | Jakub Wartak
---|---
Subject | MPTCP - multiplexing many TCP connections through one socket to get better bandwidth
Date |
Msg-id | CAKZiRmy6j9PBzDHZwdgwHavwKDzv5GWtRSWOTj6-jv6SCOZ=YA@mail.gmail.com
List | pgsql-hackers
Hi -hackers,

With the attached patch PostgreSQL could possibly gain built-in MPTCP support, which would allow multiplexing (aggregating) multiple kernel-based TCP streams into one MPTCP socket. This allows bypassing any "chokepoints" on the network transparently for libpq, especially when having *multiple* TCP streams can achieve higher bandwidth than a single one. One can think of transparent aggregation of bandwidth over multiple WAN links/tunnels and so on. In short it works like this:

libpq_client <--MPTCP--> client_kernel <==multiple TCP connections==> server_kernel <--MPTCP--> postgres_server

Without much rework of PostgreSQL, this means accelerating any libpq-based use case. The most obvious beneficiaries would be any libpq-based heavy network transfers, especially in enterprise networks. These come to my mind:
- pg_basebackup (over e.g. WAN or multiple interfaces; but one can also think of using 2x 10GigE over LAN)
- streaming replication or logical replication (years ago I was able to use MPTCP with colleagues in production to bypass the single-TCP-stream limitation of streaming replication)
- COPY (both upload and download)
- postgres_fdw/dblink?

MPTCP is an IETF standard, has been included in Linux kernels for some time (realistically 5.16+?) and is *enabled* by default in most modern distributions. One could use it with mptcpize (an LD_PRELOAD wrapper that hijacks socket()), but that's not elegant and would require altering systemd startup scripts (the same story as with NUMA: literally nobody hacks those just to add numactl --interleave there, or to adjust ulimits). The patch right now just assumes IPPROTO_MPTCP is there, so it is not portable, but not that many OSes support it at all -- I think an #ifdef would be good enough for now (a rough sketch of what that could look like follows the test walkthrough below). I don't have access to macOS to develop this further there, nor do I think it would add much benefit there, but I may be wrong. So as such the proposed patch is trivial and Linux-only, although there is RFC 8684 [1][2]. I suspect it is way easier and simpler to support it this way, rather than trying to solve the same problem separately for each of the listed use cases.

Simulation, basic use and tests:

1. Strictly for demo purposes, we need to ARTIFICIALLY limit outbound bandwidth for each new flow (TCP connection) to 10 Mbit/s using `tc` on the server where PostgreSQL is going to be running later on (this simulates some chokepoints / multiple WAN paths):

DEV=enp0s31f6
tc qdisc add dev $DEV root handle 1: htb
tc class add dev $DEV parent 1: classid 1:1 htb rate 100mbit
for i in `seq 1 9`; do
    tc class add dev $DEV parent 1:1 classid 1:$i htb rate 10mbit ceil 10mbit
done
# see tc-flow(8) for details; classify each flow by port into a separate class (1:X)
tc filter add dev $DEV parent 1: protocol ip prio 1 handle 1 flow hash keys src,dst,proto,proto-src,proto-dst divisor 8 baseclass 1

2. From the client, verify that single-TCP bandwidth is really limited:

iperf3 -P 1 -R -c <server>    # check that a single-stream TCP connection really is limited instead of getting the full bandwidth
iperf3 -P 8 -R -c <server>    # check that multiple streams really get more bandwidth than above

3. Check that MPTCP is enabled and configured on both sides:

uname -r                      # at least 5.10+ according to [4] to get this balancing working, but 6.1+ LTS highly recommended (I've used 6.14.x)
sysctl net.mptcp.enabled      # should be 1 on both sides by default
ip mptcp limits set subflows 8 add_addr_accepted 8    # but feel free to set up maximum limits
4. Configure MPTCP endpoints on the server (this registers some dedicated listening ports for MPTCP use, so that there's no need to use multiple IP aliases or policy-based routing):

ps uaxw | grep -i mptcpd      # check whether the mptcpd daemon (path manager) is running; it is NOT required in this case
ip addr ls                    # let's assume 10.0.1.240 is my main IP on the eno1 device; no need to add new IPs thanks to the trick below
ip mptcp endpoint show        # to verify
#ip mptcp endpoint flush      # if necessary
# the lines below register ports 5202..5205 as LISTENing by the kernel and dedicated to MPTCP subflows
ip mptcp endpoint add 10.0.1.240 dev eno1 port 5202 signal
ip mptcp endpoint add 10.0.1.240 dev eno1 port 5203 signal
ip mptcp endpoint add 10.0.1.240 dev eno1 port 5204 signal
ip mptcp endpoint add 10.0.1.240 dev eno1 port 5205 signal
ip mptcp endpoint show        # to verify

5. Configure the client:

ip addr ls                    # here I got 10.0.1.250
ip mptcp endpoint show
ip mptcp endpoint add 10.0.1.250 dev enp0s31f6 subflow fullmesh    # not sure fullmesh is necessary, probably not
ip mptcp limits set add_addr_accepted 8 subflows 8

6. Verify that MPTCP works, then rerun the tests with mptcpize, e.g.:

on server: mptcpize run iperf3 -s
on client: mptcpize run -d iperf3 -P 1 -R -c <server>    # should get better bandwidth while using just 1 MPTCP connection
on server: run PostgreSQL with listen_mptcp='on'
on server: ss -Mtlnp sport 5432    # mptcp should be displayed
on client: run pg_basebackup/psql/...

Sample results for an 82 MB table copy; it's roughly 3x faster:

$ time PGMPTCP=0 /usr/pgsql19/bin/psql -h 10.0.1.240 -c '\copy pgbench_accounts TO '/dev/null';'
COPY 500000
real    0m42.123s

$ time PGMPTCP=1 /usr/pgsql19/bin/psql -h 10.0.1.240 -c '\copy pgbench_accounts TO '/dev/null';'
enabling MPTCP client
COPY 500000
real    0m14.416s

Sample results for pg_basebackup of a DB created with pgbench -i -s 5 (~1076 MB total due to WALs):

$ time /usr/pgsql19/bin/pg_basebackup -h 10.0.1.240 -c fast -D /tmp/test -v
pg_basebackup: initiating base backup, waiting for checkpoint to complete
pg_basebackup: checkpoint completed
[..]
pg_basebackup: base backup completed
real    1m26.786s

With PGMPTCP=1 set, it gets ~3x faster:

$ time PGMPTCP=1 /usr/pgsql19/bin/pg_basebackup -h 10.0.1.240 -c fast -D /tmp/test -v
enabling MPTCP client
pg_basebackup: initiating base backup, waiting for checkpoint to complete
[..]
pg_basebackup: starting background WAL receiver
enabling MPTCP client
[..]
pg_basebackup: base backup completed
real    0m30.460s

Because in the above case we have advertised 4 IP address/port pairs of the server to the client, we got the bump on a single socket (note: which HTB classes the flows end up being hashed into is random and depends on the ports used, so you usually get somewhere between 2x and 4x here). Also, as there are two independent application-level connections in basebackup (data transfer + WAL streaming), both get multiplexed (each with 4 subflows). If I added more ip mptcp ports (server-side), we could of course squeeze out even more, but that assumes one has that many paths. More advanced setups are possible too, including separate policy-based-routed (ip rule) configurations, as well as keeping the TCP connection highly available even across ISP/interface (WiFi?) outages. It works transparently with SSL/TLS too - tested. Of course it won't remove the single-CPU limitation of the tools involved (that's a completely different problem).

If it sounds interesting, I was thinking about adding to the patch something like contrib/mptcpinfo (a pg_stat_mptcp view to mimic pg_stat_ssl); a rough sketch of what such a view could build on follows below.
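As for that pg_stat_mptcp idea, the following is only my sketch (not part of the attached patch) of the kind of call such a view could be built on: Linux 5.16+ exposes an MPTCP_INFO getsockopt at the SOL_MPTCP level that reports, among other counters, how many subflows currently back a connection. The constant and field names below come from <linux/mptcp.h>; SOL_MPTCP (284) may still be missing from older libc headers, hence the fallback define.

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/mptcp.h>        /* struct mptcp_info, MPTCP_INFO */

#ifndef SOL_MPTCP
#define SOL_MPTCP 284           /* value from <linux/socket.h> */
#endif

/* Print how many subflows back an (assumed connected) MPTCP socket. */
static void
report_mptcp_subflows(int fd)
{
    struct mptcp_info info;
    socklen_t   len = sizeof(info);

    memset(&info, 0, sizeof(info));
    if (getsockopt(fd, SOL_MPTCP, MPTCP_INFO, &info, &len) == 0)
        printf("MPTCP connection with %u subflow(s)\n",
               (unsigned) info.mptcpi_subflows);
    else
        printf("not an MPTCP socket, or kernel without MPTCP_INFO: %s\n",
               strerror(errno));
}

A backend-side view could expose a few of those mptcpi_* counters per connection, much like pg_stat_ssl does for TLS attributes.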
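And to make the #ifdef idea from earlier more concrete, here is a minimal sketch (again not the attached patch; the function name and the try_mptcp parameter are made up for illustration) of socket creation that prefers MPTCP and silently falls back to ordinary TCP when the kernel refuses it:

#include <stdbool.h>
#include <netinet/in.h>
#include <sys/socket.h>

/*
 * Open a stream socket, preferring MPTCP when both the headers and the
 * running kernel support it.  try_mptcp could be wired to a listen_mptcp
 * GUC on the server side, or to the PGMPTCP environment variable in libpq.
 */
static int
open_stream_socket(int family, bool try_mptcp)
{
#ifdef IPPROTO_MPTCP            /* Linux-only today */
    if (try_mptcp)
    {
        int     fd = socket(family, SOCK_STREAM, IPPROTO_MPTCP);

        if (fd >= 0)
            return fd;          /* got an MPTCP socket */

        /*
         * Typical failures here are EPROTONOSUPPORT/ENOPROTOOPT, i.e. a
         * kernel without MPTCP or with net.mptcp.enabled=0; fall back to
         * plain TCP below.
         */
    }
#endif
    return socket(family, SOCK_STREAM, 0);      /* ordinary TCP */
}

Whether such a fallback should be silent, logged, or turned into a hard error is of course open for discussion.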
Also, as for the patch: there are some other places where a socket() is created (the libpq cancel packet), but I don't think there is any purpose in adding MPTCP there.

It is important to mention that there are two implementations of MPTCP on Linux, so when someone googles this there is a lot of conflicting information:
1) The earlier one, which required kernel patching (up to <= 5.6) and had an "ndiffports" multiplexer built in that worked mostly out of the box.
2) The newer one [3], the one already merged into today's kernel, which is a little different and does not come with a built-in ndiffports path manager. With this newer one, as shown above, some more manual steps (ip mptcp endpoints) may be required, but the mptcpd daemon that manages (sub)flows seems to be evolving as usage of this protocol rises. So I hope that in the future all of those ip mptcp commands will become optional.

Thoughts?

-Jakub Wartak

[1] - https://en.wikipedia.org/wiki/Multipath_TCP
[2] - https://www.rfc-editor.org/rfc/rfc8684.html
[3] - https://www.mptcp.dev/
[4] - https://github.com/multipath-tcp/mptcp_net-next/wiki/#changelog