Re: Introducing PgVA aka PostgresVectorAcceleration using SIMD vector instructions starting with hex_encode - Mailing list pgsql-hackers
From | John Naylor |
---|---|
Subject | Re: Introducing PgVA aka PostgresVectorAcceleration using SIMD vector instructions starting with hex_encode |
Date | |
Msg-id | CAFBsxsG4OWHBbSDM=sSeXrQGOtkPiOEOuME4yD7Ce41NtaAD9g@mail.gmail.com |
In response to | AW: Introducing PgVA aka PostgresVectorAcceleration using SIMD vector instructions starting with hex_encode (Hans Buschmann <buschmann@nidsa.net>) |
List | pgsql-hackers |
On Fri, Dec 31, 2021 at 9:32 AM Hans Buschmann <buschmann@nidsa.net> wrote:

> Inspired by the effort to integrate JIT for executor acceleration I thought selected simple algorithms working with array-oriented data should be drastically accelerated by using SIMD instructions on modern hardware.

Hi Hans,

I experimented with SIMD within Postgres last year, so I have some idea of the benefits and difficulties. I do think we can profit from SIMD more, but we must be very careful to manage complexity and maximize usefulness. Hopefully I can offer some advice.

> - restrict on 64-bit architectures
> These are the dominant server architectures, have the necessary data formats and corresponding registers and operating instructions
> - start with Intel x86-64 SIMD instructions:
> This is the vastly most used platform, available for development and in practical use
> - don’t restrict the concept to only Intel x86-64, so that later people with more experience on other architectures can jump in and implement comparable algorithms
> - fallback to the established implementation in postgres in non appropriate cases or on user request (GUC)

These are all reasonable goals, except GUCs are the wrong place to choose hardware implementations -- it should Just Work.

> - coding for maximum hardware usage instead of elegant programming
> Once tested, the simple algorithm works as advertised and is used to replace most execution parts of the standard implementation in C

-1

Maintaining good programming style is a key goal of the project. There are certainly non-elegant parts in the code, but that has a cost and we must consider tradeoffs carefully. I have read some of the optimized code in glibc and it is not fun. They at least know they are targeting one OS and one compiler -- we don't have that luxury.

> - focus optimization for the most advanced SIMD instruction set: AVX512
> This provides the most advanced instructions and quite a lot of large registers to aid in latency avoiding

-1

AVX512 is a hodge-podge of different instruction subsets and is entirely lacking on some recent Intel server hardware. It is also only available from a single chipmaker thus far.

> - The loops implementing the algorithm are written in NASM assembler:
> NASM is actively maintained, has many output formats, follows the Intel style, has all current instructions implemented and is fast.
> - The loops are mostly independent of operating systems, so all OS’s basing on a NASM obj output format are supported:
> This includes Linux and Windows as the most important ones
> - The algorithms use advanced techniques (constant and temporary registers) to avoid most unnecessary memory accesses:
> The assembly implementation gives you the full control over the registers (unlike intrinsics)

On the other hand, intrinsics are easy to integrate into a C codebase and relieve us from thinking about object formats. A performance feature that happens to work only on common OS's is probably fine from the user point of view, but if we have to add a lot of extra stuff to make it work at all, that's not a good trade off. "Mostly independent" of the OS is not acceptable -- we shouldn't have to think about the OS at all when the coding does not involve OS facilities (I/O, processes, etc).

> As an example I think of pg_dump to dump a huge amount of bytea data (not uncommon in real applications). Most of these data are in toast tables, often uncompressed due to their inherent structure. The dump must read the toast pages into memory, decompose the page, hexdump the content, put the result in an output buffer and trigger the I/O. By integrating all these steps into one big performance improvements can be achieved (but naturally not here in my first implementation!).

Seems like a reasonable area to work on, but I've never measured.

> The best result I could achieve was roughly 95 seconds for 1 Million dumps of 1718 KB on a Tigerlake laptop using AVX512. This gives about 18 GB/s source-hexdumping rate on a single core!
>
> In another run with postgres the time to hexdump about half a million tuples with a bytea column yielding about 6 GB of output reduced the time from about 68 seconds to 60 seconds, which clearly shows the postgres overhead for executing the copy command on such a data set.

I don't quite follow -- is this patched vs. unpatched Postgres? I'm not sure what's been demonstrated.

> The assembler routines should work on most x86-64 operating systems, but for the moment only elf64 and WIN64 output formats are supported.
>
> The standard calling convention is followed mostly in the LINUX style, on Windows the parameters are moved around accordingly. The same assembler-source-code can be used on both platforms.
> I have updated the makefile to include the nasm command and the nasm flags, but I need help to make these based on configure.
>
> I also have no knowledge on other operating systems (MAC-OS etc.)
>
> The calling conventions can be easily adapted if they differ but somebody else should jump in for testing.

As I implied earlier, this is way too low-level. If we have to worry about obj formats and calling conventions, we'd better be getting something *really* amazing in return.

> But I really need help by an expert to integrate it in the perl building process.
> I would much appreciate if someone else could jump in for a patch to configure-integration and another patch for .vcxobj integration.

It's a bit presumptuous to enlist others for specific help without general agreement on the design, especially on the most tedious parts. Also, here's a general engineering tip: If the non-fun part is too complex for you to figure out, that might indicate the fun part is too ambitious.

I suggest starting with a simple patch with SSE2 (always present on x86-64) intrinsics, one that anyone can apply and test without any additional work. Then we can evaluate if the speed-up in the hex encoding case is worth some additional complexity. As part of that work, it might be good to see if some portable improved algorithm is already available somewhere.

> There is much room for other implementations (checksum verification/setting, aggregation, numeric datatype, merging, generate_series, integer and floating point output …) which could be addressed later on.

Float output has already been pretty well optimized. CRC checksums already have a hardware implementation on x86 and Arm. I don't know of any practical workload where generate_series() is too slow. Aggregation is an interesting case, but I'm not sure what the current bottlenecks are.

--
John Naylor
EDB: http://www.enterprisedb.com
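
For illustration only, here is a minimal sketch of what an SSE2-intrinsics starting point for hex encoding might look like. It is not code from the proposed patch or from the Postgres tree: the function name `hex_encode_sse2`, its signature, and the compare-and-add digit conversion are all assumptions made for this sketch. It processes 16 input bytes per iteration, falls back to a plain table lookup for the tail, and compiles to the scalar loop alone when SSE2 is not available, so no object formats or calling conventions are involved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>          /* SSE2 intrinsics */
#define SKETCH_USE_SSE2 1
#endif

#ifdef SKETCH_USE_SSE2
/* Map 16 nibble values (0..15, one per byte) to lowercase ASCII hex digits. */
static inline __m128i
nibbles_to_hex(__m128i n)
{
    /* Lanes holding 10..15 need an extra ('a' - '0' - 10) on top of '0'. */
    __m128i gt9 = _mm_cmpgt_epi8(n, _mm_set1_epi8(9));
    __m128i adj = _mm_and_si128(gt9, _mm_set1_epi8('a' - '0' - 10));

    return _mm_add_epi8(_mm_add_epi8(n, _mm_set1_epi8('0')), adj);
}
#endif

/*
 * Hypothetical helper: write 2 * len lowercase hex digits for src into dst
 * (no terminating NUL) and return the number of bytes written.
 */
size_t
hex_encode_sse2(const uint8_t *src, size_t len, char *dst)
{
    static const char hextbl[] = "0123456789abcdef";
    size_t      i = 0;

    assert(len == 0 || (src != NULL && dst != NULL));

#ifdef SKETCH_USE_SSE2
    {
        const __m128i mask = _mm_set1_epi8(0x0f);

        for (; i + 16 <= len; i += 16)
        {
            __m128i in = _mm_loadu_si128((const __m128i *) (src + i));

            /* There is no 8-bit shift in SSE2: shift 16-bit lanes, then mask. */
            __m128i hi = _mm_and_si128(_mm_srli_epi16(in, 4), mask);
            __m128i lo = _mm_and_si128(in, mask);

            hi = nibbles_to_hex(hi);
            lo = nibbles_to_hex(lo);

            /* Interleave so each input byte yields its high digit, then low. */
            _mm_storeu_si128((__m128i *) (dst + 2 * i),
                             _mm_unpacklo_epi8(hi, lo));
            _mm_storeu_si128((__m128i *) (dst + 2 * i + 16),
                             _mm_unpackhi_epi8(hi, lo));
        }
    }
#endif

    /* Scalar tail; also the whole job when SSE2 is unavailable. */
    for (; i < len; i++)
    {
        dst[2 * i] = hextbl[src[i] >> 4];
        dst[2 * i + 1] = hextbl[src[i] & 0x0f];
    }

    return 2 * len;
}
```

A caller would pass a destination buffer of at least `2 * len` bytes. Whether something like this beats the existing table-driven loop by enough to justify the extra code path is exactly what would need to be measured first.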