Hi Alvaro,
On 24.01.2017 at 19:36, Alvaro Herrera wrote:
> Tobias Oberstein wrote:
>
>> I am benchmarking IOPS, and while doing so, it becomes apparent that at
>> these scales it does matter _how_ IO is done.
>>
>> The most efficient way is libaio. I get 9.7 million/sec IOPS with low CPU
>> load. Using any synchronous IO engine is slower and produces higher load.
>>
>> I do understand that switching to libaio isn't going to fly for PG
>> (completely different approach).
>
> Maybe it is possible to write a new f_smgr implementation (parallel to
> md.c) that uses libaio. There is no "seek" in that interface, at least,
> though the interface does assume that the implementation is blocking.
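
Regarding an f_smgr implementation on top of libaio: for reference, here is
a minimal sketch of a single asynchronous read via libaio (error handling
trimmed; the device path and sizes are made up for illustration; link with
-laio):

#define _GNU_SOURCE   /* for O_DIRECT */
#include <libaio.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    io_context_t ctx;
    struct iocb cb, *cbs[1];
    struct io_event ev;
    void *buf;
    int fd;

    /* O_DIRECT bypasses the page cache; the buffer must be aligned */
    fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }
    if (posix_memalign(&buf, 4096, 4096) != 0) return 1;

    /* the libaio wrappers return negative error codes instead of
       setting errno, hence no perror below */
    memset(&ctx, 0, sizeof(ctx));
    if (io_setup(1, &ctx) < 0) { fprintf(stderr, "io_setup failed\n"); return 1; }

    /* one 4 kB read at offset 0: the offset travels with the request, so
       there is no lseek and no shared file position to serialize on */
    io_prep_pread(&cb, fd, buf, 4096, 0);
    cbs[0] = &cb;
    if (io_submit(ctx, 1, cbs) != 1) { fprintf(stderr, "io_submit failed\n"); return 1; }

    /* reap the completion; a real engine would keep many IOs in flight */
    if (io_getevents(ctx, 1, 1, &ev, NULL) != 1) { fprintf(stderr, "io_getevents failed\n"); return 1; }
    printf("read returned %ld bytes\n", (long) ev.res);

    io_destroy(ctx);
    close(fd);
    return 0;
}

The submit/reap split is what makes it asynchronous - and also why it
doesn't map directly onto a blocking smgr interface.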
>
FWIW, I have now systematically compared IO performance across the
different IO methods, normalized for the system load each method induces.
I use the FIO ioengine terminology:
sync = lseek/read/write
psync = pread/pwrite
Here:
https://github.com/oberstet/scratchbox/raw/master/cruncher/engines-compared/normalized-iops.pdf
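
To spell out what those engines mean at the syscall level: sync issues two
syscalls per IO and goes through the fd's shared file position, while psync
is a single syscall with an explicit offset. A sketch (the function names
are mine):

#include <unistd.h>
#include <sys/types.h>

/* "sync" style: two syscalls per IO, and lseek mutates the fd's shared
   file position, which the kernel must serialize across threads */
ssize_t read_sync_style(int fd, void *buf, size_t len, off_t offset)
{
    if (lseek(fd, offset, SEEK_SET) == (off_t) -1)
        return -1;
    return read(fd, buf, len);
}

/* "psync" style: a single syscall; the offset is passed explicitly and
   the file position is never touched */
ssize_t read_psync_style(int fd, void *buf, size_t len, off_t offset)
{
    return pread(fd, buf, len, offset);
}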
Conclusion:
psync has 1.15x the normalized IOPS compared to sync
libaio has up to 6.5x the normalized IOPS compared to sync
---
These measurements were done on 16 NVMe block devices.
As mentioned, when Linux MD comes into the game, the difference between
sync and psync is much bigger - there is lock contention in MD. The reason:
with MD in the path, even our massive CPU can no longer hide the
inefficiency of sync's double syscalls per IO (cf. the lseek+read sketch
above).
This MD issue is the bigger problem for us (compared to PG using
sync/psync). I am going to post to the linux-raid list about it, as advised
by the FIO developers.
---
That being said, for getting maximum performance out of NVMes with minimal
system load, the real deal probably isn't libaio either, but kernel bypass
(hinted to me by the FIO devs):
http://www.spdk.io/
FIO has a plugin for SPDK, which I am going to explore to establish a
final conclusive baseline for maximum IOPS normalized for load.
There are similar approaches in networking (BSD netmap, DPDK) that bypass
the kernel altogether (zero copy to userland, polling instead of
interrupts, etc.). With hardware like this (NVMe, 100GbE, etc.), the kernel
gets in the way ...
Anyway, this is now probably OT as far as PG is concerned ;)
Cheers,
/Tobias