Re: [HACKERS] lseek/read/write overhead becomes visible at scale .. - Mailing list pgsql-hackers

From Tobias Oberstein
Subject Re: [HACKERS] lseek/read/write overhead becomes visible at scale ..
Date
Msg-id 61dc9e79-721a-85c1-c7a9-c6024390d436@gmail.com
In response to Re: [HACKERS] lseek/read/write overhead becomes visible at scale ..  (Alvaro Herrera <alvherre@2ndquadrant.com>)
List pgsql-hackers
Hi Alvaro,

On 24.01.2017 at 19:36, Alvaro Herrera wrote:
> Tobias Oberstein wrote:
>
>> I am benchmarking IOPS, and while doing so, it becomes apparent that at
>> these scales it does matter _how_ IO is done.
>>
>> The most efficient way is libaio. I get 9.7 million/sec IOPS with low CPU
>> load. Using any synchronous IO engine is slower and produces higher load.
>>
>> I do understand that switching to libaio isn't going to fly for PG
>> (completely different approach).
>
> Maybe it is possible to write a new f_smgr implementation (parallel to
> md.c) that uses libaio.  There is no "seek" in that interface, at least,
> though the interface does assume that the implementation is blocking.
>

FWIW, I have now systematically compared IO performance across the 
different IO methods, normalized for the system load each one induces.

I use the FIO ioengine terminology:

sync = lseek/read/write
psync = pread/pwrite
libaio = Linux native AIO (io_submit/io_getevents)
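
To make the syscall difference concrete in C (a minimal sketch, not PG 
code): per random read, sync enters the kernel twice where psync enters 
once:

  /* illustrative only: one random read of len bytes at offset off */
  #include <sys/types.h>
  #include <unistd.h>

  /* "sync" engine: position, then read -- two kernel entries */
  ssize_t read_sync(int fd, void *buf, size_t len, off_t off)
  {
      if (lseek(fd, off, SEEK_SET) == (off_t) -1)
          return -1;
      return read(fd, buf, len);
  }

  /* "psync" engine: positioned read -- one kernel entry, and no
     shared file offset that needs updating */
  ssize_t read_psync(int fd, void *buf, size_t len, off_t off)
  {
      return pread(fd, buf, len, off);
  }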

Here:

https://github.com/oberstet/scratchbox/raw/master/cruncher/engines-compared/normalized-iops.pdf

Conclusion:

psync has 1.15x the normalized IOPS compared to sync
libaio has up to 6.5x the normalized IOPS compared to sync
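
libaio's edge comes from its asynchronous, batched model: a whole queue 
of IOs is submitted with one syscall and completions are reaped with 
another, instead of one blocking syscall per IO. A minimal sketch of 
that pattern (error handling elided, device path illustrative, link 
with -laio):

  #define _GNU_SOURCE
  #include <libaio.h>
  #include <fcntl.h>
  #include <stdlib.h>
  #include <unistd.h>

  #define QD 32                /* queue depth: IOs kept in flight */
  #define BS 4096              /* block size */

  int main(void)
  {
      io_context_t ctx = 0;
      struct iocb cbs[QD], *cbp[QD];
      struct io_event evs[QD];
      /* device path is illustrative */
      int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);

      io_setup(QD, &ctx);
      for (int i = 0; i < QD; i++) {
          void *buf;
          posix_memalign(&buf, BS, BS);  /* O_DIRECT wants aligned buffers */
          io_prep_pread(&cbs[i], fd, buf, BS, (long long) i * BS);
          cbp[i] = &cbs[i];
      }

      io_submit(ctx, QD, cbp);              /* one syscall submits QD reads */
      io_getevents(ctx, QD, QD, evs, NULL); /* one syscall reaps QD events */

      io_destroy(ctx);
      close(fd);
      return 0;
  }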

---

These measurements were done on 16 NVMe block devices.
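
For anyone wanting to reproduce the shape of the comparison, fio 
invocations along these lines exercise the three engines (device path, 
block size, job count and queue depth here are illustrative, not my 
exact job parameters):

  fio --name=sync   --ioengine=sync   --rw=randread --bs=4k --direct=1 \
      --numjobs=16 --runtime=30 --filename=/dev/nvme0n1
  fio --name=psync  --ioengine=psync  --rw=randread --bs=4k --direct=1 \
      --numjobs=16 --runtime=30 --filename=/dev/nvme0n1
  fio --name=libaio --ioengine=libaio --rw=randread --bs=4k --direct=1 \
      --numjobs=16 --iodepth=32 --runtime=30 --filename=/dev/nvme0n1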

As mentioned, when Linux MD enters the picture, the difference between 
sync and psync gets much larger - there is lock contention in MD.

The reason: with MD in the path, even our massive CPUs can no longer 
hide the inefficiency of the doubled syscalls.

This MD issue is our bigger problem (compared to PG using sync/psync). 
I am going to post to the linux-raid list about it, as advised by the 
FIO developers.

---

That being said, when it comes to getting maximum performance out of 
NVMes at minimal system load, the real deal probably isn't libaio 
either, but kernel bypass (hinted to me by the FIO devs):

http://www.spdk.io/

FIO has a plugin for SPDK, which I am going to explore to establish a 
conclusive baseline for maximum IOPS normalized for load.

There are similar approaches in networking (BSD netmap, DPDK) that 
bypass the kernel altogether (zero copy to userland, no interrupts but 
polling, etc.). With hardware like this (NVMe, 100GbE etc.), the kernel 
gets in the way ..

Anyway, this is now probably OT as far as PG is concerned ;)

Cheers,
/Tobias