Thread: [HACKERS] lseek/read/write overhead becomes visible at scale ..

[HACKERS] lseek/read/write overhead becomes visible at scale ..

From

Tobias Oberstein

Date:

24 January 2017, 20:11:09

Hi guys,

please bear with me, this is my first post here. Please also excuse the length
.. I was trying to do all my homework before posting here;)

The overhead of lseek/read/write vs pread/pwrite (or even
preadv/pwritev) was previously discussed here

https://www.postgresql.org/message-id/flat/CABUevEzZ%3DCGdmwSZwW9oNuf4pQZMExk33jcNO7rseqrAgKzj5Q%40mail.gmail.com#CABUevEzZ=CGdmwSZwW9oNuf4pQZMExk33jcNO7rseqrAgKzj5Q@mail.gmail.com

The thread ends with

"Well, my point remains that I see little value in messing with
long-established code if you can't demonstrate a benefit that's clearly
above the noise level."

I have done lots of benchmarking over the last days on a massive box,
and I can provide numbers that I think show that the impact can be
significant.

Our storage tops out at 9.4 million random 4kB read IOPS.

Storage consists of 8 x Intel P3608 4TB NVMe (which logically is 16 NVMe
block devices).

Above number was using psync FIO engine .. with libaio, it's at 9.7 mio
with much lower CPU load - but this doesn't apply to PG of course.

Switching to sync engine, it drops to 9.1 mio - but the system load then
is also much higher!

In a way, our massive CPU (4 x E7 8880 with 88 cores / 176 threads) hides
the impact of sync vs psync.

So, with less CPU, the syscall overhead kicks in (we are CPU bound).

It also becomes much more visible with Linux MD in the mix, because MD
comes with its own overhead/bottleneck, and our CPU then cannot hide
the overhead of sync vs psync anymore:

sync on MD: IOPS=1619k
psync on MD: IOPS=4289k
sync on non-MD: IOPS=9165k
psync on non-MD: IOPS=9410k
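
For scale, the four figures above can be turned into psync-over-sync ratios. The sketch below is only arithmetic on the numbers quoted in this mail; the dictionary layout and the function name are made up for illustration.

```python
# IOPS figures quoted above, in thousands (k) of random 4kB reads per second.
iops_k = {
    ("md", "sync"): 1619,
    ("md", "psync"): 4289,
    ("non-md", "sync"): 9165,
    ("non-md", "psync"): 9410,
}


def psync_speedup(layer: str) -> float:
    """Ratio of psync IOPS to sync IOPS for the given storage layer."""
    return iops_k[(layer, "psync")] / iops_k[(layer, "sync")]


if __name__ == "__main__":
    for layer in ("md", "non-md"):
        print(f"{layer}: psync/sync = {psync_speedup(layer):.2f}x")
```

That is roughly 2.6x on MD, but only about 3% on the raw devices.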

Please find all the details here

https://github.com/oberstet/scratchbox/tree/master/cruncher/sync-engines

Note: MD has a lock contention (lock_qsc) - I am going down that rabbit
hole too. But this is only related to PG in that the negative impacts
multiply.

What I am trying to say is: the syscall overhead of doing
lseek/read/write instead of pread/pwrite does become visible and hurts at a
certain point.
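
To make the two access patterns concrete, here is a minimal sketch in Python (chosen only for brevity; PostgreSQL itself is C). It reads the same 4kB block once with lseek followed by read, and once with a single pread. The temporary file and the helper names are illustrative assumptions, not anything taken from PG or FIO, and os.pread is only available on Unix-like systems.

```python
import os
import tempfile

BLOCK = 4096


def read_block_seek(fd: int, blockno: int) -> bytes:
    """Two syscalls: position the file offset, then read from it."""
    os.lseek(fd, blockno * BLOCK, os.SEEK_SET)
    return os.read(fd, BLOCK)


def read_block_pread(fd: int, blockno: int) -> bytes:
    """One syscall: read at an explicit offset; the file offset is untouched."""
    return os.pread(fd, BLOCK, blockno * BLOCK)


def demo() -> bool:
    """Check that both methods return identical data for every block."""
    with tempfile.TemporaryFile() as f:
        # Write 8 blocks, each filled with a distinct byte value.
        for i in range(8):
            f.write(bytes([i]) * BLOCK)
        f.flush()
        fd = f.fileno()
        return all(
            read_block_seek(fd, n) == read_block_pread(fd, n) == bytes([n]) * BLOCK
            for n in range(8)
        )


if __name__ == "__main__":
    print(demo())
```

Both helpers return the same bytes; the difference is one kernel entry per block instead of two.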

I totally agree with the entry citation ("show up numbers first!"), but
I think I have shown numbers;)

I'd love to get the 9.4 mio IOPS right through MD and XFS up to PG
(yeah, I know, PG does 8kB, but it'll be similar).

Cheers,
/Tobias

PS:
This isn't academic, as we have experience (in prod) with a similarly
designed box and PostgreSQL used as a data-warehouse.

We are using an internal tool to parallelize via sessions and this box
is completely CPU bound (same NVMes, 3TB RAM as the new one, but only 48
cores and no HT).

Squeezing out CPU and improving CPU usage efficiency is hence very
important for us.

Re: [HACKERS] lseek/read/write overhead becomes visible at scale ..

From

Andres Freund

Date:

24 January 2017, 20:18:41

Hi,

On 2017-01-24 18:11:09 +0100, Tobias Oberstein wrote:
> I have done lots of benchmarking over the last days on a massive box, and I
> can provide numbers that I think show that the impact can be significant.

> Above number was using psync FIO engine .. with libaio, it's at 9.7 mio with
> much lower CPU load - but this doesn't apply to PG of course.

> Switching to sync engine, it drops to 9.1 mio - but the system load then is
> also much higher!

I doubt those have very much to do with postgres - I'd quite strongly
assume that it'd get more than swamped with doing actual work, and with
buffering the frequently accessed stuff in memory.

> What I am trying to say is: the syscall overhead of doing lseek/read/write
> instead of pread/pwrite does become visible and hurts at a certain point.

Sure - but the question is whether it's measurable when you do actual
work.

I'm much less against this change than Tom, but doing artificial syscall
microbenchmark seems unlikely to make a big case for using it in
postgres, where it's part of vastly more expensive operations (like
actually reading data afterwards, exclusive locks, ...).

> This isn't academic, as we have experience (in prod) with a similarly
> designed box and PostgreSQL used as a data-warehouse.
> 
> We are using an internal tool to parallelize via sessions and this box is
> completely CPU bound (same NVMes, 3TB RAM as the new one, but only 48 cores
> and no HT).

I'd welcome seeing profiles of that - I'm working quite heavily on
speeding up analytics workloads for pg.

Greetings,

Andres Freund

Re: [HACKERS] lseek/read/write overhead becomes visible at scale ..

From

Tobias Oberstein

Date:

24 January 2017, 20:37:14

Hi,

>> Switching to sync engine, it drops to 9.1 mio - but the system load then is
>> also much higher!
>
> I doubt those have very much to do with postgres - I'd quite strongly

In the machine in production, we see 8kB reads in the 300k-650k/s range. 
In spikes, because, yes, due to the 3TB RAM, we have high buffer hit 
ratios as well.

> assume that it'd get more than swamped with doing actual work, and with
> buffering the frequently accessed stuff in memory.
>
>
>> What I am trying to say is: the syscall overhead of doing lseek/read/write
>> instead of pread/pwrite does become visible and hurts at a certain point.
>
> Sure - but the question is whether it's measurable when you do actual
> work.

The syscall overhead is visible in production too .. I watched PG using 
perf live, and lseeks regularly appear at the top of the list.

> I'm much less against this change than Tom, but doing artificial syscall
> microbenchmark seems unlikely to make a big case for using it in

This isn't a syscall benchmark, but FIO.

> postgres, where it's part of vastly more expensive operations (like
> actually reading data afterwards, exclusive locks, ...).

PG is very CPU hungry, yes. But there are quite some system related 
effects too .. eg we've managed to get down the system load with huge 
pages (big improvement).

>> This isn't academic, as we have experience (in prod) with a similarly
>> designed box and PostgreSQL used as a data-warehouse.
>>
>> We are using an internal tool to parallelize via sessions and this box is
>> completely CPU bound (same NVMes, 3TB RAM as the new one, but only 48 cores
>> and no HT).
>
> I'd welcome seeing profiles of that - I'm working quite heavily on
> speeding up analytics workloads for pg.

Here:

https://github.com/oberstet/scratchbox/raw/master/cruncher/adr_stats/ADR-PostgreSQL-READ-Statistics.pdf

https://github.com/oberstet/scratchbox/tree/master/cruncher/adr_stats

Cheers,
/Tobias

>
>
> Greetings,
>
> Andres Freund
>

Re: [HACKERS] lseek/read/write overhead becomes visible at scale ..

From

Andres Freund

Date:

24 January 2017, 20:41:58

Hi,

On 2017-01-24 18:37:14 +0100, Tobias Oberstein wrote:
> > assume that it'd get more than swamped with doing actual work, and with
> > buffering the frequently accessed stuff in memory.
> > 
> > 
> > > What I am trying to say is: the syscall overhead of doing lseek/read/write
> > > instead of pread/pwrite does become visible and hurts at a certain point.
> > 
> > Sure - but the question is whether it's measurable when you do actual
> > work.
> 
> The syscall overhead is visible in production too .. I watched PG using perf
> live, and lseeks regularly appear at the top of the list.

Could you show such perf profiles? That'll help us.


> > I'm much less against this change than Tom, but doing artificial syscall
> > microbenchmark seems unlikely to make a big case for using it in
> 
> This isn't a syscall benchmark, but FIO.

There's not really a difference between those, when you use fio to
benchmark seek vs pseek.


> > postgres, where it's part of vastly more expensive operations (like
> > actually reading data afterwards, exclusive locks, ...).
> 
> PG is very CPU hungry, yes.

Indeed - working on it ;)


> But there are quite some system related effects
> too .. eg we've managed to get down the system load with huge pages (big
> improvement).

Glad to hear it.


> > I'd welcome seeing profiles of that - I'm working quite heavily on
> > speeding up analytics workloads for pg.
> 
> Here:
> 
> https://github.com/oberstet/scratchbox/raw/master/cruncher/adr_stats/ADR-PostgreSQL-READ-Statistics.pdf
> 
> https://github.com/oberstet/scratchbox/tree/master/cruncher/adr_stats

Thanks, unfortunately those appear to mostly have io / cache hit ratio
related stats?

Greetings,

Andres Freund

Re: [HACKERS] lseek/read/write overhead becomes visible at scale ..

From

Tobias Oberstein

Date:

24 January 2017, 20:57:47

Hi,

Am 24.01.2017 um 18:41 schrieb Andres Freund:
> Hi,
>
> On 2017-01-24 18:37:14 +0100, Tobias Oberstein wrote:
>>> assume that it'd get more than swamped with doing actual work, and with
>>> buffering the frequently accessed stuff in memory.
>>>
>>>
>>>> What I am trying to say is: the syscall overhead of doing lseek/read/write
>>>> instead of pread/pwrite does become visible and hurts at a certain point.
>>>
>>> Sure - but the question is whether it's measurable when you do actual
>>> work.
>>
>> The syscall overhead is visible in production too .. I watched PG using perf
>> live, and lseeks regularly appear at the top of the list.
>
> Could you show such perf profiles? That'll help us.

oberstet@bvr-sql18:~$ psql -U postgres -d adr
psql (9.5.4)
Type "help" for help.

adr=# select * from svc_sqlbalancer.f_perf_syscalls();
NOTICE:  starting Linux perf syscalls sampling - be patient, this can take some time ..
NOTICE:  sudo /usr/bin/perf stat -e "syscalls:sys_enter_*"      -x ";" -a sleep 30 2>&1
 pid |                syscall                |   cnt   | cnt_per_sec
-----+---------------------------------------+---------+-------------
     | syscalls:sys_enter_lseek              | 4091584 |      136386
     | syscalls:sys_enter_newfstat           | 2054988 |       68500
     | syscalls:sys_enter_read               |  767990 |       25600
     | syscalls:sys_enter_close              |  503803 |       16793
     | syscalls:sys_enter_newstat            |  434080 |       14469
     | syscalls:sys_enter_open               |  380382 |       12679
     | syscalls:sys_enter_mmap               |  301491 |       10050
     | syscalls:sys_enter_munmap             |  182313 |        6077
     | syscalls:sys_enter_getdents           |  162443 |        5415
     | syscalls:sys_enter_rt_sigaction       |  158947 |        5298
     | syscalls:sys_enter_openat             |   85325 |        2844
     | syscalls:sys_enter_readlink           |   77439 |        2581
     | syscalls:sys_enter_rt_sigprocmask     |   60929 |        2031
     | syscalls:sys_enter_mprotect           |   58372 |        1946
     | syscalls:sys_enter_futex              |   49726 |        1658
     | syscalls:sys_enter_access             |   40845 |        1362
     | syscalls:sys_enter_write              |   39513 |        1317
     | syscalls:sys_enter_brk                |   33656 |        1122
     | syscalls:sys_enter_epoll_wait         |   23776 |         793
     | syscalls:sys_enter_ioctl              |   19764 |         659
     | syscalls:sys_enter_wait4              |   17371 |         579
     | syscalls:sys_enter_newlstat           |   13008 |         434
     | syscalls:sys_enter_exit_group         |   10135 |         338
     | syscalls:sys_enter_recvfrom           |    8595 |         286
     | syscalls:sys_enter_sendto             |    8448 |         282
     | syscalls:sys_enter_poll               |    7200 |         240
     | syscalls:sys_enter_lgetxattr          |    6477 |         216
     | syscalls:sys_enter_dup2               |    5790 |         193

<snip>

Note: there isn't a lot of load currently (this is from production).
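
As a sanity check on the output above: the sampling window is the "sleep 30" in the perf command, so cnt_per_sec should simply be cnt / 30. The sketch below checks that for the first six rows and computes how many lseek calls there are per read call. Only the numbers come from the output above; the variable and function names are illustrative.

```python
SAMPLE_SECONDS = 30  # from "sleep 30" in the perf stat command

# (syscall, cnt, cnt_per_sec) for the first six rows of the output above
rows = [
    ("lseek", 4091584, 136386),
    ("newfstat", 2054988, 68500),
    ("read", 767990, 25600),
    ("close", 503803, 16793),
    ("newstat", 434080, 14469),
    ("open", 380382, 12679),
]


def rates_consistent() -> bool:
    """True if every cnt_per_sec equals cnt / 30, rounded to an integer."""
    return all(round(cnt / SAMPLE_SECONDS) == per_sec for _, cnt, per_sec in rows)


def lseek_per_read() -> float:
    """Number of lseek calls per read call in the sample."""
    counts = {name: cnt for name, cnt, _ in rows}
    return counts["lseek"] / counts["read"]


if __name__ == "__main__":
    print(rates_consistent(), round(lseek_per_read(), 2))
```

The ratio comes out at a bit over five lseek calls per read, so paired lseek+read calls alone cannot account for all of them. Keep in mind that perf stat -a counts system-wide, not only PostgreSQL backends.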

>>> I'm much less against this change than Tom, but doing artificial syscall
>>> microbenchmark seems unlikely to make a big case for using it in
>>
>> This isn't a syscall benchmark, but FIO.
>
> There's not really a difference between those, when you use fio to
> benchmark seek vs pseek.

Sorry, I don't understand what you are talking about.

>>> postgres, where it's part of vastly more expensive operations (like
>>> actually reading data afterwards, exclusive locks, ...).
>>
>> PG is very CPU hungry, yes.
>
> Indeed - working on it ;)
>
>
>> But there are quite some system related effects
>> too .. eg we've managed to get down the system load with huge pages (big
>> improvement).
>
> Glad to hear it.

With 3TB RAM, huge pages is absolutely essential (otherwise, the system 
bogs down in TLB etc overhead).

>>> I'd welcome seeing profiles of that - I'm working quite heavily on
>>> speeding up analytics workloads for pg.
>>
>> Here:
>>
>> https://github.com/oberstet/scratchbox/raw/master/cruncher/adr_stats/ADR-PostgreSQL-READ-Statistics.pdf
>>
>> https://github.com/oberstet/scratchbox/tree/master/cruncher/adr_stats
>
> Thanks, unfortunately those appear to mostly have io / cache hit ratio
> related stats?

Yep, this was just to prove that we are really running a DWH workload at
scale;)

Cheers,
/Tobias

>
> Greetings,
>
> Andres Freund
>

Re: [HACKERS] lseek/read/write overhead becomes visible at scale ..

From

Andres Freund

Date:

24 January 2017, 21:11:21

Hi,

On 2017-01-24 18:57:47 +0100, Tobias Oberstein wrote:
> Am 24.01.2017 um 18:41 schrieb Andres Freund:
> > On 2017-01-24 18:37:14 +0100, Tobias Oberstein wrote:
> > > The syscall overhead is visible in production too .. I watched PG using perf
> > > live, and lseeks regularly appear at the top of the list.
> > 
> > Could you show such perf profiles? That'll help us.
> 
> oberstet@bvr-sql18:~$ psql -U postgres -d adr
> psql (9.5.4)
> Type "help" for help.
> 
> adr=# select * from svc_sqlbalancer.f_perf_syscalls();
> NOTICE:  starting Linux perf syscalls sampling - be patient, this can take
> some time ..
> NOTICE:  sudo /usr/bin/perf stat -e "syscalls:sys_enter_*"      -x ";" -a
> sleep 30 2>&1
>  pid |                syscall                |   cnt   | cnt_per_sec
> -----+---------------------------------------+---------+-------------
>      | syscalls:sys_enter_lseek              | 4091584 |      136386
>      | syscalls:sys_enter_newfstat           | 2054988 |       68500
>      | syscalls:sys_enter_read               |  767990 |       25600
>      | syscalls:sys_enter_close              |  503803 |       16793
>      | syscalls:sys_enter_newstat            |  434080 |       14469
>      | syscalls:sys_enter_open               |  380382 |       12679
>      | syscalls:sys_enter_mmap               |  301491 |       10050
>      | syscalls:sys_enter_munmap             |  182313 |        6077
>      | syscalls:sys_enter_getdents           |  162443 |        5415
>      | syscalls:sys_enter_rt_sigaction       |  158947 |        5298
>      | syscalls:sys_enter_openat             |   85325 |        2844
>      | syscalls:sys_enter_readlink           |   77439 |        2581
>      | syscalls:sys_enter_rt_sigprocmask     |   60929 |        2031
>      | syscalls:sys_enter_mprotect           |   58372 |        1946
>      | syscalls:sys_enter_futex              |   49726 |        1658
>      | syscalls:sys_enter_access             |   40845 |        1362
>      | syscalls:sys_enter_write              |   39513 |        1317
>      | syscalls:sys_enter_brk                |   33656 |        1122
>      | syscalls:sys_enter_epoll_wait         |   23776 |         793
>      | syscalls:sys_enter_ioctl              |   19764 |         659
>      | syscalls:sys_enter_wait4              |   17371 |         579
>      | syscalls:sys_enter_newlstat           |   13008 |         434
>      | syscalls:sys_enter_exit_group         |   10135 |         338
>      | syscalls:sys_enter_recvfrom           |    8595 |         286
>      | syscalls:sys_enter_sendto             |    8448 |         282
>      | syscalls:sys_enter_poll               |    7200 |         240
>      | syscalls:sys_enter_lgetxattr          |    6477 |         216
>      | syscalls:sys_enter_dup2               |    5790 |         193
> 
> <snip>
> 
> Note: there isn't a lot of load currently (this is from production).

That doesn't really mean that much - sure it shows that lseek is
frequent, but it doesn't tell you how much impact this has to the
overall workload.  For that you'd need a generic (i.e. not syscall
tracepoint, but cpu cycle) perf profile, and look in the call graph (via
perf report --children) how much of that is below the lseek syscall.


> > > > I'm much less against this change than Tom, but doing artificial syscall
> > > > microbenchmark seems unlikely to make a big case for using it in
> > > 
> > > This isn't a syscall benchmark, but FIO.
> > 
> > There's not really a difference between those, when you use fio to
> > benchmark seek vs pseek.
> 
> Sorry, I don't understand what you are talking about.

Fio as you appear to have used is a microbenchmark benchmarking
individual syscalls.


> > > > postgres, where it's part of vastly more expensive operations (like
> > > > actually reading data afterwards, exclusive locks, ...).
> > > 
> > > PG is very CPU hungry, yes.
> > 
> > Indeed - working on it ;)
> > 
> > 
> > > But there are quite some system related effects
> > > too .. eg we've managed to get down the system load with huge pages (big
> > > improvement).
> > 
> > Glad to hear it.
> 
> With 3TB RAM, huge pages is absolutely essential (otherwise, the system bogs
> down in TLB etc overhead).

I was one of the people working on adding hugepage support to pg, that's
why I was glad ;)


Regards,

Andres

Re: [HACKERS] lseek/read/write overhead becomes visible at scale ..

From

Tobias Oberstein

Date:

24 January 2017, 21:25:52

Hi,

>>  pid |                syscall                |   cnt   | cnt_per_sec
>> -----+---------------------------------------+---------+-------------
>>      | syscalls:sys_enter_lseek              | 4091584 |      136386
>>      | syscalls:sys_enter_newfstat           | 2054988 |       68500
>>      | syscalls:sys_enter_read               |  767990 |       25600
>>      | syscalls:sys_enter_close              |  503803 |       16793
>>      | syscalls:sys_enter_newstat            |  434080 |       14469
>>      | syscalls:sys_enter_open               |  380382 |       12679
>>
>> Note: there isn't a lot of load currently (this is from production).
>
> That doesn't really mean that much - sure it shows that lseek is
> frequent, but it doesn't tell you how much impact this has to the

Above is on a mostly idle system ("idle" for our loads) .. when things 
get hot, lseek calls can reach into the millions/sec.

Doing 5 million syscalls per sec comes with overhead no matter how 
lightweight the syscall is, doesn't it?

Using pread instead of lseek+read halves the syscalls.

I really don't understand what you are fighting here ..

> overall workload.  For that you'd need a generic (i.e. not syscall
> tracepoint, but cpu cycle) perf profile, and look in the call graph (via
> perf report --children) how much of that is below the lseek syscall.

I see. I might find time to extend our helper function f_perf_syscalls.

>>>>> I'm much less against this change than Tom, but doing artificial syscall
>>>>> microbenchmark seems unlikely to make a big case for using it in
>>>>
>>>> This isn't a syscall benchmark, but FIO.
>>>
>>> There's not really a difference between those, when you use fio to
>>> benchmark seek vs pseek.
>>
>> Sorry, I don't understand what you are talking about.
>
> Fio as you appear to have used is a microbenchmark benchmarking
> individual syscalls.

I am benchmarking IOPS, and while doing so, it becomes apparent that at 
these scales it does matter _how_ IO is done.

The most efficient way is libaio. I get 9.7 million/sec IOPS with low 
CPU load. Using any synchronous IO engine is slower and produces higher 
load.

I do understand that switching to libaio isn't going to fly for PG 
(completely different approach). But doing pread instead of lseek+read 
seems simple enough. But then, I don't know about the PG codebase ..

Among the synchronous methods of doing IO, psync is much better than sync.

pvsync, pvsync2 and pvsync2 + hipri (busy polling, no interrupts) are 
better, but the gain is smaller, and all of them are inferior to libaio.

>>> Glad to hear it.
>>
>> With 3TB RAM, huge pages is absolutely essential (otherwise, the system bogs
>> down in TLB etc overhead).
>
> I was one of the people working on adding hugepage support to pg, that's
> why I was glad ;)

Ahh;) Sorry, wasn't aware. This is really invaluable. Thanks for that!

Cheers,
/Tobias

Re: [HACKERS] lseek/read/write overhead becomes visible at scale ..

From

Alvaro Herrera

Date:

24 January 2017, 21:36:13

Tobias Oberstein wrote:

> I am benchmarking IOPS, and while doing so, it becomes apparent that at
> these scales it does matter _how_ IO is done.
> 
> The most efficient way is libaio. I get 9.7 million/sec IOPS with low CPU
> load. Using any synchronous IO engine is slower and produces higher load.
> 
> I do understand that switching to libaio isn't going to fly for PG
> (completely different approach).

Maybe it is possible to write a new f_smgr implementation (parallel to
md.c) that uses libaio.  There is no "seek" in that interface, at least,
though the interface does assume that the implementation is blocking.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [HACKERS] lseek/read/write overhead becomes visible at scale ..

From

Andres Freund

Date:

24 January 2017, 21:59:45

On 2017-01-24 15:36:13 -0300, Alvaro Herrera wrote:
> Tobias Oberstein wrote:
> 
> > I am benchmarking IOPS, and while doing so, it becomes apparent that at
> > these scales it does matter _how_ IO is done.
> > 
> > The most efficient way is libaio. I get 9.7 million/sec IOPS with low CPU
> > load. Using any synchronous IO engine is slower and produces higher load.
> > 
> > I do understand that switching to libaio isn't going to fly for PG
> > (completely different approach).
> 
> Maybe it is possible to write a new f_smgr implementation (parallel to
> md.c) that uses libaio.  There is no "seek" in that interface, at least,
> though the interface does assume that the implementation is blocking.

For it to be beneficial we'd need to redesign the IO stack above that so
much that it'd be basically not recognizable (since we'd need to
actually use async io for it to be beneficial). Using libaio IIRC still
requires O_DIRECT, so we'd need to take more care with ordering of writeback
etc too - we got closer with 9.6, but we're still far away from it.
Besides that, it's also not always that clear when AIO would be
beneficial, since a lot of the synchronous IO is actually synchronous
for a reason.

Andres

Re: [HACKERS] lseek/read/write overhead becomes visible at scale ..

From

Andres Freund

Date:

24 January 2017, 22:07:05

On 2017-01-24 19:25:52 +0100, Tobias Oberstein wrote:
> Hi,
> 
> > >  pid |                syscall                |   cnt   | cnt_per_sec
> > > -----+---------------------------------------+---------+-------------
> > >      | syscalls:sys_enter_lseek              | 4091584 |      136386
> > >      | syscalls:sys_enter_newfstat           | 2054988 |       68500
> > >      | syscalls:sys_enter_read               |  767990 |       25600
> > >      | syscalls:sys_enter_close              |  503803 |       16793
> > >      | syscalls:sys_enter_newstat            |  434080 |       14469
> > >      | syscalls:sys_enter_open               |  380382 |       12679
> > > 
> > > Note: there isn't a lot of load currently (this is from production).
> > 
> > That doesn't really mean that much - sure it shows that lseek is
> > frequent, but it doesn't tell you how much impact this has to the
> 
> Above is on a mostly idle system ("idle" for our loads) .. when things get
> hot, lseek calls can reach into the millions/sec.
> 
> Doing 5 million syscalls per sec comes with overhead no matter how
> lightweight the syscall is, doesn't it?

> Using pread instead of lseek+read halves the syscalls.
> 
> I really don't understand what you are fighting here ..

Sure, there's some overhead. And as I said upthread, I'm much less
against this change than Tom.  What I'm saying is that your benchmarks
haven't shown a benefit in a meaningful way, so I don't think I can
agree with

> "Well, my point remains that I see little value in messing with
> long-established code if you can't demonstrate a benefit that's clearly
> above the noise level."
> 
> I have done lots of benchmarking over the last days on a massive box, and I
> can provide numbers that I think show that the impact can be significant.

since you've not actually shown that the impact is above the noise level
when measured with an actual postgres workload.

Greetings,

Andres Freund

Re: [HACKERS] lseek/read/write overhead becomes visible at scale ..

From

Tobias Oberstein

Date:

25 January 2017, 11:51:53

Hi Alvaro,

Am 24.01.2017 um 19:36 schrieb Alvaro Herrera:
> Tobias Oberstein wrote:
>
>> I am benchmarking IOPS, and while doing so, it becomes apparent that at
>> these scales it does matter _how_ IO is done.
>>
>> The most efficient way is libaio. I get 9.7 million/sec IOPS with low CPU
>> load. Using any synchronous IO engine is slower and produces higher load.
>>
>> I do understand that switching to libaio isn't going to fly for PG
>> (completely different approach).
>
> Maybe it is possible to write a new f_smgr implementation (parallel to
> md.c) that uses libaio.  There is no "seek" in that interface, at least,
> though the interface does assume that the implementation is blocking.
>

FWIW, I now systematically compared the IO performance when normalized 
for system load induced over different IO methods.

I use the FIO ioengine terminology:

sync = lseek/read/write
psync = pread/pwrite

Here:

https://github.com/oberstet/scratchbox/raw/master/cruncher/engines-compared/normalized-iops.pdf

Conclusion:

psync has 1.15x the normalized IOPS compared to sync
libaio has up to 6.5x the normalized IOPS compared to sync

---

These measurements were done on 16 NVMe block devices.

As mentioned, when Linux MD comes into the game, the difference between 
sync and psync is much higher - there is lock contention in MD.

The reason for that is: when MD comes into the game, even our massive 
CPU cannot hide the inefficiency of the double syscalls anymore.

This MD issue is our bigger problem (compared to PG using sync/psync). I
am going to post to the linux-raid list about that, as advised by the
FIO developers.

---

That being said, regarding getting maximum performance out of NVMes with 
minimal system load, the real deal probably isn't libaio either, but 
kernel bypass (hinted to me by FIO devs):

http://www.spdk.io/

FIO has a plugin for SPDK, which I am going to explore to establish a 
final conclusive baseline for maximum IOPS normalized for load.

There are similar approaches in networking (BSD netmap, DPDK) to bypass 
the kernel altogether (zero copy to userland, no interrupts but polling 
etc). With hardware like this (NVMe, 100GbE etc), the kernel gets in the 
way ..

Anyway, this is now probably OT as for PG;)

Cheers,
/Tobias

Re: [HACKERS] lseek/read/write overhead becomes visible at scale ..

From

Tobias Oberstein

Date:

25 January 2017, 12:16:32

Hi Andres,

>> Using pread instead of lseek+read halves the syscalls.
>>
>> I really don't understand what you are fighting here ..
>
> Sure, there's some overhead. And as I said upthread, I'm much less
> against this change than Tom.  What I'm saying is that your benchmarks
> haven't shown a benefit in a meaningful way, so I don't think I can
> agree with
>
>> "Well, my point remains that I see little value in messing with
>> long-established code if you can't demonstrate a benefit that's clearly
>> above the noise level."
>>
>> I have done lots of benchmarking over the last days on a massive box, and I
>> can provide numbers that I think show that the impact can be significant.
>
> since you've not actually shown that the impact is above the noise level
> when measured with an actual postgres workload.

I can follow that.

So real proof cannot be done with FIO, but "actual PG workload".

Synthetic PG workload or real world production workload?

Also: rgd the perf profiles from production that show lseek as #1 syscall.

You said it wouldn't be proof either, because it only shows number of
syscalls, and though it is clear that millions of syscalls/sec do come 
with overhead, it is still not showing "above noise" level relevance 
(because PG is such a CPU hog in itself anyways;)

So how would I do a perf profile that would be acceptable as proof?

Maybe I can expand our

https://gist.github.com/oberstet/ca03d7ab49be4c8edb70ffa1a9fe160c

profiling function.

Cheers,
/Tobias

Re: [HACKERS] lseek/read/write overhead becomes visible at scale ..

From

Andres Freund

Date:

25 January 2017, 22:52:38

Hi,

On 2017-01-25 10:16:32 +0100, Tobias Oberstein wrote:
> > > Using pread instead of lseek+read halves the syscalls.
> > > 
> > > I really don't understand what you are fighting here ..
> > 
> > Sure, there's some overhead. And as I said upthread, I'm much less
> > against this change than Tom.  What I'm saying is that your benchmarks
> > haven't shown a benefit in a meaningful way, so I don't think I can
> > agree with
> > 
> > > "Well, my point remains that I see little value in messing with
> > > long-established code if you can't demonstrate a benefit that's clearly
> > > above the noise level."
> > > 
> > > I have done lots of benchmarking over the last days on a massive box, and I
> > > can provide numbers that I think show that the impact can be significant.
> > 
> > since you've not actually shown that the impact is above the noise level
> > when measured with an actual postgres workload.
> 
> I can follow that.
> 
> So real proof cannot be done with FIO, but "actual PG workload".

Right.


> Synthetic PG workload or real world production workload?

Both might work, production-like has bigger pull, but I'd guess
synthetic is good enough.


> Also: rgd the perf profiles from production that show lseek as #1 syscall.

You'll, depending on your workload, still have a lot of lseeks even if
we were to use pread/pwrite because we do lseek(SEEK_END) to get file
sizes.
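
For readers unfamiliar with the idiom: seeking to the end of a file returns the resulting offset, which is the file size in bytes. A minimal Python sketch of that (the temporary file and the function names are illustrative only; PG does this in C):

```python
import os
import tempfile


def size_via_lseek(fd: int) -> int:
    """Return the file size by seeking to the end.

    This moves the file offset, so a later plain read() would need
    another lseek() first; pread() would not.
    """
    return os.lseek(fd, 0, os.SEEK_END)


def demo():
    """Return the size as seen by lseek(SEEK_END) and by fstat()."""
    with tempfile.TemporaryFile() as f:
        f.write(b"x" * 8192 * 3)  # three 8kB pages
        f.flush()
        fd = f.fileno()
        return size_via_lseek(fd), os.fstat(fd).st_size


if __name__ == "__main__":
    print(demo())
```

These size probes are separate from the lseek calls that position a read or write, so replacing lseek+read with pread would not remove them.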


> You said it wouldn't be proof either, because it only shows number of
> syscalls, and though it is clear that millions of syscalls/sec do come with
> overhead, it is still not showing "above noise" level relevance (because PG
> is such a CPU hog in itself anyways;)

Yep.


> So how would I do a perf profile that would be acceptable as proof?

You'd have to look at cpu time, not number of syscalls.  IIRC I
suggested doing a cycles profile with -g and then using "perf report
--children" to see how many cycles are spent somewhere below lseek.

I'd also suggest sharing a cycles profile, it's quite likely
that the overhead is completely elsewhere.


- Andres

Re: [HACKERS] lseek/read/write overhead becomes visible at scale ..

From

Tobias Oberstein

Date:

26 January 2017, 00:27:40

Hi,

>> Synthetic PG workload or real world production workload?
>
> Both might work, production-like has bigger pull, but I'd guess
> synthetic is good enough.

Thanks! The box should get PostgreSQL in the not too distant future. 
It'll get a backup from prod, but will act as new prod, so it might take 
some time until a job can be run and a profile collected.

>> So how would I do a perf profile that would be acceptable as proof?
>
> You'd have to look at cpu time, not number of syscalls.  IIRC I
> suggested doing a cycles profile with -g and then using "perf report
> --children" to see how many cycles are spent somewhere below lseek.

Understood. Either profile manually or expand the function.

> I'd also suggest sharing a cycles profile, it's quite likely
> that the overhead is completely elsewhere.

Yeah, could be. It'll be interesting to see for sure. I should get a 
chance to collect such a profile and then I'll post it back here -

/Tobias

Re: [HACKERS] lseek/read/write overhead becomes visible at scale ..

From

Robert Haas

Date:

22 June 2017, 19:43:16

On Wed, Jan 25, 2017 at 2:52 PM, Andres Freund <andres@anarazel.de> wrote:
> You'll, depending on your workload, still have a lot of lseeks even if
> we were to use pread/pwrite because we do lseek(SEEK_END) to get file
> sizes.

I'm pretty convinced that the lseek overhead that we're incurring
right now is excessive.  I mean, the Linux kernel guys fixed lseek to
scale better more or less specifically because of PostgreSQL, which
indicates that we're hitting it harder than most people.[1] And, more
concretely, I've seen strace -c output where the time spent in lseek
is far ahead of any other system call -- so if lseek overhead is
negligible, then all of our system call overhead taken together is
negligible, too.

Having said that, it's probably not a big percentage of our runtime
right now -- on normal workloads, it's probably some number of tenths
of one percent. But I'm not sure that's a good reason to ignore it.
The more we CPU-optimize other things (say, expression evaluation!)
the more significant the things that remain will become.  And we've
certainly made performance fixes to save far fewer cycles than we're
talking about here[2].

I'm no longer very sure fixing this is a very simple thing to do,
partly because of the use of lseek to get the file size which you note
above, and partly because of the possibility that this may, for
example, break read-ahead, as Tom worried about previously[3].  But I
think dismissing this as not-really-a-problem is the wrong approach.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[1] https://www.postgresql.org/message-id/201110282133.18125.andres@anarazel.de
[2] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=2781b4bea7db357be59f9a5fd73ca1eb12ff5a79
[3] https://www.postgresql.org/message-id/6352.1471461075%40sss.pgh.pa.us

Re: [HACKERS] lseek/read/write overhead becomes visible at scale ..

From

Andres Freund

Date:

22 June 2017, 19:50:48

On 2017-06-22 12:43:16 -0400, Robert Haas wrote:
> On Wed, Jan 25, 2017 at 2:52 PM, Andres Freund <andres@anarazel.de> wrote:
> > You'll, depending on your workload, still have a lot of lseeks even if
> > we were to use pread/pwrite because we do lseek(SEEK_END) to get file
> > sizes.
> 
> I'm pretty convinced that the lseek overhead that we're incurring
> right now is excessive.

No argument there.


> I mean, the Linux kernel guys fixed lseek to
> scale better more or less specifically because of PostgreSQL, which
> indicates that we're hitting it harder than most people.[1] And, more
> concretely, I've seen strace -c output where the time spent in lseek
> is far ahead of any other system call -- so if lseek overhead is
> negligible, then all of our system call overhead taken together is
> negligible, too.

That'll partially be because syscalls are where the kernel "prefers" to
switch between processes, and lseek is the most frequent one.


> Having said that, it's probably not a big percentage of our runtime
> right now -- on normal workloads, it's probably some number of tenths
> of one percent. But I'm not sure that's a good reason to ignore it.
> The more we CPU-optimize other things (say, expression evaluation!)
> the more significant the things that remain will become.  And we've
> certainly made performance fixes to save far fewer cycles than we're
> talking about here[2].

Well, there's some complexity / simplicity tradeoffs as everywhere ;)


> I'm no longer very sure fixing this is a very simple thing to do,
> partly because of the use of lseek to get the file size which you note
> above, and partly because of the possibility that this may, for
> example, break read-ahead, as Tom worried about previously[3].  But I
> think dismissing this as not-really-a-problem is the wrong approach.

I suspect this'll become a larger problem once we fix a few other
issues.  Right now I've a hard time measuring this, but if we'd keep
file sizes cached in shared memory, and we'd use direct IO, then we'd
potentially be able to have high enough IO throughput for this to
matter.  At the moment the 8kB memcpy()s (instead of DMA into the user
buffer) are nearly always going to dwarf the overhead of the lseek().

Greetings,

Andres Freund

Re: [HACKERS] lseek/read/write overhead becomes visible at scale ..

From

Thomas Munro

Date:

16 April 2018, 08:40:35

On Fri, Jun 23, 2017 at 4:50 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2017-06-22 12:43:16 -0400, Robert Haas wrote:
>> On Wed, Jan 25, 2017 at 2:52 PM, Andres Freund <andres@anarazel.de> wrote:
>> > You'll, depending on your workload, still have a lot of lseeks even if
>> > we were to use pread/pwrite because we do lseek(SEEK_END) to get file
>> > sizes.
>>
>> I'm pretty convinced that the lseek overhead that we're incurring
>> right now is excessive.
>
> No argument there.

My 2c:

* every comparable open source system I looked at uses pread() if it's available
* speedups have been claimed
* it's also been claimed that readahead heuristics are not defeated on
Linux or FreeBSD, which isn't too surprising because you'd expect it
to be about blocks being faulted in, not syscalls
* just in case there exists an operating system that has pread() but
doesn't do readahead in that case, we could provide a compile-time
option to select the fallback mode (until such time as you can get
that bug fixed in your OS?)
* syscalls aren't getting cheaper, and this is a 2-for-1 deal, what's
not to like?

+1 for adopting pread()/pwrite() in PG12.

I understand that the use of lseek() to find file sizes is a different
problem and unrelated.
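
A minimal sketch of the "2-for-1" point, assuming a POSIX system. Both function names are hypothetical and chosen only for this comparison.

```c
#define _XOPEN_SOURCE 700
#include <sys/types.h>
#include <unistd.h>

/* Two syscalls: position the descriptor, then read from that position. */
ssize_t
read_block_seek(int fd, void *buf, size_t len, off_t offset)
{
    if (lseek(fd, offset, SEEK_SET) < 0)
        return -1;
    return read(fd, buf, len);
}

/* One syscall: read at an explicit offset; the file position is untouched. */
ssize_t
read_block_pread(int fd, void *buf, size_t len, off_t offset)
{
    return pread(fd, buf, len, offset);
}
```

Both variants return the same bytes for the same offset; the second simply does not touch the descriptor's shared file position.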

-- 
Thomas Munro
http://www.enterprisedb.com

Re: [HACKERS] lseek/read/write overhead becomes visible at scale ..

From

Andrew Gierth

Date:

16 April 2018, 09:13:30

>>>>> "Thomas" == Thomas Munro <thomas.munro@enterprisedb.com> writes:

 Thomas> * it's also been claimed that readahead heuristics are not
 Thomas> defeated on Linux or FreeBSD, which isn't too surprising
 Thomas> because you'd expect it to be about blocks being faulted in,
 Thomas> not syscalls

I don't know about linux, but on FreeBSD, readahead/writebehind is
tracked at the level of open files but implemented at the level of
read/write clustering. I have patched kernels in the past to improve the
performance in mixed read/write cases; pg would benefit on unpatched
kernels from using separate file opens for backend reads and writes.
(The typical bad scenario is doing a create index, or other seqscan that
updates hint bits, on a freshly-restored table; the alternation of
reading block N and writing block N-x destroys the readahead/writebehind
since they use a common offset.)

The code that detects sequential behavior can not distinguish between
pread() and lseek+read, it looks only at the actual offset of the
current request compared to the previous one for the same fp.
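
The offset comparison described above can be sketched roughly like this. It is a deliberately simplified model with hypothetical names, not FreeBSD's actual implementation; real kernels track considerably more state.

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical per-open-file state; real kernels keep more than this. */
typedef struct SeqState
{
    off_t   next_offset;    /* where a sequential follow-up would start */
    int     seq_count;      /* length of the current sequential run */
} SeqState;

/*
 * Record one I/O request and report whether it continued a sequential run.
 * Only the request's offset and length are visible here; whether the
 * offset came from lseek()+read() or from pread() makes no difference.
 */
bool
seq_note_request(SeqState *st, off_t offset, size_t len)
{
    bool    sequential = (offset == st->next_offset);

    st->seq_count = sequential ? st->seq_count + 1 : 0;
    st->next_offset = offset + (off_t) len;
    return sequential;
}
```

In this model, alternating a read of block N with a write of block N-x through the same state resets the run on every request, which matches the bad scenario described above.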

 Thomas> +1 for adopting pread()/pwrite() in PG12.

ditto

-- 
Andrew (irc:RhodiumToad)

Re: [HACKERS] lseek/read/write overhead becomes visible at scale ..

From

Robert Haas

Date:

25 April 2018, 21:41:44

On Mon, Apr 16, 2018 at 2:13 AM, Andrew Gierth
<andrew@tao11.riddles.org.uk> wrote:
> The code that detects sequential behavior can not distinguish between
> pread() and lseek+read, it looks only at the actual offset of the
> current request compared to the previous one for the same fp.
>
>  Thomas> +1 for adopting pread()/pwrite() in PG12.
>
> ditto

Likewise.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] lseek/read/write overhead becomes visible at scale ..

From

Andres Freund

Date:

25 April 2018, 23:33:33

On 2018-04-25 14:41:44 -0400, Robert Haas wrote:
> On Mon, Apr 16, 2018 at 2:13 AM, Andrew Gierth
> <andrew@tao11.riddles.org.uk> wrote:
> > The code that detects sequential behavior can not distinguish between
> > pread() and lseek+read, it looks only at the actual offset of the
> > current request compared to the previous one for the same fp.
> >
> >  Thomas> +1 for adopting pread()/pwrite() in PG12.
> >
> > ditto
> 
> Likewise.

+1 as well. Medium term I foresee usage of at least pwritev(), and
possibly also preadv(). Being able to write out multiple buffers at once
is pretty crucial if we ever want to do direct IO.
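
As a rough sketch of what "multiple buffers at once" means, assuming a platform that provides pwritev() (Linux with glibc, or a BSD; it is not part of POSIX). The function name is hypothetical, and BLCKSZ is set to 8192 here as an assumption matching the 8kB blocks discussed in this thread.

```c
#define _DEFAULT_SOURCE
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#define BLCKSZ 8192     /* assumed block size for this sketch */

/*
 * Write two separate in-memory blocks to adjacent file positions with a
 * single syscall.  Returns the number of bytes written, or -1 on error.
 * A short write is possible, so real code would have to check and retry.
 */
ssize_t
write_two_blocks(int fd, const void *a, const void *b, off_t offset)
{
    struct iovec iov[2];

    iov[0].iov_base = (void *) a;
    iov[0].iov_len = BLCKSZ;
    iov[1].iov_base = (void *) b;
    iov[1].iov_len = BLCKSZ;

    return pwritev(fd, iov, 2, offset);
}
```

The two buffers need not be contiguous in memory, which is the property that matters when neighbouring blocks live in unrelated buffer slots.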

Greetings,

Andres Freund

Re: [HACKERS] lseek/read/write overhead becomes visible at scale ..

From

Thomas Munro

Date:

25 May 2018, 08:58:41

On Thu, Apr 26, 2018 at 8:33 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2018-04-25 14:41:44 -0400, Robert Haas wrote:
>> On Mon, Apr 16, 2018 at 2:13 AM, Andrew Gierth
>> <andrew@tao11.riddles.org.uk> wrote:
>> > The code that detects sequential behavior can not distinguish between
>> > pread() and lseek+read, it looks only at the actual offset of the
>> > current request compared to the previous one for the same fp.
>> >
>> >  Thomas> +1 for adopting pread()/pwrite() in PG12.
>> >
>> > ditto
>>
>> Likewise.
>
> +1 as well. Medium term I foresee usage of at least pwritev(), and
> possibly also preadv(). Being able to write out multiple buffers at once
> is pretty crucial if we ever want to do direct IO.

Also if we ever use threads and want to share file descriptors we'd
have to use it.

CC'ing Oskari Saarenmaa who proposed a patch for this a couple of years back[1].

Oskari, would you like to update your patch and post it for the
September commitfest?  At first glance, it probably needs autoconf-fu
to check if pread()/pwrite() are supported and fallback code, so
someone should update the patch to do that or explain why it's not
needed based on standards we require.  At least Windows apparently
needs special handling (ReadFile() and WriteFile() with an OVERLAPPED
object).
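
One possible shape for such fallback code, as a sketch only. HAVE_PREAD stands in for whatever macro a configure check would define, and pg_pread_sketch is a hypothetical name; neither is taken from the patch referenced here, and the Windows case is not covered.

```c
#define _XOPEN_SOURCE 700
#include <sys/types.h>
#include <unistd.h>

/*
 * Read len bytes at offset.  HAVE_PREAD is a hypothetical macro that a
 * configure-time check would define when pread() is available.
 */
ssize_t
pg_pread_sketch(int fd, void *buf, size_t len, off_t offset)
{
#ifdef HAVE_PREAD
    return pread(fd, buf, len, offset);
#else
    /* Fallback: two syscalls, and the file position moves as a side effect. */
    if (lseek(fd, offset, SEEK_SET) < 0)
        return -1;
    return read(fd, buf, len);
#endif
}
```

Callers of the fallback must not rely on the file position being preserved, which is the one behavioural difference between the two branches.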

Practically speaking, are there any Unix-like systems outside museums
that don't have it?  According to the man pages I looked at, this
stuff is from System V R4 (1988) and appeared in ancient BSD strains
too.  Hmm, I suppose it's possible that pademelon and gaur don't: they
apparently run HP-UX 10.20 (1996) which Wikipedia tells me is derived
from System V R3!  I can see that current HP-UX does have them... but
unfortunately their man pages don't have a HISTORY section.

FWIW these functions just showed up in the latest POSIX standard[2]
(issue 7, 2017/2018?), having moved from "XSI option" to "base".

[1] https://www.postgresql.org/message-id/flat/7fdcb664-4f8a-8626-75df-ffde85005829%40ohmu.fi
[2] http://pubs.opengroup.org/onlinepubs/9699919799/functions/pread.html

-- 
Thomas Munro
http://www.enterprisedb.com