Re: [HACKERS] Re: Anyone have experience benchmarking very high effective_io_concurrency on NVME's? - Mailing list pgsql-hackers

From Andres Freund
Subject Re: [HACKERS] Re: Anyone have experience benchmarking very high effective_io_concurrency on NVME's?
Date
Msg-id 20171101034941.oovo5ovfgppwlord@alap3.anarazel.de
In response to Re: [HACKERS] Re: Anyone have experience benchmarking very high effective_io_concurrency on NVME's?  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses Re: [HACKERS] Re: Anyone have experience benchmarking very high effective_io_concurrency on NVME's?  (Craig Ringer <craig@2ndquadrant.com>)
List pgsql-hackers
Hi,

On 2017-10-31 18:47:06 +0100, Tomas Vondra wrote:
> On 10/31/2017 04:48 PM, Greg Stark wrote:
> > On 31 October 2017 at 07:05, Chris Travers <chris.travers@adjust.com> wrote:
> >> Hi;
> >>
> >> After Andres's excellent talk at PGConf we tried benchmarking
> >> effective_io_concurrency on some of our servers and found that those
> >> which have a number of NVME storage volumes could not fill the I/O
> >> queue even at the maximum setting (1000).
> >
> > And was the system still i/o bound? If the cpu was 100% busy then
> > perhaps Postgres just can't keep up with the I/O system. It would
> > depend on workload though, if you start many very large sequential
> > scans you may be able to push the i/o system harder.
> >
> > Keep in mind effective_io_concurrency only really affects bitmap
> > index scans (and to a small degree index scans). It works by issuing
> > posix_fadvise() calls for upcoming buffers one by one. That gets
> > multiple spindles active but it's not really going to scale to many
> > thousands of prefetches (and effective_io_concurrency of 1000
> > actually means 7485 prefetches). At some point those i/o are going
> > to start completing before Postgres even has a chance to start
> > processing the data.
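For reference, the 7485 figure quoted above can be reproduced with a few
lines of Python. This is only a sketch, assuming the prefetch target for a
setting of n is the sum of n/i for i = 1..n (n times the n-th harmonic
number); the function name prefetch_target is made up for illustration and
is not a PostgreSQL identifier.

```python
def prefetch_target(io_concurrency: int) -> float:
    """Sum of n/i for i = 1..n, i.e. n times the n-th harmonic number.

    Assumption: this is how the setting maps to a prefetch distance;
    it is consistent with the "1000 means 7485" remark in the thread.
    """
    return sum(io_concurrency / i for i in range(1, io_concurrency + 1))


if __name__ == "__main__":
    for n in (1, 10, 100, 1000):
        print(n, round(prefetch_target(n)))
```

The target grows only slightly faster than linearly (by a factor of roughly
ln n), so the maximum setting asks for thousands of outstanding hints, not
millions.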

Note that even if they finish well before postgres gets around to
looking at the block, you might still be seeing benefits. SSDs benefit
from larger reads, and a deeper queue gives more chances for reordering
/ coalescing of requests. Won't benefit the individual reader, but
might help the overall capacity of the system.


> Yeah, initiating the prefetches is not expensive, but it's not free
> either. So there's a trade-off between time spent on prefetching and
> processing the data.

Right. It'd probably be good to be a bit more adaptive here. But it's
hard to do with posix_fadvise - we'd need an operation that actually
notifies us of IO completion.  If we were using, say, asynchronous
direct IO, we could initiate the request and regularly check how many
blocks ahead of the current window are already completed and adjust the
queue based on that, rather than just firing off fadvises and hoping for
the best.
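
To make the "adjust the queue based on that" idea a bit more concrete, here
is a purely hypothetical feedback rule in Python. Nothing like this exists
in PostgreSQL as far as this mail is concerned; the function name, the
thresholds and the limits are all assumptions chosen for illustration. It
only presumes that some completion-aware I/O interface can report how many
of the blocks ahead of the reader have already finished.

```python
def adjust_prefetch_distance(distance: int,
                             completed_ahead: int,
                             min_distance: int = 1,
                             max_distance: int = 1024) -> int:
    """Return a new prefetch distance given how many prefetched blocks
    ahead of the reader have already completed.

    Hypothetical policy (all thresholds are arbitrary assumptions):
    - nothing ahead is complete: the reader is about to stall, so look
      much further ahead (double the distance);
    - more than half the window is already complete: we are prefetching
      earlier than needed, so back off by one block;
    - otherwise keep the current distance.
    """
    if completed_ahead == 0:
        new_distance = distance * 2
    elif completed_ahead > distance // 2:
        new_distance = distance - 1
    else:
        new_distance = distance
    # Clamp to the allowed range.
    return max(min_distance, min(max_distance, new_distance))
```

The asymmetry (grow fast, shrink slowly) is a design choice borrowed from
common congestion-control style heuristics: a stall costs far more than a
few hints issued too early.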


> I believe this may be actually illustrated using Amdahl's law - the I/O
> is the parallel part, and processing the data is the serial part. And no
> matter what you do, the device only has so much bandwidth, which defines
> the maximum possible speedup (compared to "no prefetch" case).

Right.
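
As a worked illustration of the Amdahl's law framing, here is a small
Python sketch. The 80% I/O share and the overlap factors are invented
numbers, not measurements from this thread.

```python
def amdahl_speedup(parallel_fraction: float, parallel_speedup: float) -> float:
    """Amdahl's law: overall speedup when only `parallel_fraction` of the
    run time benefits from a speedup of `parallel_speedup`."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / parallel_speedup)


if __name__ == "__main__":
    # Assumed example: 80% of the unprefetched run time is waiting on I/O
    # (the parallel part), 20% is processing the data (the serial part).
    for overlap in (2, 10, 100, 1000):
        print(overlap, round(amdahl_speedup(0.8, overlap), 2))
```

With those assumed numbers the speedup can never exceed 1 / 0.2 = 5x no
matter how deep the queue gets, and the device's bandwidth caps the
achievable overlap factor well before that.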


> Furthermore, the device does not wait for all the I/O requests to be
> submitted - it won't wait for 1000 requests and then go "OMG! There's a
> lot of work to do!" It starts processing the requests as they arrive,
> and some of them will complete before you're done with submitting the
> rest, so you'll never see all the requests in the queue at once.

It'd be interesting to see how much it helps to scale the size of
readahead requests with the distance from the current read
iterator. E.g. if you're less than 16 blocks away from the current head,
issue size 1, up to 32 issue a 2 block request for consecutive blocks.
I suspect it won't help because at least linux's block layer / io
elevator seems quite successful at merging. E.g. for the query:
EXPLAIN ANALYZE SELECT sum(l_quantity) FROM lineitem where l_receiptdate between '1993-05-03' and '1993-08-03';
on a tpc-h scale dataset on my laptop, I see:
Device       r/s   w/s   rMB/s  wMB/s    rrqm/s  wrqm/s  %rrqm  %wrqm  r_await  w_await  aqu-sz  rareq-sz  wareq-sz  svctm   %util
sda     25702.00  0.00  495.27   0.00  37687.00    0.00  59.45   0.00     5.13     0.00  132.09     19.73      0.00   0.04  100.00

but it'd be worthwhile to see whether doing the merging ourselves allows
for deeper queues.
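
A rough Python sketch of what "doing the merging ourselves" could look
like. Both helpers are hypothetical. The 16 and 32 block thresholds come
from the suggestion above; continuing to double beyond that, and the cap of
16 blocks per request, are assumptions added here.

```python
def request_size(distance: int, max_blocks: int = 16) -> int:
    """Blocks per readahead request for a block `distance` ahead of the
    current read position: 1 below 16, 2 below 32, then (assumption)
    doubling with each doubling of the distance, capped at `max_blocks`."""
    size = 1
    threshold = 16
    while distance >= threshold and size < max_blocks:
        size *= 2
        threshold *= 2
    return size


def coalesce(blocks, max_blocks):
    """Merge sorted block numbers into (start, length) runs of consecutive
    blocks, with no run longer than `max_blocks`."""
    runs = []
    for blkno in blocks:
        if runs:
            start, length = runs[-1]
            if start + length == blkno and length < max_blocks:
                # Block directly follows the previous run: extend it.
                runs[-1] = (start, length + 1)
                continue
        runs.append((blkno, 1))
    return runs
```

For example, coalesce([10, 11, 12, 20, 21, 30], 2) yields the runs
(10, 2), (12, 1), (20, 2), (30, 1), so six single-block hints would become
four requests.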


I think we really should start incorporating explicit prefetching in
more places. Ordered indexscans might actually be one case that's not
too hard to do in a simple manner - whenever at an inner node, prefetch
the leaf nodes below it. We obviously could do better, but that might be
a decent starting point to get some numbers.
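
To illustrate the shape of that idea outside of the backend, here is a
minimal Python sketch that issues posix_fadvise(POSIX_FADV_WILLNEED) hints
for a list of child block numbers, as one might do for the leaf pages
referenced by an inner index page. It is not PostgreSQL code: the function
name is hypothetical, the caller is assumed to already know the child block
numbers, and it relies on os.posix_fadvise, which is only available on
platforms that provide the underlying call (e.g. Linux).

```python
import os

BLCKSZ = 8192  # PostgreSQL's default block size, in bytes


def prefetch_blocks(fd: int, block_numbers) -> int:
    """Hint the kernel that the given blocks of `fd` will be read soon.

    Returns the number of hints issued. The hints are advisory only:
    there is no notification when (or whether) the reads complete.
    """
    issued = 0
    for blkno in block_numbers:
        os.posix_fadvise(fd, blkno * BLCKSZ, BLCKSZ, os.POSIX_FADV_WILLNEED)
        issued += 1
    return issued
```

A caller would invoke prefetch_blocks(fd, child_blocks) on reaching an
inner page and then descend as usual; by the time the scan steps to the
next leaf, its read will ideally already be in flight or complete.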

Greetings,

Andres Freund

