Thread: effective_io_concurrency and NVMe devices

effective_io_concurrency and NVMe devices

From
Bruce Momjian
Date:
NVMe devices have a maximum queue length of 64k:

    https://blog.westerndigital.com/nvme-queues-explained/

but our effective_io_concurrency maximum is 1,000:

    test=> set effective_io_concurrency = 1001;
    ERROR:  1001 is outside the valid range for parameter "effective_io_concurrency" (0 .. 1000)

Should we increase its maximum to 64k?  Backpatched?  (SATA has a
maximum queue length of 256.)

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson




Re: effective_io_concurrency and NVMe devices

From
Nathan Bossart
Date:
On Tue, Apr 19, 2022 at 10:56:05PM -0400, Bruce Momjian wrote:
> NVMe devices have a maximum queue length of 64k:
> 
>     https://blog.westerndigital.com/nvme-queues-explained/
> 
> but our effective_io_concurrency maximum is 1,000:
> 
>     test=> set effective_io_concurrency = 1001;
>     ERROR:  1001 is outside the valid range for parameter "effective_io_concurrency" (0 .. 1000)
> 
> Should we increase its maximum to 64k?  Backpatched?  (SATA has a
> maximum queue length of 256.)

If there are demonstrable improvements with higher values, this seems
reasonable to me.  I would even suggest removing the limit completely so
this doesn't need to be revisited in the future.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: effective_io_concurrency and NVMe devices

From
David Rowley
Date:
On Wed, 20 Apr 2022 at 14:56, Bruce Momjian <bruce@momjian.us> wrote:
> NVMe devices have a maximum queue length of 64k:

> Should we increase its maximum to 64k?  Backpatched?  (SATA has a
> maximum queue length of 256.)

I have a machine here with 1 x PCIe 3.0 NVMe SSD and also 1 x PCIe 4.0
NVMe SSD. I ran a few tests to see how different values of
effective_io_concurrency would affect performance. I tried to come up
with a query that did little enough CPU processing to ensure that I/O
was the clear bottleneck.

The test was with a 128GB table on a machine with 64GB of RAM.  I
padded the tuples out so there were 4 per page so that the aggregation
didn't have much work to do.

The query I ran was: explain (analyze, buffers, timing off) select
count(p) from r where a = 1;

Here's what I saw:

NVME PCIe 3.0 (Samsung 970 Evo 1TB)
 e_i_c   query_time_ms
     0       88627.221
     1      652915.192
     5      271536.054
    10      141168.986
   100       67340.026
  1000       70686.596
 10000       70027.938
100000       70106.661

Saw a max of 991 MB/sec in iotop

NVME PCIe 4.0 (Samsung 980 Pro 1TB)
 e_i_c   query_time_ms
     0       59306.960
     1      956170.704
     5      237879.121
    10      135004.111
   100       55662.030
  1000       51513.717
 10000       59807.824
100000       53443.291

Saw a max of 1126 MB/sec in iotop

I'm not pretending that this is the best query and table size to show
it, but at least this test shows that there's not much to gain by
prefetching further.   I imagine going further than we need to is
likely to have negative consequences due to populating the kernel page
cache with buffers that won't be used for a while. I also imagine
going too far out likely increases the risk that buffers we've
prefetched are evicted before they're used.

This does also highlight that an effective_io_concurrency of 1 (the
default) is pretty terrible in this test.  The bitmap contained every
2nd page. I imagine that would break normal page prefetching by the
kernel. If that's true, then it does not explain why e_i_c = 0 was so
fast.

I've attached the test setup that I did. I'm open to modifying the
test and running again if someone has an idea that might show benefits
to larger values for effective_io_concurrency.

David

Attachment

Re: effective_io_concurrency and NVMe devices

From
Tomas Vondra
Date:
On 4/21/22 10:14, David Rowley wrote:
> On Wed, 20 Apr 2022 at 14:56, Bruce Momjian <bruce@momjian.us> wrote:
>> NVMe devices have a maximum queue length of 64k:
> 
>> Should we increase its maximum to 64k?  Backpatched?  (SATA has a
>> maximum queue length of 256.)
> 
> I have a machine here with 1 x PCIe 3.0 NVMe SSD and also 1 x PCIe 4.0
> NVMe SSD. I ran a few tests to see how different values of
> effective_io_concurrency would affect performance. I tried to come up
> with a query that did little enough CPU processing to ensure that I/O
> was the clear bottleneck.
> 
> The test was with a 128GB table on a machine with 64GB of RAM.  I
> padded the tuples out so there were 4 per page so that the aggregation
> didn't have much work to do.
> 
> The query I ran was: explain (analyze, buffers, timing off) select
> count(p) from r where a = 1;
> 
> Here's what I saw:
> 
> NVME PCIe 3.0 (Samsung 970 Evo 1TB)
> e_i_c query_time_ms
> 0 88627.221
> 1 652915.192
> 5 271536.054
> 10 141168.986
> 100 67340.026
> 1000 70686.596
> 10000 70027.938
> 100000 70106.661
> 
> Saw a max of 991 MB/sec in iotop
> 
> NVME PCIe 4.0 (Samsung 980 Pro 1TB)
> e_i_c query_time_ms
> 0 59306.960
> 1 956170.704
> 5 237879.121
> 10 135004.111
> 100 55662.030
> 1000 51513.717
> 10000 59807.824
> 100000 53443.291
> 
> Saw a max of 1126 MB/sec in iotop
> 
> I'm not pretending that this is the best query and table size to show
> it, but at least this test shows that there's not much to gain by
> prefetching further.   I imagine going further than we need to is
> likely to have negative consequences due to populating the kernel page
> cache with buffers that won't be used for a while. I also imagine
> going too far out likely increases the risk that buffers we've
> prefetched are evicted before they're used.
> 

Not sure.

I don't think the risk of polluting the cache is very high, because 1k
buffers is 8MB and 64k would be 512MB. That's significant, but likely
just a tiny fraction of the available memory in machines with NVMe.
Sure, there may be multiple sessions doing prefetch, but chances are the
sessions touch the same data etc.
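
The sizes above follow from multiplying the prefetch distance by the
block size. A minimal sketch of that arithmetic, assuming the default
8kB block size and reading "1k" as 1024 and "64k" as 65536 (the function
name is made up for illustration):

```python
BLOCK_SIZE = 8192  # assumed default PostgreSQL block size, in bytes


def prefetch_window_bytes(distance, block_size=BLOCK_SIZE):
    """Upper bound on data one backend can have prefetched ahead of itself."""
    return distance * block_size


for distance in (1024, 65536):
    mib = prefetch_window_bytes(distance) / (1024 * 1024)
    print(f"distance={distance}: {mib:.0f} MiB")
```

With several sessions prefetching disjoint data, the bound would scale
with the number of sessions.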

> This does also highlight that an effective_io_concurrency of 1 (the
> default) is pretty terrible in this test.  The bitmap contained every
> 2nd page. I imagine that would break normal page prefetching by the
> kernel. If that's true, then it does not explain why e_i_c = 0 was so
> fast.
> 

Yeah, this default is clearly pretty unfortunate. I think the problem is
that an async request is not free, i.e. prefetching means

  async request + read

and the prefetch trick is in assuming that

  cost(async request) << cost(read)

and moving the read to a background thread. But NVMe devices make reads
cheaper, so the amount of work moved to the background thread gets
lower, while the cost of the async request remains roughly the same.
Which means the difference (benefit) decreases over time.
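
A toy version of that cost argument, as a sketch only. The cost units
and numbers are hypothetical, and the model assumes the background read
always finishes before the page is needed:

```python
def foreground_cost(read_cost, request_cost, prefetch):
    """Time the backend itself spends per page (arbitrary units)."""
    # With prefetch, the read is moved to the background, but issuing
    # the async request is still paid in the foreground.
    return request_cost if prefetch else read_cost


def prefetch_benefit(read_cost, request_cost):
    """Foreground time saved per page by prefetching."""
    without = foreground_cost(read_cost, request_cost, prefetch=False)
    with_prefetch = foreground_cost(read_cost, request_cost, prefetch=True)
    return without - with_prefetch


# Hypothetical numbers: the request cost stays fixed while reads get cheaper.
for read_cost in (100, 20, 5):
    print(read_cost, prefetch_benefit(read_cost, request_cost=5))
```

Once the read cost approaches the request cost, the benefit in this
model drops to zero, and any extra overhead makes prefetching a loss.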

Also, recent NVMe devices (like Intel Optane) aim to require lower queue
depths, so although the NVMe spec supports 64k queues and 64k commands
per queue, that does not mean you need to use that many requests to get
good performance.

As for the strange behavior with e_i_c=0, I think this can be explained
by how NVMe devices work internally. A simplified model of an NVMe device is "slow"
flash with a DRAM cache, and AFAIK the data is not read from flash into
DRAM in 8kB pages but larger chunks. So even if there's no explicit OS
readahead, the device may still cache larger chunks in the DRAM buffer.


> I've attached the test setup that I did. I'm open to modifying the
> test and running again if someone has an idea that might show benefits
> to larger values for effective_io_concurrency.
> 

I think it'd be interesting to test different / less regular patterns,
not just every 2nd page etc.

The other idea I had while looking at batching a while back, is that we
should batch the prefetches. The current logic interleaves prefetches
with other work - prefetch one page, process one page, ... But once
reading a page gets sufficiently fast, this means the queues never get
deep enough for optimizations. So maybe we should think about batching
the prefetches, in some way. Unfortunately posix_fadvise does not allow
batching of requests, but we can at least stop interleaving the requests.

The attached patch is a trivial version that waits until we're at least
32 pages behind the target, and then prefetches all of them. Maybe give
it a try? (This pretty much disables prefetching for e_i_c below 32, but
for an experimental patch that's enough.)
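
To illustrate the difference in request ordering, here is a small
simulation. It is not the patch itself, just a sketch under simplifying
assumptions: pages are numbered 0..n-1, the prefetch target is always
"current page + distance" (no ramp-up), and the function name and the
"batch" parameter are made up for illustration.

```python
def schedule(n_pages, distance, batch=1):
    """Return the order of ("prefetch", page) / ("read", page) operations.

    batch=1 models the interleaved logic; batch=32 models waiting until
    we are at least 32 pages behind the target before prefetching.
    """
    ops = []
    next_prefetch = 0
    for current in range(n_pages):
        # Never prefetch a page we are about to read anyway.
        next_prefetch = max(next_prefetch, current + 1)
        target = min(current + distance, n_pages - 1)
        behind = target - next_prefetch + 1
        if behind >= batch:
            for page in range(next_prefetch, target + 1):
                ops.append(("prefetch", page))
            next_prefetch = target + 1
        ops.append(("read", current))
    return ops


print(schedule(5, 2, batch=1))
```

With distance=10 and batch=32 the sketch never issues a prefetch, which
matches the remark about e_i_c below 32.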


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment

Re: effective_io_concurrency and NVMe devices

From
Tomas Vondra
Date:
Hi,

I've been looking at this a bit more, investigating the regression. I
was wondering how come no one noticed/reported this issue before, since
we have "1" as the default value since 9.5.

So either this behaves very differently on modern flash/NVMe storage, or
maybe it somehow depends on the dataset / access pattern.

Note: I don't have access to a machine with NVMe at the moment, so I did
all the tests on my usual machine with SATA SSDs. I plan to run the same
tests on NVMe once the bigger machine is available, but I think that'll
lead mostly to the same conclusions. So let me present the results now,
including the scripts so David can run those tests on their machine.


From now on, I'll refer to two storage devices:

1) SSD RAID - 6x Intel S3600 100GB SATA, in RAID0

2) SSD SINGLE - Intel Series 320, 120GB

The machine is pretty small, with just 8GB of RAM and i5-2500k (4C) CPU.


Firstly, I remembered there were some prefetching benchmarks [1], so I
repeated those. I don't have the same SSD as Merlin, but the general
behavior should be similar.

   e_i_c       1     2     4     8    16    32    64   128   256
   -------------------------------------------------------------
   timing   46.3  49.3  29.1  23.2  22.1  20.7  20.0  19.3  19.2
     diff   100%  106%   63%   50%   48%   45%   43%   42%   41%

The second line is simply the timing relative to the first column.
Merlin did not include timing for e_i_c=0 (I think that was a valid value,
meaning "disabled", even back then).
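
For reference, the "diff" row can be recomputed from the timings like
this (a small sketch; the variable names are arbitrary):

```python
# Timings from the table above, keyed by e_i_c.
timings = {1: 46.3, 2: 49.3, 4: 29.1, 8: 23.2, 16: 22.1,
           32: 20.7, 64: 20.0, 128: 19.3, 256: 19.2}

baseline = timings[1]  # the first column is the reference
relative = {eic: round(100 * t / baseline) for eic, t in timings.items()}

for eic, pct in relative.items():
    print(f"e_i_c={eic}: {pct}%")
```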

In any case, those results show significant improvements compared to
e_i_c=1 as prefetch increases.

When I run the same query on scale 3000, including eic=0:

  e_i_c        0     1     2     4      8    16    32    64   128   256
  ---------------------------------------------------------------------
  ssd       29.4  49.4  33.9  25.2   31.9  27.2  28.0  29.3  27.6  27.6
  ssd       100%  168%  115%   86%   108%   92%   95%  100%   94%   94%
  ---------------------------------------------------------------------
  ssd raid  10.7  74.2  51.2  30.6   24.0  13.8  14.6  14.3  14.1  14.0
  ssd raid  100%  691%  477%  285%   224%  129%  137%  134%  132%  131%

Notice that ignoring the eic=0 value (no prefetch), the behavior is
pretty similar to what Merlin reported - consistent improvements as the
eic value increases. Ultimately it gets close to eic=0, but not faster
(at least not significantly).

FWIW I actually tried running this on 9.3, and the behavior is the same.

So I guess the behavior is the same, but it misses that eic=1 actually
may be making it much worse (compared to eic=0). The last para in [1]
actually says:

  > Interesting that at setting of '2' (the lowest possible setting with
  > the feature actually working) is pessimal.

which sounds a bit like '1' does nothing (no prefetch). But that's not
(and was not) the case, I think. But we don't have the results for eic=0
unfortunately.

Note: We stopped using the complex prefetch distance calculation since
then, but we can ignore that here I think.


The other problem with reproducing/interpreting those results is it's
unclear whether the query was executed immediately after "pgbench -i" or
sometime later (after a bunch of transactions were done). Consider the
query is:

   select * from pgbench_accounts
    where aid between 1000 and 50000000 and abalance != 0;

and right after initialization the accounts will be almost perfectly
sequential. So the query will match a continuous range of pages
(roughly 1/6 of the whole table). But updates may be shuffling rows
around, making the I/O access pattern more random (but I'm not sure how
much, I'd expect most updates to fit on the same page).

This might explain the poor results (compared to eic=0). Sequential
access is great for readahead (in the OS and also internal in SSD),
which makes our prefetch pretty unnecessary / perhaps even actively harmful.

And the same explanation applies to David's query - that's also almost
perfectly sequential, AFAICS.

But that just raises the question - how does the prefetch work for other
access patterns, with pages not this sequential, but spread randomly
through the table?

So I constructed a couple datasets, with different patterns, generated
by the attached bash script. The table has this structure:

   CREATE TABLE t (a int, padding text)

and "a" has values between 0 and 1000, and the script generates data so
that each page contains 27 rows with the same "a" value. This allows us
to write queries matching an arbitrary fraction of the table. For example
we can say "a BETWEEN 10 AND 20" which matches 1%, etc.

Furthermore, the pages are either independent (each with a different
value) or with longer streaks of the same value.

The script generates these data sets:

 random:   each page gets a random "a" value
 random-8: each sequence of 8 pages gets a random value
 random-32: each sequence of 32 pages gets a random value
 sequential: split into 1000 sequences, values 0, 1, 2, ...
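
The attached script is bash; purely as an illustration of the four
layouts, here is a sketch that assigns an "a" value to each page. It
assumes 1000 distinct values (0..999) and that random-32 means runs of
32 pages; the function and parameter names are hypothetical.

```python
import random


def page_values(n_pages, kind, n_values=1000, seed=0):
    """Return the "a" value stored on each page for a given layout."""
    rng = random.Random(seed)
    if kind == "sequential":
        # n_values equally long runs with values 0, 1, 2, ...
        return [page * n_values // n_pages for page in range(n_pages)]
    run = {"random": 1, "random-8": 8, "random-32": 32}[kind]
    values = []
    while len(values) < n_pages:
        # Each run of `run` consecutive pages shares one random value.
        values.extend([rng.randrange(n_values)] * run)
    return values[:n_pages]


print(page_values(16, "random-8"))
```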

And then the script runs queries matching a random subset of the table,
with fractions 1%, 5%, 10%, 25% and 50% (queries with different
selectivity). The ranges are generated at random, it's just the length
of the range that matters.

The script also restarts the database and drops caches, so that the
prefetch actually does something.

Attached are CSV files with a complete run from the two SSD devices, if
you want to dig in. But the two PDFs are a better "visualization" of
performance compared to "no prefetch" (eic=0).

The "tables" PDF shows timing compared to eic=0, so 100% means "the
same" and 200% "twice as slow". Or by color - red is "slower" (bad) while
green is "faster" (good).

The "charts" PDF shows essentially the same thing (duration compared to
eic=0), but as a chart with "eic" on the x-axis. In principle, we want all the
values to be "below" the 100% line.


I think there are three obvious observations we can make from the tables
and charts:

1) The higher the selectivity, the worse.

2) The more sequential the data, the worse.

3) These two things "combine".


For example on the "random" data, prefetching works perfectly fine for
queries matching 1%, 5% and 10% even for eic=1. But queries matching 25%
and 50% get much slower with eic=1 and need much higher values to even
break even.

The less random data sets make it worse and worse. With random-32 all
query cases (even 1%) require at least eic=4 to break even,
and with "sequential" it never happens.

I'd bet the NVMe devices will behave mostly the same way, after all
David showed the same issue for prefetching on sequential data. I'm not
sure about the "more random" cases, because one of the supposed
advantages of modern NVMe devices is they require lower queue depth.

This may also explain why we haven't received any reports - most queries
probably match either a tiny fraction of the data, or the data is mostly
random. So prefetching either helps, or at least is not too harmful.


I think this can be explained mostly by OS read-ahead and/or internal
caching on SSD devices, which works pretty well for sequential accesses
and "our" prefetching may be either unnecessary (essentially a little
bit of extra overhead) or interfering with it - changing the access
pattern so that OS does not recognize/trigger the read-ahead, or maybe
evicting the interesting pages from internal device cache.


What can we do about this? AFAICS it shouldn't be difficult to look at
the bitmap generated by the bitmap index scan, and analyze it - that
will tell us what fraction of pages match, and also how sequential the
patterns are. And based on that we can either adjust prefetching
distance, or maybe even disable prefetching for cases matching too many
pages or "too sequential". Of course, that'll require some heuristics or
a simple "cost model".
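
A rough sketch of what such an analysis might compute, assuming the
bitmap is available as a list of matching block numbers. The function
names and the thresholds are hypothetical, not a proposal for actual
values:

```python
def bitmap_stats(blocks, total_pages):
    """Return (fraction of pages matched, share of adjacent block pairs)."""
    blocks = sorted(set(blocks))
    fraction = len(blocks) / total_pages
    adjacent = sum(1 for a, b in zip(blocks, blocks[1:]) if b == a + 1)
    sequential = adjacent / (len(blocks) - 1) if len(blocks) > 1 else 0.0
    return fraction, sequential


def should_prefetch(blocks, total_pages, max_fraction=0.25, max_sequential=0.5):
    """Hypothetical heuristic: skip prefetch for dense or sequential bitmaps."""
    fraction, sequential = bitmap_stats(blocks, total_pages)
    return fraction <= max_fraction and sequential <= max_sequential


# A bitmap with every 2nd page: half the table, no directly adjacent blocks.
print(bitmap_stats(range(0, 100, 2), 100))
```

Note that an "every 2nd page" pattern has no adjacent blocks at all, so
a real heuristic would probably need to look at gaps rather than strict
adjacency.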



regards


[1]
https://www.postgresql.org/message-id/CAHyXU0yiVvfQAnR9cyH%3DHWh1WbLRsioe%3DmzRJTHwtr%3D2azsTdQ%40mail.gmail.com


-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment

RE: effective_io_concurrency and NVMe devices

From
Jakub Wartak
Date:
Hi Nathan,

> > NVMe devices have a maximum queue length of 64k:
[..]
> > but our effective_io_concurrency maximum is 1,000:
[..]
> > Should we increase its maximum to 64k?  Backpatched?  (SATA has a
> > maximum queue length of 256.)
>
> If there are demonstrable improvements with higher values, this seems
> reasonable to me.  I would even suggest removing the limit completely so
> this doesn't need to be revisited in the future.

Well, are there any? I remember playing with this (although for ANALYZE Stephen's case [1]) and got quite contrary
results [2] -- see going to 16 from 8 actually degraded performance.
I somehow struggle to understand how 1000+ fadvise() syscalls would be a net benefit on storage with latency of
~0.1..0.3ms, as each syscall on its own is overhead (quite contrary, it should help on very slow one?)
Pardon if I'm wrong (I don't have time to look up the code now), but maybe Bitmap Scans/fadvise() logic would first need to
perform some fadvise() offset/length aggregations to bigger fadvise() syscalls and in the end real hardware observable
I/O concurrency would be bigger (assuming that fs/LVM/dm/mq layer would split that into more parallel IOs).
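
To make the aggregation idea concrete, here is a minimal sketch that
merges adjacent block numbers into (offset, length) ranges. It assumes
8kB blocks and is only an illustration (the function name is made up),
not what the Bitmap Scan code does today:

```python
def coalesce_blocks(blocks, block_size=8192):
    """Merge adjacent block numbers into (offset, length) byte ranges."""
    ranges = []
    for block in sorted(set(blocks)):
        offset = block * block_size
        if ranges and ranges[-1][0] + ranges[-1][1] == offset:
            # Extends the previous range: grow it instead of adding a new one.
            ranges[-1] = (ranges[-1][0], ranges[-1][1] + block_size)
        else:
            ranges.append((offset, block_size))
    return ranges


print(coalesce_blocks([0, 1, 2, 5, 6, 9]))
```

Each resulting range could then be passed to a single fadvise() call,
e.g. os.posix_fadvise(fd, offset, length, os.POSIX_FADV_WILLNEED) in
Python on platforms that provide it.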

-J.

[1] - https://commitfest.postgresql.org/30/2799/
[2] -
https://www.postgresql.org/message-id/flat/VI1PR0701MB69603A433348EDCF783C6ECBF6EF0@VI1PR0701MB6960.eurprd07.prod.outlook.com







RE: effective_io_concurrency and NVMe devices

From
Jakub Wartak
Date:
Hi Tomas,

> > I have a machine here with 1 x PCIe 3.0 NVMe SSD and also 1 x PCIe 4.0
> > NVMe SSD. I ran a few tests to see how different values of
> > effective_io_concurrency would affect performance. I tried to come up
> > with a query that did little enough CPU processing to ensure that I/O
> > was the clear bottleneck.
> >
> > The test was with a 128GB table on a machine with 64GB of RAM.  I
> > padded the tuples out so there were 4 per page so that the aggregation
> > didn't have much work to do.
> >
> > The query I ran was: explain (analyze, buffers, timing off) select
> > count(p) from r where a = 1;

> The other idea I had while looking at batching a while back, is that we should
> batch the prefetches. The current logic interleaves prefetches with other work -
> prefetch one page, process one page, ... But once reading a page gets
> sufficiently fast, this means the queues never get deep enough for
> optimizations. So maybe we should think about batching the prefetches, in some
> way. Unfortunately posix_fadvise does not allow batching of requests, but we
> can at least stop interleaving the requests.

.. for now it doesn't, but IORING_OP_FADVISE is on the long-term horizon.

> The attached patch is a trivial version that waits until we're at least
> 32 pages behind the target, and then prefetches all of them. Maybe give it a try?
> (This pretty much disables prefetching for e_i_c below 32, but for an
> experimental patch that's enough.)

I've tried it at e_i_c=10 initially on David's setup.sql, and most defaults s_b=128MB, dbsize=8kb but with forced
disabled parallel query (for easier inspection with strace just to be sure//so please don't compare times).

run:
a) master (e_i_c=10)  181760ms, 185680ms, 185384ms @ ~ 340MB/s and 44k IOPS (~122k IOPS practical max here for libaio)
b) patched(e_i_c=10)  237774ms, 236326ms, ..as you stated it disabled prefetching, fadvise() not occurring
c) patched(e_i_c=128) 90430ms, 88354ms, 85446ms, 78475ms, 74983ms, 81432ms (mean=83186ms +/- 5947ms) @ ~570MB/s and 75k
IOPS (it even peaked for a second on ~122k)
d) master (e_i_c=128) 116865ms, 101178ms, 89529ms, 95024ms, 89942ms 99939ms (mean=98746ms +/- 10118ms) @ ~510MB/s and
65k IOPS (rare peaks to 90..100k IOPS)

~16% benefit sounds good (help me understand: L1i cache?). Maybe it is worth throwing that patch onto more advanced /
complete performance test farm too? (although it's only for bitmap heap scans)
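
The means and the ~16% figure can be re-derived from the individual
timings of runs c) and d), for example like this (a small sketch; the
variable names are arbitrary):

```python
from statistics import mean

patched_ms = [90430, 88354, 85446, 78475, 74983, 81432]   # run c, e_i_c=128
master_ms = [116865, 101178, 89529, 95024, 89942, 99939]  # run d, e_i_c=128

# Relative reduction in mean query time of patched vs. master.
benefit = 1 - mean(patched_ms) / mean(master_ms)

print(f"patched mean = {mean(patched_ms):.0f} ms")
print(f"master mean  = {mean(master_ms):.0f} ms")
print(f"improvement  = {benefit:.1%}")
```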

run a: looked interleaved as you said:
fadvise64(160, 1064157184, 8192, POSIX_FADV_WILLNEED) = 0
pread64(160, "@\0\0\0\200\303/_\0\0\4\0(\0\200\0\0 \4 \0\0\0\0 \230\300\17@\220\300\17"..., 8192, 1064009728) = 8192
fadvise64(160, 1064173568, 8192, POSIX_FADV_WILLNEED) = 0
pread64(160, "@\0\0\0\0\0040_\0\0\4\0(\0\200\0\0 \4 \0\0\0\0 \230\300\17@\220\300\17"..., 8192, 1064026112) = 8192
[..]

BTW: interesting note, for run b, the avgrq-sz from extended iostat is flipping between 16(*512=8kB) to
~256(*512=~128kB!) as if kernel was doing some own prefetching heuristics on and off in cycles, while when
e_i_c/fadvise() is in action then it seems to be always 8kB requests. So with disabled fadvise() one IMHO might have
problems deterministically benchmarking short queries as kernel voodoo might be happening (?)

-J.



Re: effective_io_concurrency and NVMe devices

From
Tomas Vondra
Date:
On 6/7/22 15:29, Jakub Wartak wrote:
> Hi Tomas,
> 
>>> I have a machine here with 1 x PCIe 3.0 NVMe SSD and also 1 x PCIe 4.0
>>> NVMe SSD. I ran a few tests to see how different values of
>>> effective_io_concurrency would affect performance. I tried to come up
>>> with a query that did little enough CPU processing to ensure that I/O
>>> was the clear bottleneck.
>>>
>>> The test was with a 128GB table on a machine with 64GB of RAM.  I
>>> padded the tuples out so there were 4 per page so that the aggregation
>>> didn't have much work to do.
>>>
>>> The query I ran was: explain (analyze, buffers, timing off) select
>>> count(p) from r where a = 1;
>  
>> The other idea I had while looking at batching a while back, is that we should
>> batch the prefetches. The current logic interleaves prefetches with other work -
>> prefetch one page, process one page, ... But once reading a page gets
>> sufficiently fast, this means the queues never get deep enough for
>> optimizations. So maybe we should think about batching the prefetches, in some
>> way. Unfortunately posix_fadvise does not allow batching of requests, but we
>> can at least stop interleaving the requests.
> 
> .. for now it doesn't, but IORING_OP_FADVISE is on the long-term horizon.  
> 

Interesting! Will take time to get into real systems, though.

>> The attached patch is a trivial version that waits until we're at least
>> 32 pages behind the target, and then prefetches all of them. Maybe give it a try?
>> (This pretty much disables prefetching for e_i_c below 32, but for an
>> experimental patch that's enough.)
> 
> I've tried it at e_i_c=10 initially on David's setup.sql, and most defaults s_b=128MB, dbsize=8kb but with forced
> disabled parallel query (for easier inspection with strace just to be sure//so please don't compare times).
> 
> run:
> a) master (e_i_c=10)  181760ms, 185680ms, 185384ms @ ~ 340MB/s and 44k IOPS (~122k IOPS practical max here for
> libaio)
> b) patched(e_i_c=10)  237774ms, 236326ms, ..as you stated it disabled prefetching, fadvise() not occurring
> c) patched(e_i_c=128) 90430ms, 88354ms, 85446ms, 78475ms, 74983ms, 81432ms (mean=83186ms +/- 5947ms) @ ~570MB/s and
> 75k IOPS (it even peaked for a second on ~122k)
> d) master (e_i_c=128) 116865ms, 101178ms, 89529ms, 95024ms, 89942ms 99939ms (mean=98746ms +/- 10118ms) @ ~510MB/s and
> 65k IOPS (rare peaks to 90..100k IOPS)
> 
> ~16% benefit sounds good (help me understand: L1i cache?). Maybe it is worth throwing that patch onto more advanced /
> complete performance test farm too ? (although it's only for bitmap heap scans)
> 
> run a: looked interleaved as you said:
> fadvise64(160, 1064157184, 8192, POSIX_FADV_WILLNEED) = 0
> pread64(160, "@\0\0\0\200\303/_\0\0\4\0(\0\200\0\0 \4 \0\0\0\0 \230\300\17@\220\300\17"..., 8192, 1064009728) = 8192
> fadvise64(160, 1064173568, 8192, POSIX_FADV_WILLNEED) = 0
> pread64(160, "@\0\0\0\0\0040_\0\0\4\0(\0\200\0\0 \4 \0\0\0\0 \230\300\17@\220\300\17"..., 8192, 1064026112) = 8192
> [..]
> 
> BTW: interesting note, for run b, the avgrq-sz from extended iostat jumps is flipping between 16(*512=8kB) to
> ~256(*512=~128kB!) as if kernel was doing some own prefetching heuristics on and off in cycles, while when calling
> e_i_c/fadvise() is in action then it seems to be always 8kB requests. So with disabled fadvise() one IMHO might have
> problems deterministically benchmarking short queries as kernel voodoo might be happening (?)
> 

Yes, kernel certainly does its own read-ahead, which works pretty well
for sequential patterns. What does

   blockdev --getra /dev/...

say?

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



RE: effective_io_concurrency and NVMe devices

From
Jakub Wartak
Date:
> >> The attached patch is a trivial version that waits until we're at
> >> least
> >> 32 pages behind the target, and then prefetches all of them. Maybe give it a
> try?
> >> (This pretty much disables prefetching for e_i_c below 32, but for an
> >> experimental patch that's enough.)
> >
> > I've tried it at e_i_c=10 initially on David's setup.sql, and most defaults
> s_b=128MB, dbsize=8kb but with forced disabled parallel query (for easier
> inspection with strace just to be sure//so please don't compare times).
> >
> > run:
> > a) master (e_i_c=10)  181760ms, 185680ms, 185384ms @ ~ 340MB/s and 44k
> > IOPS (~122k IOPS practical max here for libaio)
> > b) patched(e_i_c=10)  237774ms, 236326ms, ..as you stated it disabled
> > prefetching, fadvise() not occurring
> > c) patched(e_i_c=128) 90430ms, 88354ms, 85446ms, 78475ms, 74983ms,
> > 81432ms (mean=83186ms +/- 5947ms) @ ~570MB/s and 75k IOPS (it even
> > peaked for a second on ~122k)
> > d) master (e_i_c=128) 116865ms, 101178ms, 89529ms, 95024ms, 89942ms
> > 99939ms (mean=98746ms +/- 10118ms) @ ~510MB/s and 65k IOPS (rare peaks
> > to 90..100k IOPS)
> >
> > ~16% benefit sounds good (help me understand: L1i cache?). Maybe it is
> > worth throwing that patch onto more advanced / complete performance
> > test farm too ? (although it's only for bitmap heap scans)

I hope you have some future plans for this patch :)

> Yes, kernel certainly does its own read-ahead, which works pretty well for
> sequential patterns. What does
>
>    blockdev --getra /dev/...
>
> say?

It's default, 256 sectors (128kB) so it matches.
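
The conversion behind both numbers, as a quick sketch. It assumes the
512-byte sector unit that blockdev --getra and iostat's avgrq-sz use;
the function name is made up:

```python
SECTOR_BYTES = 512  # unit used by blockdev --getra and iostat avgrq-sz


def sectors_to_kib(sectors):
    """Convert a size in 512-byte sectors to KiB."""
    return sectors * SECTOR_BYTES / 1024


print(sectors_to_kib(16))   # request size seen with fadvise() active
print(sectors_to_kib(256))  # read-ahead setting / larger requests seen in run b
```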

-J.



Re: effective_io_concurrency and NVMe devices

From
Tomas Vondra
Date:
On 6/8/22 08:29, Jakub Wartak wrote:
>>>> The attached patch is a trivial version that waits until we're at
>>>> least
>>>> 32 pages behind the target, and then prefetches all of them. Maybe give it a
>> try?
>>>> (This pretty much disables prefetching for e_i_c below 32, but for an
>>>> experimental patch that's enough.)
>>>
>>> I've tried it at e_i_c=10 initially on David's setup.sql, and most defaults
>> s_b=128MB, dbsize=8kb but with forced disabled parallel query (for easier
>> inspection with strace just to be sure//so please don't compare times).
>>>
>>> run:
>>> a) master (e_i_c=10)  181760ms, 185680ms, 185384ms @ ~ 340MB/s and 44k
>>> IOPS (~122k IOPS practical max here for libaio)
>>> b) patched(e_i_c=10)  237774ms, 236326ms, ..as you stated it disabled
>>> prefetching, fadvise() not occurring
>>> c) patched(e_i_c=128) 90430ms, 88354ms, 85446ms, 78475ms, 74983ms,
>>> 81432ms (mean=83186ms +/- 5947ms) @ ~570MB/s and 75k IOPS (it even
>>> peaked for a second on ~122k)
>>> d) master (e_i_c=128) 116865ms, 101178ms, 89529ms, 95024ms, 89942ms
>>> 99939ms (mean=98746ms +/- 10118ms) @ ~510MB/s and 65k IOPS (rare peaks
>>> to 90..100k IOPS)
>>>
>>> ~16% benefit sounds good (help me understand: L1i cache?). Maybe it is
>>> worth throwing that patch onto more advanced / complete performance
>>> test farm too ? (although it's only for bitmap heap scans)
> 
> I hope you have some future plans for this patch :)
> 

I think the big challenge is to make this adaptive, i.e. work well for
access patterns that are not known in advance. The existing prefetching
works fine for our random stuff (even for nvme devices), not so much for
sequential (as demonstrated by David's example).

>> Yes, kernel certainly does its own read-ahead, which works pretty well for
>> sequential patterns. What does
>>
>>    blockdev --getra /dev/...
>>
>> say?
> 
> It's default, 256 sectors (128kb) so it matches.
> 

Right. I think this is pretty much why (our) prefetching performs so
poorly on sequential access patterns - the kernel read-ahead works very
well in this case, and our prefetching can't help but can interfere.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company