Thread: Increasing default value for effective_io_concurrency?

Increasing default value for effective_io_concurrency?

From: Tomas Vondra
Hi,

I think we should consider changing the effective_io_concurrency default
value, i.e. the guc that determines how many pages we try to prefetch in
a couple of places (the most important being Bitmap Heap Scan).

The default has been 1 since forever, but in my experience that's hardly
ever the right value, no matter what storage system you use. I've always
ended up with values that are either 0 (i.e. prefetching disabled) or
significantly higher (at least 8 or 16). In fact, e_i_c=1 can easily be
detrimental, depending on the workload and storage system.

That's an issue, because people often don't know how to tune this
setting, and I quite often see systems running with the default value.
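
For reference, the value can be changed without a restart, per session or
even per tablespace - a quick illustration (the tablespace name here is
made up):

    -- check the current value and the allowed range
    SELECT name, setting, min_val, max_val
      FROM pg_settings
     WHERE name = 'effective_io_concurrency';

    -- session-level override
    SET effective_io_concurrency = 8;

    -- per-tablespace override, e.g. for a tablespace on SSDs
    ALTER TABLESPACE fast_ssd SET (effective_io_concurrency = 16);

But few people go through that exercise, so the default matters a lot.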

So I propose increasing the default to a value between 4 and 16.


I'm hardly the first person to notice this, as illustrated for example
by this [1] post by Merlin Moncure on pgsql-hackers from 2017, which
measured this behavior on an Intel S3500 SSD:

  effective_io_concurrency   1:  46.3 sec, ~170 MB/sec peak via iostat
  effective_io_concurrency   2:  49.3 sec, ~158 MB/sec peak via iostat
  effective_io_concurrency   4:  29.1 sec, ~291 MB/sec peak via iostat
  effective_io_concurrency   8:  23.2 sec, ~385 MB/sec peak via iostat
  effective_io_concurrency  16:  22.1 sec, ~409 MB/sec peak via iostat
  effective_io_concurrency  32:  20.7 sec, ~447 MB/sec peak via iostat
  effective_io_concurrency  64:  20.0 sec, ~468 MB/sec peak via iostat
  effective_io_concurrency 128:  19.3 sec, ~488 MB/sec peak via iostat
  effective_io_concurrency 256:  19.2 sec, ~494 MB/sec peak via iostat

That's just one anecdotal example, of course, so I've decided to run a
couple of tests on different storage systems. Attached are the scripts I
used to generate synthetic data sets with data laid out in different
patterns (random vs. regular), and to run queries scanning various
fractions of the table (1%, 5%, ...) with plans using bitmap index scans.
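
Roughly, each benchmarked query looks like this - a simplified sketch
with made-up table/column names (the actual scripts are attached):

    -- force the planner to pick a bitmap heap scan
    SET enable_seqscan = off;
    SET enable_indexscan = off;

    -- the value under test: 0, 1, 4, 16, 64 or 128
    SET effective_io_concurrency = 16;

    -- scan ~5% of the table through the bitmap heap scan
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT count(*) FROM test_table WHERE val <= 5;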

I've done that on three different storage systems:

1) SATA RAID (3 x 7.2k drives in RAID0)
2) SSD RAID (6 x SATA SSD in RAID0)
3) NVMe drive

Attached is a spreadsheet with a summary of results for the tested cases.
In general, the data support what I already wrote above - the current
default is pretty bad.

In some cases it helps a bit, but a slightly higher value (4 or 8)
performs significantly better. Consider for example this "sequential"
data set from the 6xSSD RAID system (x-axis shows e_i_c values, pct means
what fraction of pages matches the query):

    pct         0         1        4         16        64       128
    ---------------------------------------------------------------
      1     25990     18624      3269      2219      2189      2171
      5     88116     60242     14002      8663      8560      8726
     10    120556     99364     29856     17117     16590     17383
     25    101080    184327     79212     47884     46846     46855
     50    130709    309857    163614    103001     94267     94809
     75    126516    435653    248281    156586    139500    140087

compared to the e_i_c=0 case, it looks like this:

    pct       1        4         16        64       128
    ----------------------------------------------------
      1     72%      13%         9%        8%        8%
      5     68%      16%        10%       10%       10%
     10     82%      25%        14%       14%       14%
     25    182%      78%        47%       46%       46%
     50    237%     125%        79%       72%       73%
     75    344%     196%       124%      110%      111%

So for 1% of the table, e_i_c=1 is faster by ~30%, but with e_i_c=4 (or
more) it's ~10x faster. This is a fairly common pattern, not just on this
storage system.

e_i_c=1 can perform pretty poorly, especially when the query matches a
large fraction of the table - in this data set it's 2-3x slower than no
prefetching, and higher e_i_c values limit the damage quite a bit.

It's not entirely terrible, because in most cases those queries would use
a seqscan anyway (the benchmark forces bitmap heap scans), but we can't
ignore it either, because of possible selectivity underestimates.

Furthermore, there are cases with much worse behavior. For example, one
of the tests on SATA RAID behaves like this:

    pct       1        4         16        64       128
    ----------------------------------------------------
      1    147%     101%        61%       52%       55%
      5    180%     106%        71%       71%       70%
     10    208%     106%        73%       80%       79%
     25    225%     118%        84%       96%       86%
     50    234%     123%        91%      102%       95%
     75    241%     127%        94%      103%       98%

Pretty much all cases are significantly slower with e_i_c=1.

Of course, I'm sure there may be other things to consider. For example,
these tests were done in isolation, while on actual systems there will
be other queries running concurrently (and those may also generate I/O).


regards

[1] https://www.postgresql.org/message-id/flat/55AA2469.20306%40dalibo.com#dda46134fb309ae09233b1547411c029

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Increasing default value for effective_io_concurrency?

From: Andres Freund
Hi,

On 2019-06-29 22:15:19 +0200, Tomas Vondra wrote:
> I think we should consider changing the effective_io_concurrency default
> value, i.e. the guc that determines how many pages we try to prefetch in
> a couple of places (the most important being Bitmap Heap Scan).

Maybe we need to improve the way it's used / implemented instead - it
seems just too hard to determine the correct setting as currently
implemented.


> In some cases it helps a bit, but a slightly higher value (4 or 8)
> performs significantly better. Consider for example this "sequential"
> data set from the 6xSSD RAID system (x-axis shows e_i_c values, pct
> means what fraction of pages matches the query):

I assume that the y axis is the time of the query?

How much data is this compared to memory available for the kernel to do
caching?


>    pct         0         1        4         16        64       128
>    ---------------------------------------------------------------
>      1     25990     18624      3269      2219      2189      2171
>      5     88116     60242     14002      8663      8560      8726
>     10    120556     99364     29856     17117     16590     17383
>     25    101080    184327     79212     47884     46846     46855
>     50    130709    309857    163614    103001     94267     94809
>     75    126516    435653    248281    156586    139500    140087
> 
> compared to the e_i_c=0 case, it looks like this:
> 
>    pct       1        4         16        64       128
>    ----------------------------------------------------
>      1     72%      13%         9%        8%        8%
>      5     68%      16%        10%       10%       10%
>     10     82%      25%        14%       14%       14%
>     25    182%      78%        47%       46%       46%
>     50    237%     125%        79%       72%       73%
>     75    344%     196%       124%      110%      111%
> 
> So for 1% of the table, e_i_c=1 is faster by ~30%, but with e_i_c=4 (or
> more) it's ~10x faster. This is a fairly common pattern, not just on
> this storage system.
> 
> e_i_c=1 can perform pretty poorly, especially when the query matches a
> large fraction of the table - in this data set it's 2-3x slower than no
> prefetching, and higher e_i_c values limit the damage quite a bit.

I'm surprised the slowdown for small e_i_c values is that big - it's not
obvious to me why that is.  Which os / os version / filesystem / io
scheduler / io scheduler settings were used?

Greetings,

Andres Freund



Re: Increasing default value for effective_io_concurrency?

From: Tomas Vondra
On Mon, Jul 01, 2019 at 04:32:15PM -0700, Andres Freund wrote:
>Hi,
>
>On 2019-06-29 22:15:19 +0200, Tomas Vondra wrote:
>> I think we should consider changing the effective_io_concurrency default
>> value, i.e. the guc that determines how many pages we try to prefetch in
>> a couple of places (the most important being Bitmap Heap Scan).
>
>Maybe we need to improve the way it's used / implemented instead - it
>seems just too hard to determine the correct setting as currently
>implemented.
>

Sure, if we can improve those bits, that'd be nice. It's definitely hard
to decide what value is appropriate for a given storage system. But I'm
not sure it's something we can do easily, considering how opaque the
hardware is for us ...

>
>> In some cases it helps a bit, but a slightly higher value (4 or 8)
>> performs significantly better. Consider for example this "sequential"
>> data set from the 6xSSD RAID system (x-axis shows e_i_c values, pct
>> means what fraction of pages matches the query):
>
>I assume that the y axis is the time of the query?
>

The y-axis is the fraction of the table matched by the query. The values
in the contingency table are query durations (average of 3 runs, but the
numbers were very close).

>How much data is this compared to memory available for the kernel to do
>caching?
>

A multiple of RAM, in all cases. The queries were hitting random subsets
of the data, and the page cache was dropped after each test, to eliminate
cross-query caching.

>
>>    pct         0         1        4         16        64       128
>>    ---------------------------------------------------------------
>>      1     25990     18624      3269      2219      2189      2171
>>      5     88116     60242     14002      8663      8560      8726
>>     10    120556     99364     29856     17117     16590     17383
>>     25    101080    184327     79212     47884     46846     46855
>>     50    130709    309857    163614    103001     94267     94809
>>     75    126516    435653    248281    156586    139500    140087
>>
>> compared to the e_i_c=0 case, it looks like this:
>>
>>    pct       1        4         16        64       128
>>    ----------------------------------------------------
>>      1     72%      13%         9%        8%        8%
>>      5     68%      16%        10%       10%       10%
>>     10     82%      25%        14%       14%       14%
>>     25    182%      78%        47%       46%       46%
>>     50    237%     125%        79%       72%       73%
>>     75    344%     196%       124%      110%      111%
>>
>> So for 1% of the table, e_i_c=1 is faster by ~30%, but with e_i_c=4
>> (or more) it's ~10x faster. This is a fairly common pattern, not just
>> on this storage system.
>>
>> e_i_c=1 can perform pretty poorly, especially when the query matches a
>> large fraction of the table - in this data set it's 2-3x slower than
>> no prefetching, and higher e_i_c values limit the damage quite a bit.
>
>I'm surprised the slowdown for small e_i_c values is that big - it's not
>obvious to me why that is.  Which os / os version / filesystem / io
>scheduler / io scheduler settings were used?
>

This is the system with the NVMe storage and the SATA RAID:

Linux bench2 4.19.26 #1 SMP Sat Mar 2 19:50:14 CET 2019 x86_64 Intel(R)
Xeon(R) CPU E5-2620 v4 @ 2.10GHz GenuineIntel GNU/Linux

/dev/nvme0n1p1 on /mnt/data type ext4 (rw,relatime)
/dev/md0 on /mnt/raid type ext4 (rw,relatime,stripe=48)

The other system looks pretty much the same (same kernel, ext4).


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Increasing default value for effective_io_concurrency?

From: Robert Haas
On Mon, Jul 1, 2019 at 7:32 PM Andres Freund <andres@anarazel.de> wrote:
> On 2019-06-29 22:15:19 +0200, Tomas Vondra wrote:
> > I think we should consider changing the effective_io_concurrency default
> > value, i.e. the guc that determines how many pages we try to prefetch in
> > a couple of places (the most important being Bitmap Heap Scan).
>
> Maybe we need to improve the way it's used / implemented instead - it
> seems just too hard to determine the correct setting as currently
> implemented.

Perhaps the translation from effective_io_concurrency to a prefetch
distance, which is found in the slightly-misnamed ComputeIoConcurrency
function, should be changed.  The comments therein say:

         * Experimental results show that both of these formulas aren't
         * aggressive enough, but we don't really have any better proposals.
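
(For context, the translation itself just sums n/i for i = 1 .. n, so the
prefetch target for e_i_c = n is roughly n times the n-th harmonic number
- if I'm reading the function right. That's easy to eyeball in SQL:

    -- the prefetch-distance target, as ComputeIoConcurrency computes it
    SELECT sum(8.0 / i) AS prefetch_pages  -- ~21.7 pages for e_i_c = 8
      FROM generate_series(1, 8) AS i;

So the mapping is already super-linear in the GUC value; the question is
whether it's steep enough.)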

Perhaps we could test experimentally what works well with N spindles
and then fit a formula to that curve and stick it in here, so that our
tuning is based on practice rather than theory.

I'm not sure if that approach is adequate or not.  It just seems like
something to try.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Increasing default value for effective_io_concurrency?

From: Tomas Vondra
On Wed, Jul 03, 2019 at 11:04:59AM -0400, Robert Haas wrote:
>On Mon, Jul 1, 2019 at 7:32 PM Andres Freund <andres@anarazel.de> wrote:
>> On 2019-06-29 22:15:19 +0200, Tomas Vondra wrote:
>> > I think we should consider changing the effective_io_concurrency default
>> > value, i.e. the guc that determines how many pages we try to prefetch in
>> > a couple of places (the most important being Bitmap Heap Scan).
>>
>> Maybe we need to improve the way it's used / implemented instead - it
>> seems just too hard to determine the correct setting as currently
>> implemented.
>
>Perhaps the translation from effective_io_concurrency to a prefetch
>distance, which is found in the slightly-misnamed ComputeIoConcurrency
>function, should be changed.  The comments therein say:
>
>         * Experimental results show that both of these formulas aren't
>         * aggressive enough, but we don't really have any better proposals.
>
>Perhaps we could test experimentally what works well with N spindles
>and then fit a formula to that curve and stick it in here, so that our
>tuning is based on practice rather than theory.
>
>I'm not sure if that approach is adequate or not.  It just seems like
>something to try.
>

Maybe. And it would probably work for the systems I used for benchmarks. 

It however assumes two things: (a) the storage system actually has
spindles, and (b) you know how many spindles there are. That is becoming
less and less safe these days - flash storage is pretty common, and even
when there are spindles, they are often hidden behind a layer of
virtualization in a SAN, or something similar.

I wonder if we might provide something like pg_test_prefetch which would
measure performance with different values, similarly to pg_test_fsync.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Increasing default value for effective_io_concurrency?

From: Robert Haas
On Wed, Jul 3, 2019 at 11:24 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> Maybe. And it would probably work for the systems I used for benchmarks.
>
> It however assumes two things: (a) the storage system actually has
> spindles, and (b) you know how many spindles there are. That is becoming
> less and less safe these days - flash storage is pretty common, and even
> when there are spindles, they are often hidden behind a layer of
> virtualization in a SAN, or something similar.

Yeah, that's true.

> I wonder if we might provide something like pg_test_prefetch which would
> measure performance with different values, similarly to pg_test_fsync.

That's not a bad idea, but I'm not sure if the results that we got in
a synthetic test - presumably unloaded - would be a good guide to what
to use in a production situation.  Maybe it would; I'm just not sure.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Increasing default value for effective_io_concurrency?

From: Bruce Momjian
On Wed, Jul  3, 2019 at 11:42:49AM -0400, Robert Haas wrote:
> On Wed, Jul 3, 2019 at 11:24 AM Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
> > Maybe. And it would probably work for the systems I used for benchmarks.
> >
> > It however assumes two things: (a) the storage system actually has
> > spindles, and (b) you know how many spindles there are. That is
> > becoming less and less safe these days - flash storage is pretty
> > common, and even when there are spindles, they are often hidden behind
> > a layer of virtualization in a SAN, or something similar.
> 
> Yeah, that's true.
> 
> > I wonder if we might provide something like pg_test_prefetch which would
> > measure performance with different values, similarly to pg_test_fsync.
> 
> That's not a bad idea, but I'm not sure if the results that we got in
> a synthetic test - presumably unloaded - would be a good guide to what
> to use in a production situation.  Maybe it would; I'm just not sure.

I think it would be better than what we have now.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +



Re: Increasing default value for effective_io_concurrency?

From: Tomas Vondra
On Mon, Jul 08, 2019 at 08:11:55PM -0400, Bruce Momjian wrote:
>On Wed, Jul  3, 2019 at 11:42:49AM -0400, Robert Haas wrote:
>> On Wed, Jul 3, 2019 at 11:24 AM Tomas Vondra
>> <tomas.vondra@2ndquadrant.com> wrote:
>> > Maybe. And it would probably work for the systems I used for benchmarks.
>> >
>> > It however assumes two things: (a) the storage system actually has
>> > spindles, and (b) you know how many spindles there are. That is
>> > becoming less and less safe these days - flash storage is pretty
>> > common, and even when there are spindles, they are often hidden
>> > behind a layer of virtualization in a SAN, or something similar.
>>
>> Yeah, that's true.
>>
>> > I wonder if we might provide something like pg_test_prefetch which would
>> > measure performance with different values, similarly to pg_test_fsync.
>>
>> That's not a bad idea, but I'm not sure if the results that we got in
>> a synthetic test - presumably unloaded - would be a good guide to what
>> to use in a production situation.  Maybe it would; I'm just not sure.
>
>I think it would be better than what we have now.
>

TBH I don't know how useful that tool would be. AFAICS the key
assumptions prefetching relies on are that (a) issuing the prefetch
request is much cheaper than just doing the I/O, and (b) the prefetch
request can be completed before we actually need the page.
(a) is becoming less true on new hardware - if you look at the results
from the NVMe device, the improvements are much smaller compared to the
other storage systems. The speedup is ~1.6x no matter the e_i_c value,
while on the other storage types it's easily 10x in some cases.

But this is something we could measure using the new tool, because it's
mostly hardware-dependent.

But (b) is the hard bit, because it depends on how much time it takes to
process a page read from the heap - if it takes a lot of time, lower e_i_c
values are fine. If it's fast, we need to increase the prefetch distance.
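
As a back-of-the-envelope model - to fully hide the I/O latency, the
prefetch distance has to satisfy roughly

    distance >= (I/O latency) / (per-page processing time)

so with, say, ~100us per random read and ~10us of processing per page
(made-up numbers), we'd need ~10 pages in flight. A distance of 1 can
only keep up when processing a page takes longer than the I/O itself.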

Essentially, from the tests I've done it seems that fetching just 1 page
in advance is way too conservative, because (1) it does not really
increase I/O concurrency at the storage level and (2) we often get into a
situation where the prefetch is still in progress when we actually need
the page.

I don't know how to meaningfully benchmark this, though - it's way too
dependent on the particular workload / query.

Of course, backend concurrency just makes it even more complicated.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services