Thread: Re: [PERFORM] intel s3500 -- hot stuff
On 18/07/2015 12:03, Julien Rouhaud wrote:
> On 10/12/2014 17:52, Jeff Janes wrote:
>> On Tue, Dec 9, 2014 at 12:43 PM, Bruce Momjian <bruce@momjian.us> wrote:
>>
>> On Mon, Dec 8, 2014 at 03:40:43PM -0600, Merlin Moncure wrote:
>> > >> Did not see consistent measurable gains > 256
>> > >> effective_io_concurrency. Interesting that at setting of '2' (the
>> > >> lowest possible setting with the feature actually working) is
>> > >> pessimal.
>> > >
>> > > Very interesting. When we added a per-tablespace random_page_cost,
>> > > there was a suggestion that we might want to add per-tablespace
>> > > effective_io_concurrency someday:
>> >
>> > What I'd really like to see is to have effective_io_concurrency work
>> > on other types of scans. It's clearly a barn burner on fast storage
>> > and perhaps the default should be something other than '1'. Spinning
>> > storage is clearly dead and ssd seem to really benefit from the posix
>> > readhead api.
>>
>> I haven't played much with SSD, but effective_io_concurrency can be a
>> big win even on spinning disk.
>>
>> Well, the real question is knowing which blocks to request before
>> actually needing them. With a bitmap scan, that is easy --- I am
>> unclear how to do it for other scans. We already have kernel read-ahead
>> for sequential scans, and any index scan that hits multiple rows will
>> probably already be using a bitmap heap scan.
>>
>> If the index scan is used to provide ordering as well as selectivity
>> then it will resist being converted to a bitmap scan. Also it won't
>> convert to a bitmap scan solely to get credit for the use of
>> effective_io_concurrency, as that setting doesn't enter into planning
>> decisions.
>>
>> For a regular index scan, it should be easy to prefetch table blocks for
>> all the tuples that will need to be retrieved based on the current index
>> leaf page, for example. Looking ahead across leaf page boundaries would
>> be harder.
>
> I also think that having effective_io_concurrency for other nodes than
> bitmap scan would be really great, but for now having a per-tablespace
> effective_io_concurrency is simpler to implement and will already help,
> so here's a patch to implement it. I'm also adding it to the next
> commitfest.

I didn't know that the thread must exists on -hackers to be able to add
a commitfest entry, so I transfer the thread here. Sorry for the double
post.

--
Julien Rouhaud
http://dalibo.com - http://dalibo.org
Hi,

On 2015-07-18 12:17:39 +0200, Julien Rouhaud wrote:
> I didn't know that the thread must exists on -hackers to be able to add
> a commitfest entry, so I transfer the thread here.

Please, in the future, also update the title of the thread to something
fitting.

> @@ -539,6 +541,9 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
>  {
>  	BitmapHeapScanState *scanstate;
>  	Relation	currentRelation;
> +#ifdef USE_PREFETCH
> +	int			new_io_concurrency;
> +#endif
>
>  	/* check for unsupported flags */
>  	Assert(!(eflags & (EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK)));
> @@ -598,6 +603,25 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
>  	 */
>  	currentRelation = ExecOpenScanRelation(estate, node->scan.scanrelid, eflags);
>
> +#ifdef USE_PREFETCH
> +	/* check if the effective_io_concurrency has been overloaded for the
> +	 * tablespace storing the relation and compute the target_prefetch_pages,
> +	 * or just get the current target_prefetch_pages
> +	 */
> +	new_io_concurrency = get_tablespace_io_concurrency(
> +			currentRelation->rd_rel->reltablespace);
> +
> +	scanstate->target_prefetch_pages = target_prefetch_pages;
> +
> +	if (new_io_concurrency != effective_io_concurrency)
> +	{
> +		double		prefetch_pages;
> +		if (compute_io_concurrency(new_io_concurrency, &prefetch_pages))
> +			scanstate->target_prefetch_pages = rint(prefetch_pages);
> +	}
> +#endif

Maybe it's just me - but imo there should be as few USE_PREFETCH
dependant places in the code as possible. It'll just be 0 when not
supported, that's fine? Especially changing the size of externally
visible structs depending on a configure detected ifdef seems wrong to
me.

> +bool
> +compute_io_concurrency(int io_concurrency, double *target_prefetch_pages)
> +{
> +	double		new_prefetch_pages = 0.0;
> +	int			i;
> +
> +	/* make sure the io_concurrency value is correct, it may have been forced
> +	 * with a pg_tablespace UPDATE
> +	 */

Nitpick: Wrong comment style (/* stands on its own line).

> +	if (io_concurrency > MAX_IO_CONCURRENCY)
> +		io_concurrency = MAX_IO_CONCURRENCY;
> +
> +	/*----------
> +	 * The user-visible GUC parameter is the number of drives (spindles),
> +	 * which we need to translate to a number-of-pages-to-prefetch target.
> +	 * The target value is stashed in *extra and then assigned to the actual
> +	 * variable by assign_effective_io_concurrency.
> +	 *
> +	 * The expected number of prefetch pages needed to keep N drives busy is:
> +	 *
> +	 * drives |   I/O requests
> +	 * -------+----------------
> +	 *		1 |   1
> +	 *		2 |   2/1 + 2/2 = 3
> +	 *		3 |   3/1 + 3/2 + 3/3 = 5 1/2
> +	 *		4 |   4/1 + 4/2 + 4/3 + 4/4 = 8 1/3
> +	 *		n |   n * H(n)

I know you just moved this code. But: I don't buy this formula. Like at
all. Doesn't queuing and reordering entirely invalidate the logic here?

Perhaps more relevantly: Imo nodeBitmapHeapscan.c is the wrong place for
this. bufmgr.c maybe?

You also didn't touch
/*
 * How many buffers PrefetchBuffer callers should try to stay ahead of their
 * ReadBuffer calls by. This is maintained by the assign hook for
 * effective_io_concurrency. Zero means "never prefetch".
 */
int			target_prefetch_pages = 0;
which surely doesn't make sense anymore after these changes.

But do we even need that variable now?

> diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
> index dc167f9..57008fc 100644
> --- a/src/include/utils/guc.h
> +++ b/src/include/utils/guc.h
> @@ -26,6 +26,9 @@
>  #define MAX_KILOBYTES	(INT_MAX / 1024)
>  #endif
>
> +/* upper limit for effective_io_concurrency */
> +#define MAX_IO_CONCURRENCY 1000
> +
>  /*
>   * Automatic configuration file name for ALTER SYSTEM.
>   * This file will be used to store values of configuration parameters
> @@ -256,6 +259,8 @@ extern int	temp_file_limit;
>
>  extern int	num_temp_buffers;
>
> +extern int	effective_io_concurrency;
> +

target_prefetch_pages is declared in bufmgr.h - that seems like a better
place for these.

Greetings,

Andres Freund
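For readers following along, the translation described in the quoted comment boils down to a harmonic sum: for n drives the prefetch target is n * H(n) pages. A minimal sketch of that calculation (the function name and guards here are illustrative, not necessarily what the patch does):

    #include <limits.h>

    static bool
    compute_prefetch_pages(int io_concurrency, double *target_prefetch_pages)
    {
        double  pages = 0.0;
        int     i;

        /* n * H(n) = n/1 + n/2 + ... + n/n */
        for (i = 1; i <= io_concurrency; i++)
            pages += (double) io_concurrency / (double) i;

        *target_prefetch_pages = pages;

        /* reject values that would not fit the integer target */
        return (pages >= 0.0 && pages < (double) INT_MAX);
    }

For example, io_concurrency = 4 gives 4/1 + 4/2 + 4/3 + 4/4 ≈ 8.33, matching the table in the quoted comment.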
Hi

On 09/02/2015 03:53 PM, Andres Freund wrote:
>
> Hi,
>
> On 2015-07-18 12:17:39 +0200, Julien Rouhaud wrote:
>> I didn't know that the thread must exists on -hackers to be able to add
>> a commitfest entry, so I transfer the thread here.
>
> Please, in the future, also update the title of the thread to something
> fitting.
>
>> @@ -539,6 +541,9 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
>>  {
>>  	BitmapHeapScanState *scanstate;
>>  	Relation	currentRelation;
>> +#ifdef USE_PREFETCH
>> +	int			new_io_concurrency;
>> +#endif
>>
>>  	/* check for unsupported flags */
>>  	Assert(!(eflags & (EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK)));
>> @@ -598,6 +603,25 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
>>  	 */
>>  	currentRelation = ExecOpenScanRelation(estate, node->scan.scanrelid, eflags);
>>
>> +#ifdef USE_PREFETCH
>> +	/* check if the effective_io_concurrency has been overloaded for the
>> +	 * tablespace storing the relation and compute the target_prefetch_pages,
>> +	 * or just get the current target_prefetch_pages
>> +	 */
>> +	new_io_concurrency = get_tablespace_io_concurrency(
>> +			currentRelation->rd_rel->reltablespace);
>> +
>> +	scanstate->target_prefetch_pages = target_prefetch_pages;
>> +
>> +	if (new_io_concurrency != effective_io_concurrency)
>> +	{
>> +		double		prefetch_pages;
>> +		if (compute_io_concurrency(new_io_concurrency, &prefetch_pages))
>> +			scanstate->target_prefetch_pages = rint(prefetch_pages);
>> +	}
>> +#endif
>
> Maybe it's just me - but imo there should be as few USE_PREFETCH
> dependant places in the code as possible. It'll just be 0 when not
> supported, that's fine?

It's not just you. Dealing with code with plenty of ifdefs is annoying -
it compiles just fine most of the time, until you compile it with some
rare configuration. Then it either starts producing strange warnings or
the compilation fails entirely.

It might make a tiny difference on builds without prefetching support
because of code size, but realistically I think it's noise.

> Especially changing the size of externally visible structs depending
> on a configure detected ifdef seems wrong to me.

+100 to that

>> +	/*----------
>> +	 * The user-visible GUC parameter is the number of drives (spindles),
>> +	 * which we need to translate to a number-of-pages-to-prefetch target.
>> +	 * The target value is stashed in *extra and then assigned to the actual
>> +	 * variable by assign_effective_io_concurrency.
>> +	 *
>> +	 * The expected number of prefetch pages needed to keep N drives busy is:
>> +	 *
>> +	 * drives |   I/O requests
>> +	 * -------+----------------
>> +	 *		1 |   1
>> +	 *		2 |   2/1 + 2/2 = 3
>> +	 *		3 |   3/1 + 3/2 + 3/3 = 5 1/2
>> +	 *		4 |   4/1 + 4/2 + 4/3 + 4/4 = 8 1/3
>> +	 *		n |   n * H(n)
>
> I know you just moved this code. But: I don't buy this formula. Like at
> all. Doesn't queuing and reordering entirely invalidate the logic here?

Well, even the comment right next after the formula says that:

 * Experimental results show that both of these formulas aren't
 * aggressive enough, but we don't really have any better proposals.

That's the reason why users generally either use 0 or some rather high
value (16 or 32 are the most common values seen). The problem is that we
don't really care about the number of spindles (and not just because SSDs
don't have them at all), but about the target queue length per device.
Spinning rust uses TCQ/NCQ to optimize the head movement, SSDs are
parallel by nature (stacking multiple chips with separate channels).

I doubt we can really improve the formula, except maybe for saying "we
want 16 requests per device" and multiplying the number by that. We don't
really have the necessary introspection to determine better values (and
it's not really possible, because the devices may be hidden behind a RAID
controller or a SAN). So we can't really do much.

Maybe the best thing we can do is just completely abandon the "number of
spindles" idea, and just say "number of I/O requests to prefetch".
Possibly with an explanation of how to estimate it (devices * queue
length).

regards

--
Tomas Vondra                   http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2015-09-02 18:06:54 +0200, Tomas Vondra wrote: > Maybe the best thing we can do is just completely abandon the "number of > spindles" idea, and just say "number of I/O requests to prefetch". Possibly > with an explanation of how to estimate it (devices * queue length). I think that'd be a lot better.
On 2 Sep 2015 14:54, "Andres Freund" <andres@anarazel.de> wrote:
>
> > +	/*----------
> > +	 * The user-visible GUC parameter is the number of drives (spindles),
> > +	 * which we need to translate to a number-of-pages-to-prefetch target.
> > +	 * The target value is stashed in *extra and then assigned to the actual
> > +	 * variable by assign_effective_io_concurrency.
> > +	 *
> > +	 * The expected number of prefetch pages needed to keep N drives busy is:
> > +	 *
> > +	 * drives |   I/O requests
> > +	 * -------+----------------
> > +	 *		1 |   1
> > +	 *		2 |   2/1 + 2/2 = 3
> > +	 *		3 |   3/1 + 3/2 + 3/3 = 5 1/2
> > +	 *		4 |   4/1 + 4/2 + 4/3 + 4/4 = 8 1/3
> > +	 *		n |   n * H(n)
>
> I know you just moved this code. But: I don't buy this formula. Like at
> all. Doesn't queuing and reordering entirely invalidate the logic here?

I can take the blame for this formula.

It's called the "Coupon Collector Problem". If you get a random coupon
from a set of n possible coupons, how many random coupons would you have
to collect before you expect to have at least one of each?

This computation model assumes we have no information about which spindle
each block will hit. That's basically true for the case of bitmapheapscan
for most cases because the idea of bitmapheapscan is to be picking a
sparse set of blocks and there's no reason the blocks being read will
have any regularity that causes them all to fall on the same spindles. If
in fact you're reading a fairly dense set then bitmapheapscan probably is
a waste of time and simply reading sequentially would be exactly as fast
or even faster.

We talked about this quite a bit back then and there was no dispute that
the aim is to provide GUCs that mean something meaningful to the DBA who
can actually measure them. They know how many spindles they have. They do
not know what the optimal prefetch depth is and the only way to determine
it would be to experiment with Postgres. Worse, I think the above formula
works for essentially random I/O but for more predictable I/O it might be
possible to use a different formula. But if we made the GUC something low
level like "how many blocks to prefetch" then we're left in the dark
about how to handle that different access pattern.

I did speak to a dm developer and he suggested that the kernel could help
out with an API. He suggested something of the form "how many blocks do I
have to read before the end of the current device". I wasn't sure exactly
what we would do with something like that but it would be better than
just guessing how many I/O operations we need to issue to keep all the
spindles busy.
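The n * H(n) figure is the standard coupon-collector expectation; a quick simulation makes it concrete. This is purely illustrative code, not part of the patch:

    #include <stdio.h>
    #include <stdlib.h>

    int
    main(void)
    {
        const int   n = 4;              /* number of "spindles" */
        const int   trials = 100000;
        long        total_draws = 0;

        for (int t = 0; t < trials; t++)
        {
            int     seen[4] = {0};
            int     distinct = 0;

            /* draw random spindles until each one has been hit at least once */
            while (distinct < n)
            {
                int     s = rand() % n;

                if (!seen[s])
                {
                    seen[s] = 1;
                    distinct++;
                }
                total_draws++;
            }
        }

        /* expect roughly 4 * (1 + 1/2 + 1/3 + 1/4) = 8.33 */
        printf("average draws: %.2f\n", (double) total_draws / trials);
        return 0;
    }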
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Hi, On 02/09/2015 18:06, Tomas Vondra wrote: > Hi > > On 09/02/2015 03:53 PM, Andres Freund wrote: >> >> Hi, >> >> On 2015-07-18 12:17:39 +0200, Julien Rouhaud wrote: >>> I didn't know that the thread must exists on -hackers to be >>> able to add a commitfest entry, so I transfer the thread here. >> >> Please, in the future, also update the title of the thread to >> something fitting. >> Sorry for that. >>> @@ -539,6 +541,9 @@ ExecInitBitmapHeapScan(BitmapHeapScan >>> *node, EState *estate, int eflags) { BitmapHeapScanState >>> *scanstate; Relation currentRelation; +#ifdef USE_PREFETCH + >>> int new_io_concurrency; +#endif >>> >>> /* check for unsupported flags */ Assert(!(eflags & >>> (EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK))); @@ -598,6 +603,25 @@ >>> ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, >>> int eflags) */ currentRelation = ExecOpenScanRelation(estate, >>> node->scan.scanrelid, eflags); >>> >>> +#ifdef USE_PREFETCH + /* check if the >>> effective_io_concurrency has been overloaded for the + * >>> tablespace storing the relation and compute the >>> target_prefetch_pages, + * or just get the current >>> target_prefetch_pages + */ + new_io_concurrency = >>> get_tablespace_io_concurrency( + >>> currentRelation->rd_rel->reltablespace); + + + >>> scanstate->target_prefetch_pages = target_prefetch_pages; + + >>> if (new_io_concurrency != effective_io_concurrency) + { + >>> double prefetch_pages; + if >>> (compute_io_concurrency(new_io_concurrency, &prefetch_pages)) + >>> scanstate->target_prefetch_pages = rint(prefetch_pages); + >>> } +#endif >> >> Maybe it's just me - but imo there should be as few USE_PREFETCH >> dependant places in the code as possible. It'll just be 0 when >> not supported, that's fine? > > It's not just you. Dealing with code with plenty of ifdefs is > annoying - it compiles just fine most of the time, until you > compile it with some rare configuration. Then it either starts > producing strange warnings or the compilation fails entirely. > > It might make a tiny difference on builds without prefetching > support because of code size, but realistically I think it's > noise. > >> Especially changing the size of externally visible structs >> depending ona configure detected ifdef seems wrong to me. > > +100 to that > I totally agree. I'll remove the ifdefs. >> Nitpick: Wrong comment style (/* stands on its own line). I did run pgindent before submitting patch, but apparently I picked the wrong one. Already fixed in local branch. >>> + /*---------- + * The user-visible GUC parameter is the >>> number of drives (spindles), + * which we need to translate >>> to a number-of-pages-to-prefetch target. + * The target >>> value is stashed in *extra and then assigned to the actual + >>> * variable by assign_effective_io_concurrency. + * + * >>> The expected number of prefetch pages needed to keep N drives >>> busy is: + * + * drives | I/O requests + * >>> -------+---------------- + * 1 | 1 + * >>> 2 | 2/1 + 2/2 = 3 + * 3 | 3/1 + 3/2 + 3/3 = 5 >>> 1/2 + * 4 | 4/1 + 4/2 + 4/3 + 4/4 = 8 1/3 + * >>> n | n * H(n) >> >> I know you just moved this code. But: I don't buy this formula. >> Like at all. Doesn't queuing and reordering entirely invalidate >> the logic here? > > Well, even the comment right next after the formula says that: > > * Experimental results show that both of these formulas aren't * > aggressiveenough, but we don't really have any better proposals. 
> > That's the reason why users generally either use 0 or some rather > high value (16 or 32 are the most common values see). The problem > is that we don't really care about the number of spindles (and not > just because SSDs don't have them at all), but about the target > queue length per device. Spinning rust uses TCQ/NCQ to optimize the > head movement, SSDs are parallel by nature (stacking multiple chips > with separate channels). > > I doubt we can really improve the formula, except maybe for saying > "we want 16 requests per device" and multiplying the number by > that. We don't really have the necessary introspection to determine > better values (and it's not really possible, because the devices > may be hidden behind a RAID controller (or a SAN). So we can't > really do much. > > Maybe the best thing we can do is just completely abandon the > "number of spindles" idea, and just say "number of I/O requests to > prefetch". Possibly with an explanation of how to estimate it > (devices * queue length). > >> I think that'd be a lot better. +1 for that too. If everone's ok with this change, I can submit a patch for that too. Should I split that into two patches, and/or start a new thread? - -- Julien Rouhaud http://dalibo.com - http://dalibo.org -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.17 (GNU/Linux) iQEcBAEBAgAGBQJV50opAAoJELGaJ8vfEpOqve4H/0ZJCoFb0wHtArkGye6ietks 9uahdJy5ixO4J+AZsf2mVxV/DZK7dhK8rWIXt6yS3kfYfPDB79cRFWU5EgjEGAHB qcB7wXCa5HibqLySgQct3zhVDj3CN3ucKT3LVp9OC9mrH2mnGtAp7PYkjd/HqBwd AzW05pu21Ivi/z2gUBOdxNEEDxCLu8T1OtYq3WY9l7Yc4HfLG3huYLQD2LoRFRFn lWwhQifML6uKzz7w3MfZrK4i2ffGGv/r1afHcpZvN3UsB5te1fSzr8KcUeJL7+Zg xJTKwppiEMHpxokn5lw4LzYkjYD7W1fvwLnJhzRrs7dXGPl6rZtLmasyCld4FVk= =r2jE -----END PGP SIGNATURE-----
On 02/09/2015 15:53, Andres Freund wrote:
> On 2015-07-18 12:17:39 +0200, Julien Rouhaud wrote:
>
> You also didn't touch
> /*
>  * How many buffers PrefetchBuffer callers should try to stay ahead of their
>  * ReadBuffer calls by. This is maintained by the assign hook for
>  * effective_io_concurrency. Zero means "never prefetch".
>  */
> int			target_prefetch_pages = 0;
> which surely doesn't make sense anymore after these changes.
>
> But do we even need that variable now?
>

I thought this was related to the effective_io_concurrency GUC (possibly
overloaded by the per-tablespace setting), so I didn't make any change on
that.

I also just found an issue with my previous patch: the global
effective_io_concurrency GUC was ignored if the tablespace had a specific
seq_page_cost or random_page_cost setting. I just fixed that in my local
branch.

>> diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
>> index dc167f9..57008fc 100644
>> --- a/src/include/utils/guc.h
>> +++ b/src/include/utils/guc.h
>> @@ -26,6 +26,9 @@
>>  #define MAX_KILOBYTES	(INT_MAX / 1024)
>>  #endif
>>
>> +/* upper limit for effective_io_concurrency */
>> +#define MAX_IO_CONCURRENCY 1000
>> +
>>  /*
>>   * Automatic configuration file name for ALTER SYSTEM.
>>   * This file will be used to store values of configuration parameters
>> @@ -256,6 +259,8 @@ extern int	temp_file_limit;
>>
>>  extern int	num_temp_buffers;
>>
>> +extern int	effective_io_concurrency;
>> +
>
> target_prefetch_pages is declared in bufmgr.h - that seems like a
> better place for these.
>

I was rather sceptical about that too. I'll move these to bufmgr.h.

Regards.

--
Julien Rouhaud
http://dalibo.com - http://dalibo.org
Hi,

On 09/02/2015 08:49 PM, Greg Stark wrote:
> On 2 Sep 2015 14:54, "Andres Freund" <andres@anarazel.de> wrote:
>>
>>> +	/*----------
>>> +	 * The user-visible GUC parameter is the number of drives (spindles),
>>> +	 * which we need to translate to a number-of-pages-to-prefetch target.
>>> +	 * The target value is stashed in *extra and then assigned to the actual
>>> +	 * variable by assign_effective_io_concurrency.
>>> +	 *
>>> +	 * The expected number of prefetch pages needed to keep N drives busy is:
>>> +	 *
>>> +	 * drives |   I/O requests
>>> +	 * -------+----------------
>>> +	 *		1 |   1
>>> +	 *		2 |   2/1 + 2/2 = 3
>>> +	 *		3 |   3/1 + 3/2 + 3/3 = 5 1/2
>>> +	 *		4 |   4/1 + 4/2 + 4/3 + 4/4 = 8 1/3
>>> +	 *		n |   n * H(n)
>>
>> I know you just moved this code. But: I don't buy this formula. Like at
>> all. Doesn't queuing and reordering entirely invalidate the logic here?
>
> I can take the blame for this formula.
>
> It's called the "Coupon Collector Problem". If you get a random
> coupon from a set of n possible coupons, how many random coupons would
> you have to collect before you expect to have at least one of each.
>
> This computation model assumes we have no information about which
> spindle each block will hit. That's basically true for the case of
> bitmapheapscan for most cases because the idea of bitmapheapscan is to
> be picking a sparse set of blocks and there's no reason the blocks
> being read will have any regularity that causes them all to fall on
> the same spindles. If in fact you're reading a fairly dense set then
> bitmapheapscan probably is a waste of time and simply reading
> sequentially would be exactly as fast or even faster.

There are different meanings of "busy". If I get the coupon collector
problem right (after quickly skimming the wikipedia article today), it
effectively makes sure that each "spindle" has at least 1 request in the
queue. Which sucks in practice, because on spinning rust it makes queuing
(TCQ/NCQ) totally inefficient, and on SSDs it only saturates one of the
multiple channels.

On spinning drives, it's usually good to keep the iodepth>=4. For example
this 10k Seagate drive [1] can do ~450 random IOPS with iodepth=16, while
a 10k drive should be able to do ~150 IOPS (with iodepth=1). The other
SAS drives behave quite similarly.

[1] http://www.storagereview.com/seagate_enterprise_performance_10k_hdd_savvio_10k6_review

On SSDs the good values usually start at 16, depending on the model (and
controller), and size (large SSDs are basically multiple small ones glued
together, thus have more channels).

This is why the numbers from coupon collector are way too low in many
cases. (OTOH this is done per backend, so if there are multiple backends
doing prefetching ...)

> We talked about this quite a bit back then and there was no dispute
> that the aim is to provide GUCs that mean something meaningful to the
> DBA who can actually measure them. They know how many spindles they
> have. They do not know what the optimal prefetch depth is and the only
> way to determine it would be to experiment with Postgres. Worse, I

As I explained, spindles have very little to do with it - you need
multiple I/O requests per device, to get the benefit. Sure, the DBAs
should know how many spindles they have and should be able to determine
optimal IO depth. But we actually say this in the docs:

   A good starting point for this setting is the number of separate
   drives comprising a RAID 0 stripe or RAID 1 mirror being used for the
   database. (For RAID 5 the parity drive should not be counted.)
   However, if the database is often busy with multiple queries issued in
   concurrent sessions, lower values may be sufficient to keep the disk
   array busy. A value higher than needed to keep the disks busy will
   only result in extra CPU overhead.

So we recommend number of drives as a good starting value, and then warn
against increasing the value further.

Moreover, ISTM it's very unclear what value to use even if you know the
number of devices and optimal iodepth. Setting (devices * iodepth)
doesn't really make much sense, because that effectively computes

   (devices*iodepth) * H(devices*iodepth)

which says "there are (devices*iodepth) devices, make sure there's at
least one request for each of them", right? I guess we actually want

   (devices*iodepth) * H(devices)

Sadly that means we'd have to introduce another GUC, because we need to
track both ndevices and iodepth. There probably is a value X so that

   X * H(X) ~= (devices*iodepth) * H(devices)

but it's far from clear that's what we need (it surely is not in the
docs).

> think the above formula works for essentially random I/O but for
> more predictable I/O it might be possible to use a different formula.
> But if we made the GUC something low level like "how many blocks to
> prefetch" then we're left in the dark about how to handle that
> different access pattern.

Maybe. We only use this in Bitmap Index Scan at this point, and I don't
see any proposals to introduce this to other places. So no opinion.

> I did speak to a dm developer and he suggested that the kernel could
> help out with an API. He suggested something of the form "how many
> blocks do I have to read before the end of the current device". I
> wasn't sure exactly what we would do with something like that but it
> would be better than just guessing how many I/O operations we need
> to issue to keep all the spindles busy.

I don't really see how that would help us?

regards

--
Tomas Vondra                   http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
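To illustrate that last point: if the target really is (devices * iodepth) * H(devices) prefetched pages, one can solve numerically for the single effective_io_concurrency value X whose X * H(X) comes closest. A hypothetical helper, just to show the arithmetic (not proposed for the tree):

    static double
    harmonic(int n)
    {
        double  h = 0.0;
        int     i;

        for (i = 1; i <= n; i++)
            h += 1.0 / i;
        return h;
    }

    static int
    equivalent_io_concurrency(int devices, int iodepth)
    {
        double  wanted = devices * iodepth * harmonic(devices);
        int     x;

        /* smallest X with X * H(X) >= wanted, capped at the current GUC maximum */
        for (x = 1; x < 1000; x++)
            if (x * harmonic(x) >= wanted)
                break;
        return x;
    }

With devices = 4 and iodepth = 16 this wants roughly 133 prefetched pages, which corresponds to an effective_io_concurrency in the low thirties - far above what the current "number of spindles" wording would lead a DBA to set.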
On 09/02/2015 02:25 PM, Tomas Vondra wrote: > > As I explained, spindles have very little to do with it - you need > multiple I/O requests per device, to get the benefit. Sure, the DBAs > should know how many spindles they have and should be able to determine > optimal IO depth. But we actually say this in the docs: My experience with performance tuning is that values above 3 have no real effect on how queries are executed. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Wed, Sep 2, 2015 at 4:31 PM, Josh Berkus <josh@agliodbs.com> wrote: > On 09/02/2015 02:25 PM, Tomas Vondra wrote: >> >> As I explained, spindles have very little to do with it - you need >> multiple I/O requests per device, to get the benefit. Sure, the DBAs >> should know how many spindles they have and should be able to determine >> optimal IO depth. But we actually say this in the docs: > > My experience with performance tuning is that values above 3 have no > real effect on how queries are executed. That's the exact opposite of my findings on intel S3500 (see: http://www.postgresql.org/message-id/CAHyXU0yiVvfQAnR9cyH=HWh1WbLRsioe=mzRJTHwtr=2azsTdQ@mail.gmail.com). merlin
On Wed, Sep 2, 2015 at 2:31 PM, Josh Berkus <josh@agliodbs.com> wrote:
On 09/02/2015 02:25 PM, Tomas Vondra wrote:
>
> As I explained, spindles have very little to do with it - you need
> multiple I/O requests per device, to get the benefit. Sure, the DBAs
> should know how many spindles they have and should be able to determine
> optimal IO depth. But we actually say this in the docs:
My experience with performance tuning is that values above 3 have no
real effect on how queries are executed.
Perhaps one reason is that the planner assumes it will get no benefit from this setting, meaning it is somewhat unlikely to choose the types of plans which would actually show a benefit from higher values.
Cheers,
Jeff
On 2015-09-02 14:31:35 -0700, Josh Berkus wrote:
> On 09/02/2015 02:25 PM, Tomas Vondra wrote:
> >
> > As I explained, spindles have very little to do with it - you need
> > multiple I/O requests per device, to get the benefit. Sure, the DBAs
> > should know how many spindles they have and should be able to determine
> > optimal IO depth. But we actually say this in the docs:
>
> My experience with performance tuning is that values above 3 have no
> real effect on how queries are executed.

I saw pretty much the opposite - the benefits seldomly were significant
below 30 or so. Even on single disks.

Which actually isn't that surprising - to be actually beneficial (that
is, turn an IO into a CPU bound workload) the prefetched buffer needs to
actually have been read in by the time it's needed. In many queries
processing a single heap page takes far shorter than prefetching the
data from storage, even if it's on good SSDs.

Therefore what you actually need is a queue of prefetches for the next
XX buffers so that between starting a prefetch and actually needing the
buffer enough time has passed that the data is completely read in. And
the point is that that's the case even for a single rotating disk!

Greetings,

Andres Freund
On 2015-09-02 19:49:13 +0100, Greg Stark wrote: > I can take the blame for this formula. > > It's called the "Coupon Collector Problem". If you hit get a random > coupon from a set of n possible coupons, how many random coupons would > you have to collect before you expect to have at least one of each. My point is that that's just the entirely wrong way to model prefetching. Prefetching can be massively beneficial even if you only have a single platter! Even if there were no queues on the hardware or OS level! Concurrency isn't the right way to look at prefetching. You need to prefetch so far ahead that you'll never block on reading heap pages - and that's only the case if processing the next N heap blocks takes longer than the prefetch of the N+1 th page. That doesn't mean there continously have to be N+1 prefetches in progress - in fact that actually often will only be the case for the first few, after that you hopefully are bottlnecked on CPU. If you additionally take into account hardware realities where you have multiple platters, multiple spindles, command queueing etc, that's even more true. A single rotation of a single platter with command queuing can often read several non consecutive blocks if they're on a similar - Andres
On 09/03/2015 12:23 AM, Andres Freund wrote: > On 2015-09-02 14:31:35 -0700, Josh Berkus wrote: >> On 09/02/2015 02:25 PM, Tomas Vondra wrote: >>> >>> As I explained, spindles have very little to do with it - you need >>> multiple I/O requests per device, to get the benefit. Sure, the DBAs >>> should know how many spindles they have and should be able to determine >>> optimal IO depth. But we actually say this in the docs: >> >> My experience with performance tuning is that values above 3 have no >> real effect on how queries are executed. > > I saw pretty much the opposite - the benefits seldomly were > significant below 30 or so. Even on single disks. That's a bit surprising, especially considering that e_i_c=30 means ~100 pages to prefetch if I'm doing the math right. AFAIK queue depth for SATA drives generally is 32 (so prefetching 100 pages should not make a difference), 256 for SAS drives and ~1000 for most current RAID controllers. I'm not entirely surprised that values beyond 30 make a difference, but that you seldomly saw significant improvements below this value. No doubt there are workloads like that, but I'd expect them to be quite rare and not prevalent as you're claiming. > Which actually isn't that surprising - to be actually beneficial > (that is, turn an IO into a CPU bound workload) the prefetched buffer > needs to actually have been read in by the time its needed. In many > queries processing a single heap page takes far shorter than > prefetching the data from storage, even if it's on good SSDs.> > Therefore what you actually need is a queue of prefetches for the > next XX buffers so that between starting a prefetch and actually > needing the buffer ienough time has passed that the data is > completely read in. And the point is that that's the case even for a > single rotating disk! So instead of "How many blocks I need to prefetch to saturate the devices?" you're asking "How many blocks I need to prefetch to never actually wait for the I/O?" I do like this view, but I'm not really sure how could we determine the right value? It seems to be very dependent on hardware and workload. For spinning drives the speedup comes from optimizing random seeks to a more optimal path (thanks to NCQ/TCQ), and on SSDs thanks to using the parallel channels (and possibly faster access to the same block). I guess the best thing we could do at this level is simply keep the on-device queues fully saturated, no? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2015-09-03 01:59:13 +0200, Tomas Vondra wrote: > That's a bit surprising, especially considering that e_i_c=30 means ~100 > pages to prefetch if I'm doing the math right. > > AFAIK queue depth for SATA drives generally is 32 (so prefetching 100 pages > should not make a difference), 256 for SAS drives and ~1000 for most current > RAID controllers. I think the point is that after an initial buildup - we'll again only prefetch pages in smaller increments because we already prefetched most pages. The actual number of concurrent reads will then be determined by how fast pages are processed. A prefetch_target of a 100 does *not* mean 100 requests are continously in flight. And even if it would, the OS won't issue many more requests than the queue depth, so they'll just sit in the OS queue. > So instead of "How many blocks I need to prefetch to saturate the devices?" > you're asking "How many blocks I need to prefetch to never actually wait for > the I/O?" Yes, pretty much. > I do like this view, but I'm not really sure how could we determine the > right value? It seems to be very dependent on hardware and workload. Right. > For spinning drives the speedup comes from optimizing random seeks to a more > optimal path (thanks to NCQ/TCQ), and on SSDs thanks to using the parallel > channels (and possibly faster access to the same block). + OS reordering & coalescing. Don't forget that the OS processes the OS IO queues while the userland process isn't scheduled - in a concurrent workload with more processes than hardware threads that means that pending OS requests are being issued while the query isn't actively being processed. > I guess the best thing we could do at this level is simply keep the > on-device queues fully saturated, no? Well, being too aggressive can hurt throughput and latency of concurrent processes, without being beneficial. Andres Freund
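For context on the "initial buildup" mentioned above: the bitmap heap scan executor already ramps its prefetch distance up gradually rather than issuing target_prefetch_pages requests at once. Roughly, paraphrasing the logic in nodeBitmapHeapscan.c from memory (details may differ from the actual tree):

    /* grow the prefetch distance gradually, up to the configured target */
    if (node->prefetch_target >= target_prefetch_pages)
        ;                           /* don't increase any further */
    else if (node->prefetch_target >= target_prefetch_pages / 2)
        node->prefetch_target = target_prefetch_pages;
    else if (node->prefetch_target > 0)
        node->prefetch_target *= 2;
    else
        node->prefetch_target++;

So in steady state the number of reads actually in flight is driven by how quickly the scan consumes pages, not directly by the target itself.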
That doesn't match any of the empirical tests I did at the time. I posted
graphs of the throughput for varying numbers of spindles with varying
amounts of prefetch. In every case more prefetching increases throughput
up to N times the single platter throughput, where N was the number of
spindles.

There can be a small speedup from overlapping CPU with I/O but that's a
really small effect. At most that can be a single request, and only very
rarely would the amount of CPU time be even a moderate fraction of the
I/O latency. The only case where they're comparable is when you're
reading sequentially, and then hopefully you wouldn't be using Postgres
prefetching at all, which is only really intended to help random I/O.

On 2 Sep 2015 23:38, "Andres Freund" <andres@anarazel.de> wrote:
> On 2015-09-02 19:49:13 +0100, Greg Stark wrote:
> > I can take the blame for this formula.
> >
> > It's called the "Coupon Collector Problem". If you get a random
> > coupon from a set of n possible coupons, how many random coupons would
> > you have to collect before you expect to have at least one of each.
>
> My point is that that's just the entirely wrong way to model
> prefetching. Prefetching can be massively beneficial even if you only
> have a single platter! Even if there were no queues on the hardware or
> OS level! Concurrency isn't the right way to look at prefetching.
>
> You need to prefetch so far ahead that you'll never block on reading
> heap pages - and that's only the case if processing the next N heap
> blocks takes longer than the prefetch of the N+1 th page. That doesn't
> mean there continously have to be N+1 prefetches in progress - in fact
> that actually often will only be the case for the first few, after that
> you hopefully are bottlnecked on CPU.
>
> If you additionally take into account hardware realities where you have
> multiple platters, multiple spindles, command queueing etc, that's even
> more true. A single rotation of a single platter with command queuing
> can often read several non consecutive blocks if they're on a similar
>
> - Andres
On Wed, Sep 2, 2015 at 5:38 PM, Andres Freund <andres@anarazel.de> wrote: > If you additionally take into account hardware realities where you have > multiple platters, multiple spindles, command queueing etc, that's even > more true. A single rotation of a single platter with command queuing > can often read several non consecutive blocks if they're on a similar Yeah. And in the case of solid state disks, it's really a much more simple case of, "synchronously reading from the disk block by block does not fully utilize the drive because of various introduced latencies". I find this talk of platters and spindles to be somewhat baroque; for a 200$ part I have to work pretty hard to max out the drive when reading and I'm still not completely sure if it's the drive itself, postgres, cpu, or sata interface bottlenecking me. This will require a rethink of e_i_o configuration; in the old days there were physical limitations of the drives that were in the way regardless of the software stack but we are in a new era, I think. I'm convinced prefetching works and we're going to want to aggressively prefetch anything and everything possible. SSD controllers (at least the intel ones) are very smart. merlin
On Thu, Sep 3, 2015 at 2:13 PM, Merlin Moncure <mmoncure@gmail.com> wrote: > I find this talk of platters and spindles to be somewhat > baroque; for a 200$ part I have to work pretty hard to max out the > drive when reading and I'm still not completely sure if it's the drive > itself, postgres, cpu, or sata interface bottlenecking me. This will > require a rethink of e_i_o configuration; in the old days there were > physical limitations of the drives that were in the way regardless of > the software stack but we are in a new era, I think. I'm convinced > prefetching works and we're going to want to aggressively prefetch > anything and everything possible. SSD controllers (at least the intel > ones) are very smart. Wouldn't SSDs need much *less* aggressive prefetching? There's still latency and there are multiple I/O channels so they will still need some. But spinning media gives latencies measured in milliseconds. You can process a lot of tuples in milliseconds. If you have a hundred spindles you want them all busy doing seeks because in the 5ms it takes them to do that you can proess all the results on a single cpu and the rest of time is spend waiting. When your media has latency on the order of microseconds then you only need to have a small handful of I/O requests in flight to keep your processor busy. -- greg
On Fri, Sep 4, 2015 at 05:21:38PM +0100, Greg Stark wrote: > Wouldn't SSDs need much *less* aggressive prefetching? There's still > latency and there are multiple I/O channels so they will still need > some. But spinning media gives latencies measured in milliseconds. You > can process a lot of tuples in milliseconds. If you have a hundred > spindles you want them all busy doing seeks because in the 5ms it > takes them to do that you can proess all the results on a single cpu > and the rest of time is spend waiting. > > When your media has latency on the order of microseconds then you only > need to have a small handful of I/O requests in flight to keep your > processor busy. Well, there is still the processing time of getting that data ready. All I know is that people have reported that prefetching is even more useful for SSDs. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
On 2015-09-04 17:21:38 +0100, Greg Stark wrote: > Wouldn't SSDs need much *less* aggressive prefetching? There's still > latency and there are multiple I/O channels so they will still need > some. But spinning media gives latencies measured in milliseconds. You > can process a lot of tuples in milliseconds. If you have a hundred > spindles you want them all busy doing seeks because in the 5ms it > takes them to do that you can proess all the results on a single cpu > and the rest of time is spend waiting. > > When your media has latency on the order of microseconds then you only > need to have a small handful of I/O requests in flight to keep your > processor busy. Most(?) SSDs have latencies between 0.1 and 0.4 ms for individual random reads, often significantly more when a lot of IO is going on. In that time you can still process a good number of pages on the CPU level. Sure, there's few workloads where you need dozens of SSDs to parallelize random reads like in the rotating media days. But that doesn't mean you don't need prefetching? Greetings, Andres Freund
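Putting rough numbers on that argument: the prefetch distance needed to avoid blocking is roughly the device read latency divided by the per-page processing time. The constants below are assumptions for illustration, not measurements:

    #include <math.h>

    static int
    needed_prefetch_distance(double read_latency_us, double cpu_per_page_us)
    {
        /* pages that must already be in flight when we start consuming them */
        return (int) ceil(read_latency_us / cpu_per_page_us);
    }

    /* e.g. 200us of SSD random-read latency and 10us of CPU per heap page
     * suggests keeping ~20 pages in flight; a 5ms seek on spinning rust
     * suggests hundreds. */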
On Fri, Sep 4, 2015 at 11:21 AM, Greg Stark <stark@mit.edu> wrote: > On Thu, Sep 3, 2015 at 2:13 PM, Merlin Moncure <mmoncure@gmail.com> wrote: >> I find this talk of platters and spindles to be somewhat >> baroque; for a 200$ part I have to work pretty hard to max out the >> drive when reading and I'm still not completely sure if it's the drive >> itself, postgres, cpu, or sata interface bottlenecking me. This will >> require a rethink of e_i_o configuration; in the old days there were >> physical limitations of the drives that were in the way regardless of >> the software stack but we are in a new era, I think. I'm convinced >> prefetching works and we're going to want to aggressively prefetch >> anything and everything possible. SSD controllers (at least the intel >> ones) are very smart. > > > Wouldn't SSDs need much *less* aggressive prefetching? There's still > latency and there are multiple I/O channels so they will still need > some. But spinning media gives latencies measured in milliseconds. You > can process a lot of tuples in milliseconds. If you have a hundred > spindles you want them all busy doing seeks because in the 5ms it > takes them to do that you can proess all the results on a single cpu > and the rest of time is spend waiting. > > When your media has latency on the order of microseconds then you only > need to have a small handful of I/O requests in flight to keep your > processor busy. I'm not sure I agree with that. 100 micosecond latency is still a pretty big deal when memory latency is measured in nanoseconds. This is why we have pcie storage devices. (if anyone has a fast pice flash device I'd be interested in seing some e_i_o tests...I bet they'd still show some impact but it'd be less). For spinning media, any random i/o workload on large datasets is a freight train to 100% iowait. As each device has to process each data set synchronously minus what small gains can be eeked out by reordering and combining writes. This drives data transfer rates under a megabyte per second. Flash devices OTOH are essentially giant raid 0 devices of electronic media underneath the flash controller. Each device can functionally do many read operations in parallel so it makes perfect sense that they perform better when given multiple overlapping requests; unlike spinning media they can actually make use of that information and those assumptions are well supported by measurement. Flash is getting better all the time and storage technologies that remove its one weakness, random writes, appear to be right around the corner. The age of storage being a bottleneck for database systems without massive high cost engineering and optimization has come to an end. This is good for software developers and especially good for databases because most of the perceived mythology of databases being 'slow' are in fact storage issues. With that problem gone focus will return to clean data processing models which is where database systems excel with their many decades of refinement. merlin
Hi, Please find attached a v2 of the patch. See below for changes. On 02/09/2015 15:53, Andres Freund wrote: > > Hi, > > On 2015-07-18 12:17:39 +0200, Julien Rouhaud wrote: >> I didn't know that the thread must exists on -hackers to be able to add >> a commitfest entry, so I transfer the thread here. > > Please, in the future, also update the title of the thread to something > fitting. > >> @@ -539,6 +541,9 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags) >> { >> BitmapHeapScanState *scanstate; >> Relation currentRelation; >> +#ifdef USE_PREFETCH >> + int new_io_concurrency; >> +#endif >> >> /* check for unsupported flags */ >> Assert(!(eflags & (EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK))); >> @@ -598,6 +603,25 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags) >> */ >> currentRelation = ExecOpenScanRelation(estate, node->scan.scanrelid, eflags); >> >> +#ifdef USE_PREFETCH >> + /* check if the effective_io_concurrency has been overloaded for the >> + * tablespace storing the relation and compute the target_prefetch_pages, >> + * or just get the current target_prefetch_pages >> + */ >> + new_io_concurrency = get_tablespace_io_concurrency( >> + currentRelation->rd_rel->reltablespace); >> + >> + >> + scanstate->target_prefetch_pages = target_prefetch_pages; >> + >> + if (new_io_concurrency != effective_io_concurrency) >> + { >> + double prefetch_pages; >> + if (compute_io_concurrency(new_io_concurrency, &prefetch_pages)) >> + scanstate->target_prefetch_pages = rint(prefetch_pages); >> + } >> +#endif > > Maybe it's just me - but imo there should be as few USE_PREFETCH > dependant places in the code as possible. It'll just be 0 when not > supported, that's fine? Especially changing the size of externally > visible structs depending on a configure detected ifdef seems wrong to > me. > I removed these ifdefs, and the more problematic one in the struct. >> +bool >> +compute_io_concurrency(int io_concurrency, double *target_prefetch_pages) >> +{ >> + double new_prefetch_pages = 0.0; >> + int i; >> + >> + /* make sure the io_concurrency value is correct, it may have been forced >> + * with a pg_tablespace UPDATE >> + */ > > Nitpick: Wrong comment style (/* stands on its own line). > >> + if (io_concurrency > MAX_IO_CONCURRENCY) >> + io_concurrency = MAX_IO_CONCURRENCY; >> + >> + /*---------- >> + * The user-visible GUC parameter is the number of drives (spindles), >> + * which we need to translate to a number-of-pages-to-prefetch target. >> + * The target value is stashed in *extra and then assigned to the actual >> + * variable by assign_effective_io_concurrency. >> + * >> + * The expected number of prefetch pages needed to keep N drives busy is: >> + * >> + * drives | I/O requests >> + * -------+---------------- >> + * 1 | 1 >> + * 2 | 2/1 + 2/2 = 3 >> + * 3 | 3/1 + 3/2 + 3/3 = 5 1/2 >> + * 4 | 4/1 + 4/2 + 4/3 + 4/4 = 8 1/3 >> + * n | n * H(n) > > I know you just moved this code. But: I don't buy this formula. Like at > all. Doesn't queuing and reordering entirely invalidate the logic here? > Changing the formula, or changing the GUC to a number of pages to prefetch is still discussed, so no change here. > Perhaps more relevantly: Imo nodeBitmapHeapscan.c is the wrong place for > this. bufmgr.c maybe? > Moved to bufmgr.c > You also didn't touch > /* > * How many buffers PrefetchBuffer callers should try to stay ahead of their > * ReadBuffer calls by. This is maintained by the assign hook for > * effective_io_concurrency. Zero means "never prefetch". 
> */ > int target_prefetch_pages = 0; > which surely doesn't make sense anymore after these changes. > > But do we even need that variable now? I slighty updated the comment. If the table doesn't belong to a tablespace with an overloaded effective_io_concurrency, keeping this pre-computed target_prefetch_pages can save a few cycles on each execution, so I think it's better to keep it. > >> diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h >> index dc167f9..57008fc 100644 >> --- a/src/include/utils/guc.h >> +++ b/src/include/utils/guc.h >> @@ -26,6 +26,9 @@ >> #define MAX_KILOBYTES (INT_MAX / 1024) >> #endif >> >> +/* upper limit for effective_io_concurrency */ >> +#define MAX_IO_CONCURRENCY 1000 >> + >> /* >> * Automatic configuration file name for ALTER SYSTEM. >> * This file will be used to store values of configuration parameters >> @@ -256,6 +259,8 @@ extern int temp_file_limit; >> >> extern int num_temp_buffers; >> >> +extern int effective_io_concurrency; >> + > > target_prefetch_pages is declared in bufmgr.h - that seems like a better > place for these. > Moved to bufmgr.h As said in a previous mail, I also fixed a problem when having settings other than effective_io_concurrency for a tablespace lead to ignore the regular effective_io_concurrency. I also added the forgotten lock level (AccessExclusiveLock) for this tablespace setting, which was leading to a failed assert during initdb. Regards. -- Julien Rouhaud http://dalibo.com - http://dalibo.org
Julien Rouhaud wrote:
> Hi,
>
> Please find attached a v2 of the patch. See below for changes.

Pushed after smallish tweaks. Please test to verify I didn't break
anything.

(It's a pity that we can't add a regression test with a value other than
0.)

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 08/09/2015 18:00, Alvaro Herrera wrote:
> Julien Rouhaud wrote:
>> Hi,
>>
>> Please find attached a v2 of the patch. See below for changes.
>
> Pushed after smallish tweaks. Please test to verify I didn't break
> anything.
>

I just tried with all the cases I could think of, everything works fine.
Thanks!

> (It's a pity that we can't add a regression test with a value other than
> 0.)
>

--
Julien Rouhaud
http://dalibo.com - http://dalibo.org