Re: Allow a per-tablespace effective_io_concurrency setting - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Allow a per-tablespace effective_io_concurrency setting
Msg-id 55E76950.2070403@2ndquadrant.com
In response to Re: Allow a per-tablespace effective_io_concurrency setting  (Greg Stark <stark@mit.edu>)
List pgsql-hackers
Hi,

On 09/02/2015 08:49 PM, Greg Stark wrote:
> On 2 Sep 2015 14:54, "Andres Freund" <andres@anarazel.de> wrote:
>>
>>
>>> +     /*----------
>>> +      * The user-visible GUC parameter is the number of drives (spindles),
>>> +      * which we need to translate to a number-of-pages-to-prefetch target.
>>> +      * The target value is stashed in *extra and then assigned to the actual
>>> +      * variable by assign_effective_io_concurrency.
>>> +      *
>>> +      * The expected number of prefetch pages needed to keep N drives busy is:
>>> +      *
>>> +      * drives |   I/O requests
>>> +      * -------+----------------
>>> +      *      1 |   1
>>> +      *      2 |   2/1 + 2/2 = 3
>>> +      *      3 |   3/1 + 3/2 + 3/3 = 5 1/2
>>> +      *      4 |   4/1 + 4/2 + 4/3 + 4/4 = 8 1/3
>>> +      *      n |   n * H(n)
>>
>> I know you just moved this code. But: I don't buy this formula. Like at
>> all. Doesn't queuing and reordering entirely invalidate the logic here?
>
> I can take the blame for this formula.
>
> It's called the "Coupon Collector Problem": if you get a random
> coupon from a set of n possible coupons, how many random coupons
> would you have to collect before you expect to have at least one of
> each?
>
> This computation model assumes we have no information about which
> spindle each block will hit. That's basically true for
> bitmapheapscan in most cases, because the idea of bitmapheapscan is
> to be picking a sparse set of blocks, and there's no reason the
> blocks being read will have any regularity that causes them all to
> fall on the same spindles. If in fact you're reading a fairly dense
> set, then bitmapheapscan is probably a waste of time and simply
> reading sequentially would be exactly as fast or even faster.

There are different meanings of "busy". If I get the coupon collector
problem right (after quickly skimming the Wikipedia article today), it
effectively makes sure that each "spindle" has at least 1 request in the 
queue. Which sucks in practice, because on spinning rust it makes 
queuing (TCQ/NCQ) totally inefficient, and on SSDs it only saturates one 
of the multiple channels.
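
For reference, here is what the formula amounts to in code. This is a
minimal standalone sketch of the translation the quoted comment
describes (the helper name and the program around it are mine, not
the patch's):

    #include <stdio.h>

    /*
     * Expected number of in-flight requests needed so that each of n
     * "spindles" has at least one request queued, per the coupon
     * collector argument: n * H(n), where H(n) is the n-th harmonic
     * number. Computed as the sum of n/i for i = 1..n.
     */
    static double
    prefetch_target(int drives)
    {
        double      target = 0.0;

        for (int i = 1; i <= drives; i++)
            target += (double) drives / (double) i;

        return target;
    }

    int
    main(void)
    {
        /* reproduces the table above: 1, 3, 5.5, 8.33 */
        for (int n = 1; n <= 4; n++)
            printf("%d drives -> %.2f requests\n", n, prefetch_target(n));
        return 0;
    }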

On spinning drives, it's usually good to keep iodepth >= 4. For
example, this 10k Seagate drive [1] can do ~450 random IOPS with
iodepth=16, while a 10k drive should only be able to do ~150 IOPS with
iodepth=1. The other SAS drives in the review behave quite similarly.

[1] 
http://www.storagereview.com/seagate_enterprise_performance_10k_hdd_savvio_10k6_review

On SSDs the good values usually start at 16, depending on the model
(and controller) and the size (large SSDs are basically multiple small
ones glued together, and thus have more channels).

This is why the numbers from the coupon collector formula are way too
low in many cases. (OTOH this is done per backend, so if there are
multiple backends doing prefetching ...)

>
> We talked about this quite a bit back then and there was no dispute
> that the aim is to provide GUCs that mean something meaningful to the
> DBA who can actually measure them. They know how many spindles they
> have. They do not know what the optimal prefetch depth is and the only
> way to determine it would be to experiment with Postgres. Worse, I

As I explained, spindles have very little to do with it - you need
multiple I/O requests per device to get the benefit. Sure, DBAs should
know how many spindles they have and should be able to determine the
optimal IO depth. But we actually say this in the docs:
    A good starting point for this setting is the number of separate
    drives comprising a RAID 0 stripe or RAID 1 mirror being used for
    the database. (For RAID 5 the parity drive should not be counted.)
    However, if the database is often busy with multiple queries
    issued in concurrent sessions, lower values may be sufficient to
    keep the disk array busy. A value higher than needed to keep the
    disks busy will only result in extra CPU overhead.

So we recommend the number of drives as a good starting value, and then
warn against increasing the value further.

Moreover, ISTM it's very unclear what value to use even if you know the 
number of devices and optimal iodepth. Setting (devices * iodepth) 
doesn't really make much sense, because that effectively computes
    (devices*iodepth) * H(devices*iodepth)

which says "there are (devices*iodepth) devices, make sure there's at 
least one request for each of them", right? I guess we actually want
    (devices*iodepth) * H(devices)

Sadly that means we'd have to introduce another GUC, because we need to
track both ndevices and iodepth.

There probably is a value X so that
     X * H(X) ~= (devices*iodepth) * H(devices)

but it's far from clear that's what we need (it surely is not in the docs).
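
Just to illustrate, a brute-force sketch of finding that X (all names
here are mine, purely for illustration):

    #include <stdio.h>

    /* partial harmonic sum H(n) = 1 + 1/2 + ... + 1/n */
    static double
    harmonic(int n)
    {
        double      h = 0.0;

        for (int i = 1; i <= n; i++)
            h += 1.0 / i;
        return h;
    }

    /*
     * Smallest X with X * H(X) >= devices * iodepth * H(devices),
     * i.e. the single-GUC value that would produce the prefetch
     * depth we actually want.
     */
    static int
    equivalent_setting(int devices, int iodepth)
    {
        double      target = devices * iodepth * harmonic(devices);
        int         x = 1;

        while (x * harmonic(x) < target)
            x++;
        return x;
    }

    int
    main(void)
    {
        /* e.g. 4 drives with iodepth 4: target ~= 33.3, giving X = 12 */
        printf("X = %d\n", equivalent_setting(4, 4));
        return 0;
    }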


> think the above formula works for essentially random I/O but for
> more predictable I/O it might be possible to use a different formula.
> But if we made the GUC something low level like "how many blocks to
> prefetch" then we're left in the dark about how to handle that
> different access pattern.

Maybe. We only use this for Bitmap Heap Scan at this point, and I don't 
see any proposals to introduce it elsewhere. So no opinion.

>
> I did speak to a dm developer and he suggested that the kernel could
> help out with an API. He suggested something of the form "how many
> blocks do I have to read before the end of the current device". I
> wasn't sure exactly what we would do with something like that but it
> would be better than just guessing how many I/O operations we need
> to issue to keep all the spindles busy.

I don't really see how that would help us?

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


