Re: Allow a per-tablespace effective_io_concurrency setting - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: Allow a per-tablespace effective_io_concurrency setting
Msg-id: 55E76950.2070403@2ndquadrant.com
In response to: Re: Allow a per-tablespace effective_io_concurrency setting (Greg Stark <stark@mit.edu>)
List: pgsql-hackers
Hi,

On 09/02/2015 08:49 PM, Greg Stark wrote:
> On 2 Sep 2015 14:54, "Andres Freund" <andres@anarazel.de> wrote:
>>
>>> + /*----------
>>> +  * The user-visible GUC parameter is the number of drives (spindles),
>>> +  * which we need to translate to a number-of-pages-to-prefetch target.
>>> +  * The target value is stashed in *extra and then assigned to the actual
>>> +  * variable by assign_effective_io_concurrency.
>>> +  *
>>> +  * The expected number of prefetch pages needed to keep N drives busy is:
>>> +  *
>>> +  * drives |   I/O requests
>>> +  * -------+----------------
>>> +  *      1 |   1
>>> +  *      2 |   2/1 + 2/2 = 3
>>> +  *      3 |   3/1 + 3/2 + 3/3 = 5 1/2
>>> +  *      4 |   4/1 + 4/2 + 4/3 + 4/4 = 8 1/3
>>> +  *      n |   n * H(n)
>>
>> I know you just moved this code. But: I don't buy this formula. Like at
>> all. Doesn't queuing and reordering entirely invalidate the logic here?
>
> I can take the blame for this formula.
>
> It's called the "Coupon Collector Problem". If you get a random
> coupon from a set of n possible coupons, how many random coupons would
> you have to collect before you expect to have at least one of each?
>
> This computation model assumes we have no information about which
> spindle each block will hit. That's basically true for the case of
> bitmapheapscan in most cases, because the idea of bitmapheapscan is to
> be picking a sparse set of blocks, and there's no reason the blocks
> being read will have any regularity that causes them all to fall on
> the same spindles. If in fact you're reading a fairly dense set, then
> bitmapheapscan probably is a waste of time and simply reading
> sequentially would be exactly as fast or even faster.

There are different meanings of "busy". If I get the coupon collector
problem right (after quickly skimming the wikipedia article today), it
effectively makes sure that each "spindle" has at least 1 request in the
queue.
Which sucks in practice, because on spinning rust it makes queuing
(TCQ/NCQ) totally inefficient, and on SSDs it only saturates one of the
multiple channels.

On spinning drives, it's usually good to keep the iodepth >= 4. For
example this 10k Seagate drive [1] can do ~450 random IOPS with
iodepth=16, while a 10k drive should only be able to do ~150 IOPS with
iodepth=1. The other SAS drives behave quite similarly.

[1] http://www.storagereview.com/seagate_enterprise_performance_10k_hdd_savvio_10k6_review

On SSDs the good values usually start at 16, depending on the model (and
controller) and size (large SSDs are basically multiple small ones glued
together, thus have more channels).

This is why the numbers from the coupon collector model are way too low
in many cases. (OTOH this is done per backend, so if there are multiple
backends doing prefetching ...)

> We talked about this quite a bit back then and there was no dispute
> that the aim is to provide GUCs that mean something meaningful to the
> DBA who can actually measure them. They know how many spindles they
> have. They do not know what the optimal prefetch depth is and the only
> way to determine it would be to experiment with Postgres. Worse, I

As I explained, spindles have very little to do with it - you need
multiple I/O requests per device to get the benefit. Sure, the DBAs
should know how many spindles they have and should be able to determine
the optimal IO depth. But we actually say this in the docs:

    A good starting point for this setting is the number of separate
    drives comprising a RAID 0 stripe or RAID 1 mirror being used for
    the database. (For RAID 5 the parity drive should not be counted.)
    However, if the database is often busy with multiple queries issued
    in concurrent sessions, lower values may be sufficient to keep the
    disk array busy. A value higher than needed to keep the disks busy
    will only result in extra CPU overhead.
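As a side note, the n * H(n) coupon-collector target from the comment quoted at the top is easy to reproduce numerically. A standalone C sketch (not PostgreSQL code, names are mine):

```c
#include <assert.h>

/*
 * Expected number of outstanding prefetch requests needed so that each
 * of n drives has at least one request queued, per the coupon-collector
 * model in the quoted comment: n * H(n), where H(n) is the n-th
 * harmonic number. Reproduces the table above:
 * 1 -> 1, 2 -> 3, 3 -> 5 1/2, 4 -> 8 1/3.
 */
double prefetch_target(int n)
{
    double h = 0.0;                 /* accumulates H(n) */
    for (int i = 1; i <= n; i++)
        h += 1.0 / i;
    return n * h;
}
```

Note how slowly this grows: doubling the drive count does not even double the target, which is exactly why it only guarantees one request per device.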
So we recommend number of drives as a good starting value, and then warn
against increasing the value further.

Moreover, ISTM it's very unclear what value to use even if you know the
number of devices and the optimal iodepth. Setting (devices * iodepth)
doesn't really make much sense, because that effectively computes

    (devices * iodepth) * H(devices * iodepth)

which says "there are (devices * iodepth) devices, make sure there's at
least one request for each of them", right? I guess we actually want

    (devices * iodepth) * H(devices)

Sadly that means we'd have to introduce another GUC, because we need to
track both ndevices and iodepth. There probably is a value X such that

    X * H(X) ~= (devices * iodepth) * H(devices)

but it's far from clear that's what we need (it surely is not in the
docs).

> think the above formula works for essentially random I/O but for
> more predictable I/O it might be possible to use a different formula.
> But if we made the GUC something low level like "how many blocks to
> prefetch" then we're left in the dark about how to handle that
> different access pattern.

Maybe. We only use this in Bitmap Index Scan at this point, and I don't
see any proposals to introduce this to other places. So no opinion.

> I did speak to a dm developer and he suggested that the kernel could
> help out with an API. He suggested something of the form "how many
> blocks do I have to read before the end of the current device". I
> wasn't sure exactly what we would do with something like that but it
> would be better than just guessing how many I/O operations we need
> to issue to keep all the spindles busy.

I don't really see how that would help us?

regards

--
Tomas Vondra                   http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
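PS: the X * H(X) ~= (devices * iodepth) * H(devices) equivalence above can be checked numerically. A standalone C sketch (hypothetical helper names, not PostgreSQL code) that searches for the smallest single-GUC value reaching the target implied by an explicit (ndevices, iodepth) pair:

```c
#include <assert.h>

/* H(n): the n-th harmonic number, 1/1 + 1/2 + ... + 1/n. */
double harmonic(int n)
{
    double h = 0.0;
    for (int i = 1; i <= n; i++)
        h += 1.0 / i;
    return h;
}

/*
 * Smallest X with X * H(X) >= (devices * iodepth) * H(devices), i.e.
 * the single setting that implies at least the prefetch depth of the
 * hypothetical two-GUC (ndevices, iodepth) model discussed above.
 */
int equivalent_setting(int devices, int iodepth)
{
    double want = (double) devices * iodepth * harmonic(devices);
    int x = 1;
    while (x * harmonic(x) < want)
        x++;
    return x;
}
```

For example, under this model a single drive wanting iodepth 4 needs a setting of 3, not 4, and 4 drives at iodepth 4 need 12, not 16 - which illustrates the point that no X captures both parameters exactly.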