Thread: Re: [PERFORM] intel s3500 -- hot stuff
On 18/07/2015 12:03, Julien Rouhaud wrote:
> On 10/12/2014 17:52, Jeff Janes wrote:
>> On Tue, Dec 9, 2014 at 12:43 PM, Bruce Momjian <bruce@momjian.us> wrote:
>>
>> On Mon, Dec 8, 2014 at 03:40:43PM -0600, Merlin Moncure wrote:
>> > >> Did not see consistent measurable gains > 256
>> > >> effective_io_concurrency. Interesting that at setting of '2' (the
>> > >> lowest possible setting with the feature actually working) is
>> > >> pessimal.
>> > >
>> > > Very interesting. When we added a per-tablespace random_page_cost,
>> > > there was a suggestion that we might want to add per-tablespace
>> > > effective_io_concurrency someday:
>> >
>> > What I'd really like to see is to have effective_io_concurrency work
>> > on other types of scans. It's clearly a barn burner on fast storage
>> > and perhaps the default should be something other than '1'. Spinning
>> > storage is clearly dead and ssd seem to really benefit from the posix
>> > readhead api.
>>
>> I haven't played much with SSD, but effective_io_concurrency can be a
>> big win even on spinning disk.
>>
>> Well, the real question is knowing which blocks to request before
>> actually needing them. With a bitmap scan, that is easy --- I am
>> unclear how to do it for other scans. We already have kernel read-ahead
>> for sequential scans, and any index scan that hits multiple rows will
>> probably already be using a bitmap heap scan.
>>
>> If the index scan is used to provide ordering as well as selectivity
>> then it will resist being converted to a bitmap scan. Also it won't
>> convert to a bitmap scan solely to get credit for the use of
>> effective_io_concurrency, as that setting doesn't enter into planning
>> decisions.
>>
>> For a regular index scan, it should be easy to prefetch table blocks for
>> all the tuples that will need to be retrieved based on the current index
>> leaf page, for example. Looking ahead across leaf page boundaries would
>> be harder.
>
> I also think that having effective_io_concurrency for other nodes than
> bitmap scan would be really great, but for now having a per-tablespace
> effective_io_concurrency is simpler to implement and will already help,
> so here's a patch to implement it. I'm also adding it to the next
> commitfest.

I didn't know that the thread must exists on -hackers to be able to add
a commitfest entry, so I transfer the thread here. Sorry for the double
post.

--
Julien Rouhaud
http://dalibo.com - http://dalibo.org
Hi,

On 2015-07-18 12:17:39 +0200, Julien Rouhaud wrote:
> I didn't know that the thread must exists on -hackers to be able to add
> a commitfest entry, so I transfer the thread here.

Please, in the future, also update the title of the thread to something
fitting.

> @@ -539,6 +541,9 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
>  {
>  	BitmapHeapScanState *scanstate;
>  	Relation	currentRelation;
> +#ifdef USE_PREFETCH
> +	int			new_io_concurrency;
> +#endif
>
>  	/* check for unsupported flags */
>  	Assert(!(eflags & (EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK)));
> @@ -598,6 +603,25 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
>  	 */
>  	currentRelation = ExecOpenScanRelation(estate, node->scan.scanrelid, eflags);
>
> +#ifdef USE_PREFETCH
> +	/* check if the effective_io_concurrency has been overloaded for the
> +	 * tablespace storing the relation and compute the target_prefetch_pages,
> +	 * or just get the current target_prefetch_pages
> +	 */
> +	new_io_concurrency = get_tablespace_io_concurrency(
> +			currentRelation->rd_rel->reltablespace);
> +
> +	scanstate->target_prefetch_pages = target_prefetch_pages;
> +
> +	if (new_io_concurrency != effective_io_concurrency)
> +	{
> +		double		prefetch_pages;
> +		if (compute_io_concurrency(new_io_concurrency, &prefetch_pages))
> +			scanstate->target_prefetch_pages = rint(prefetch_pages);
> +	}
> +#endif

Maybe it's just me - but imo there should be as few USE_PREFETCH
dependant places in the code as possible. It'll just be 0 when not
supported, that's fine? Especially changing the size of externally
visible structs depending on a configure detected ifdef seems wrong to
me.

> +bool
> +compute_io_concurrency(int io_concurrency, double *target_prefetch_pages)
> +{
> +	double		new_prefetch_pages = 0.0;
> +	int			i;
> +
> +	/* make sure the io_concurrency value is correct, it may have been forced
> +	 * with a pg_tablespace UPDATE
> +	 */

Nitpick: Wrong comment style (/* stands on its own line).

> +	if (io_concurrency > MAX_IO_CONCURRENCY)
> +		io_concurrency = MAX_IO_CONCURRENCY;
> +
> +	/*----------
> +	 * The user-visible GUC parameter is the number of drives (spindles),
> +	 * which we need to translate to a number-of-pages-to-prefetch target.
> +	 * The target value is stashed in *extra and then assigned to the actual
> +	 * variable by assign_effective_io_concurrency.
> +	 *
> +	 * The expected number of prefetch pages needed to keep N drives busy is:
> +	 *
> +	 * drives |   I/O requests
> +	 * -------+----------------
> +	 *		1 |   1
> +	 *		2 |   2/1 + 2/2 = 3
> +	 *		3 |   3/1 + 3/2 + 3/3 = 5 1/2
> +	 *		4 |   4/1 + 4/2 + 4/3 + 4/4 = 8 1/3
> +	 *		n |   n * H(n)

I know you just moved this code. But: I don't buy this formula. Like at
all. Doesn't queuing and reordering entirely invalidate the logic here?

Perhaps more relevantly: Imo nodeBitmapHeapscan.c is the wrong place for
this. bufmgr.c maybe?

You also didn't touch
/*
 * How many buffers PrefetchBuffer callers should try to stay ahead of their
 * ReadBuffer calls by. This is maintained by the assign hook for
 * effective_io_concurrency. Zero means "never prefetch".
 */
int			target_prefetch_pages = 0;
which surely doesn't make sense anymore after these changes.

But do we even need that variable now?

> diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
> index dc167f9..57008fc 100644
> --- a/src/include/utils/guc.h
> +++ b/src/include/utils/guc.h
> @@ -26,6 +26,9 @@
>  #define MAX_KILOBYTES	(INT_MAX / 1024)
>  #endif
>
> +/* upper limit for effective_io_concurrency */
> +#define MAX_IO_CONCURRENCY 1000
> +
>  /*
>   * Automatic configuration file name for ALTER SYSTEM.
>   * This file will be used to store values of configuration parameters
> @@ -256,6 +259,8 @@ extern int	temp_file_limit;
>
>  extern int	num_temp_buffers;
>
> +extern int	effective_io_concurrency;
> +

target_prefetch_pages is declared in bufmgr.h - that seems like a better
place for these.

Greetings,

Andres Freund
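For readers following along, the translation described in the quoted comment boils down to a harmonic sum: for n drives the prefetch target is n * H(n) pages. A minimal sketch of that calculation (the function name and guards here are illustrative, not necessarily what the patch does):

    #include <limits.h>

    static bool
    compute_prefetch_pages(int io_concurrency, double *target_prefetch_pages)
    {
        double  pages = 0.0;
        int     i;

        /* n * H(n) = n/1 + n/2 + ... + n/n */
        for (i = 1; i <= io_concurrency; i++)
            pages += (double) io_concurrency / (double) i;

        *target_prefetch_pages = pages;

        /* reject values that would not fit the integer target */
        return (pages >= 0.0 && pages < (double) INT_MAX);
    }

For example, io_concurrency = 4 gives 4/1 + 4/2 + 4/3 + 4/4 ≈ 8.33, matching the table in the quoted comment.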
Hi

On 09/02/2015 03:53 PM, Andres Freund wrote:
>
> Hi,
>
> On 2015-07-18 12:17:39 +0200, Julien Rouhaud wrote:
>> I didn't know that the thread must exists on -hackers to be able to add
>> a commitfest entry, so I transfer the thread here.
>
> Please, in the future, also update the title of the thread to something
> fitting.
>
>> @@ -539,6 +541,9 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
>>  {
>>  	BitmapHeapScanState *scanstate;
>>  	Relation	currentRelation;
>> +#ifdef USE_PREFETCH
>> +	int			new_io_concurrency;
>> +#endif
>>
>>  	/* check for unsupported flags */
>>  	Assert(!(eflags & (EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK)));
>> @@ -598,6 +603,25 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
>>  	 */
>>  	currentRelation = ExecOpenScanRelation(estate, node->scan.scanrelid, eflags);
>>
>> +#ifdef USE_PREFETCH
>> +	/* check if the effective_io_concurrency has been overloaded for the
>> +	 * tablespace storing the relation and compute the target_prefetch_pages,
>> +	 * or just get the current target_prefetch_pages
>> +	 */
>> +	new_io_concurrency = get_tablespace_io_concurrency(
>> +			currentRelation->rd_rel->reltablespace);
>> +
>> +	scanstate->target_prefetch_pages = target_prefetch_pages;
>> +
>> +	if (new_io_concurrency != effective_io_concurrency)
>> +	{
>> +		double		prefetch_pages;
>> +		if (compute_io_concurrency(new_io_concurrency, &prefetch_pages))
>> +			scanstate->target_prefetch_pages = rint(prefetch_pages);
>> +	}
>> +#endif
>
> Maybe it's just me - but imo there should be as few USE_PREFETCH
> dependant places in the code as possible. It'll just be 0 when not
> supported, that's fine?

It's not just you. Dealing with code with plenty of ifdefs is annoying -
it compiles just fine most of the time, until you compile it with some
rare configuration. Then it either starts producing strange warnings or
the compilation fails entirely.

It might make a tiny difference on builds without prefetching support
because of code size, but realistically I think it's noise.

> Especially changing the size of externally visible structs depending
> on a configure detected ifdef seems wrong to me.

+100 to that

>> +	/*----------
>> +	 * The user-visible GUC parameter is the number of drives (spindles),
>> +	 * which we need to translate to a number-of-pages-to-prefetch target.
>> +	 * The target value is stashed in *extra and then assigned to the actual
>> +	 * variable by assign_effective_io_concurrency.
>> +	 *
>> +	 * The expected number of prefetch pages needed to keep N drives busy is:
>> +	 *
>> +	 * drives |   I/O requests
>> +	 * -------+----------------
>> +	 *		1 |   1
>> +	 *		2 |   2/1 + 2/2 = 3
>> +	 *		3 |   3/1 + 3/2 + 3/3 = 5 1/2
>> +	 *		4 |   4/1 + 4/2 + 4/3 + 4/4 = 8 1/3
>> +	 *		n |   n * H(n)
>
> I know you just moved this code. But: I don't buy this formula. Like at
> all. Doesn't queuing and reordering entirely invalidate the logic here?

Well, even the comment right next after the formula says that:

 * Experimental results show that both of these formulas aren't
 * aggressive enough, but we don't really have any better proposals.

That's the reason why users generally either use 0 or some rather high
value (16 or 32 are the most common values seen). The problem is that we
don't really care about the number of spindles (and not just because SSDs
don't have them at all), but about the target queue length per device.
Spinning rust uses TCQ/NCQ to optimize the head movement, SSDs are
parallel by nature (stacking multiple chips with separate channels).

I doubt we can really improve the formula, except maybe for saying "we
want 16 requests per device" and multiplying the number by that. We don't
really have the necessary introspection to determine better values (and
it's not really possible, because the devices may be hidden behind a RAID
controller or a SAN). So we can't really do much.

Maybe the best thing we can do is just completely abandon the "number of
spindles" idea, and just say "number of I/O requests to prefetch".
Possibly with an explanation of how to estimate it (devices * queue
length).

regards

--
Tomas Vondra                   http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2015-09-02 18:06:54 +0200, Tomas Vondra wrote: > Maybe the best thing we can do is just completely abandon the "number of > spindles" idea, and just say "number of I/O requests to prefetch". Possibly > with an explanation of how to estimate it (devices * queue length). I think that'd be a lot better.
On 2 Sep 2015 14:54, "Andres Freund" <andres@anarazel.de> wrote:
>
> > +	/*----------
> > +	 * The user-visible GUC parameter is the number of drives (spindles),
> > +	 * which we need to translate to a number-of-pages-to-prefetch target.
> > +	 * The target value is stashed in *extra and then assigned to the actual
> > +	 * variable by assign_effective_io_concurrency.
> > +	 *
> > +	 * The expected number of prefetch pages needed to keep N drives busy is:
> > +	 *
> > +	 * drives |   I/O requests
> > +	 * -------+----------------
> > +	 *		1 |   1
> > +	 *		2 |   2/1 + 2/2 = 3
> > +	 *		3 |   3/1 + 3/2 + 3/3 = 5 1/2
> > +	 *		4 |   4/1 + 4/2 + 4/3 + 4/4 = 8 1/3
> > +	 *		n |   n * H(n)
>
> I know you just moved this code. But: I don't buy this formula. Like at
> all. Doesn't queuing and reordering entirely invalidate the logic here?

I can take the blame for this formula.

It's called the "Coupon Collector Problem". If you get a random coupon
from a set of n possible coupons, how many random coupons would you have
to collect before you expect to have at least one of each?

This computation model assumes we have no information about which spindle
each block will hit. That's basically true for the case of bitmapheapscan
for most cases because the idea of bitmapheapscan is to be picking a
sparse set of blocks and there's no reason the blocks being read will
have any regularity that causes them all to fall on the same spindles. If
in fact you're reading a fairly dense set then bitmapheapscan probably is
a waste of time and simply reading sequentially would be exactly as fast
or even faster.

We talked about this quite a bit back then and there was no dispute that
the aim is to provide GUCs that mean something meaningful to the DBA who
can actually measure them. They know how many spindles they have. They do
not know what the optimal prefetch depth is and the only way to determine
it would be to experiment with Postgres. Worse, I think the above formula
works for essentially random I/O but for more predictable I/O it might be
possible to use a different formula. But if we made the GUC something low
level like "how many blocks to prefetch" then we're left in the dark
about how to handle that different access pattern.

I did speak to a dm developer and he suggested that the kernel could help
out with an API. He suggested something of the form "how many blocks do I
have to read before the end of the current device". I wasn't sure exactly
what we would do with something like that but it would be better than
just guessing how many I/O operations we need to issue to keep all the
spindles busy.
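The n * H(n) figure is the standard coupon-collector expectation; a quick simulation makes it concrete. This is purely illustrative code, not part of the patch:

    #include <stdio.h>
    #include <stdlib.h>

    int
    main(void)
    {
        const int   n = 4;              /* number of "spindles" */
        const int   trials = 100000;
        long        total_draws = 0;

        for (int t = 0; t < trials; t++)
        {
            int     seen[4] = {0};
            int     distinct = 0;

            /* draw random spindles until each one has been hit at least once */
            while (distinct < n)
            {
                int     s = rand() % n;

                if (!seen[s])
                {
                    seen[s] = 1;
                    distinct++;
                }
                total_draws++;
            }
        }

        /* expect roughly 4 * (1 + 1/2 + 1/3 + 1/4) = 8.33 */
        printf("average draws: %.2f\n", (double) total_draws / trials);
        return 0;
    }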
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Hi, On 02/09/2015 18:06, Tomas Vondra wrote: > Hi > > On 09/02/2015 03:53 PM, Andres Freund wrote: >> >> Hi, >> >> On 2015-07-18 12:17:39 +0200, Julien Rouhaud wrote: >>> I didn't know that the thread must exists on -hackers to be >>> able to add a commitfest entry, so I transfer the thread here. >> >> Please, in the future, also update the title of the thread to >> something fitting. >> Sorry for that. >>> @@ -539,6 +541,9 @@ ExecInitBitmapHeapScan(BitmapHeapScan >>> *node, EState *estate, int eflags) { BitmapHeapScanState >>> *scanstate; Relation currentRelation; +#ifdef USE_PREFETCH + >>> int new_io_concurrency; +#endif >>> >>> /* check for unsupported flags */ Assert(!(eflags & >>> (EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK))); @@ -598,6 +603,25 @@ >>> ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, >>> int eflags) */ currentRelation = ExecOpenScanRelation(estate, >>> node->scan.scanrelid, eflags); >>> >>> +#ifdef USE_PREFETCH + /* check if the >>> effective_io_concurrency has been overloaded for the + * >>> tablespace storing the relation and compute the >>> target_prefetch_pages, + * or just get the current >>> target_prefetch_pages + */ + new_io_concurrency = >>> get_tablespace_io_concurrency( + >>> currentRelation->rd_rel->reltablespace); + + + >>> scanstate->target_prefetch_pages = target_prefetch_pages; + + >>> if (new_io_concurrency != effective_io_concurrency) + { + >>> double prefetch_pages; + if >>> (compute_io_concurrency(new_io_concurrency, &prefetch_pages)) + >>> scanstate->target_prefetch_pages = rint(prefetch_pages); + >>> } +#endif >> >> Maybe it's just me - but imo there should be as few USE_PREFETCH >> dependant places in the code as possible. It'll just be 0 when >> not supported, that's fine? > > It's not just you. Dealing with code with plenty of ifdefs is > annoying - it compiles just fine most of the time, until you > compile it with some rare configuration. Then it either starts > producing strange warnings or the compilation fails entirely. > > It might make a tiny difference on builds without prefetching > support because of code size, but realistically I think it's > noise. > >> Especially changing the size of externally visible structs >> depending ona configure detected ifdef seems wrong to me. > > +100 to that > I totally agree. I'll remove the ifdefs. >> Nitpick: Wrong comment style (/* stands on its own line). I did run pgindent before submitting patch, but apparently I picked the wrong one. Already fixed in local branch. >>> + /*---------- + * The user-visible GUC parameter is the >>> number of drives (spindles), + * which we need to translate >>> to a number-of-pages-to-prefetch target. + * The target >>> value is stashed in *extra and then assigned to the actual + >>> * variable by assign_effective_io_concurrency. + * + * >>> The expected number of prefetch pages needed to keep N drives >>> busy is: + * + * drives | I/O requests + * >>> -------+---------------- + * 1 | 1 + * >>> 2 | 2/1 + 2/2 = 3 + * 3 | 3/1 + 3/2 + 3/3 = 5 >>> 1/2 + * 4 | 4/1 + 4/2 + 4/3 + 4/4 = 8 1/3 + * >>> n | n * H(n) >> >> I know you just moved this code. But: I don't buy this formula. >> Like at all. Doesn't queuing and reordering entirely invalidate >> the logic here? > > Well, even the comment right next after the formula says that: > > * Experimental results show that both of these formulas aren't * > aggressiveenough, but we don't really have any better proposals. 
> > That's the reason why users generally either use 0 or some rather > high value (16 or 32 are the most common values see). The problem > is that we don't really care about the number of spindles (and not > just because SSDs don't have them at all), but about the target > queue length per device. Spinning rust uses TCQ/NCQ to optimize the > head movement, SSDs are parallel by nature (stacking multiple chips > with separate channels). > > I doubt we can really improve the formula, except maybe for saying > "we want 16 requests per device" and multiplying the number by > that. We don't really have the necessary introspection to determine > better values (and it's not really possible, because the devices > may be hidden behind a RAID controller (or a SAN). So we can't > really do much. > > Maybe the best thing we can do is just completely abandon the > "number of spindles" idea, and just say "number of I/O requests to > prefetch". Possibly with an explanation of how to estimate it > (devices * queue length). > >> I think that'd be a lot better. +1 for that too. If everone's ok with this change, I can submit a patch for that too. Should I split that into two patches, and/or start a new thread? - -- Julien Rouhaud http://dalibo.com - http://dalibo.org -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.17 (GNU/Linux) iQEcBAEBAgAGBQJV50opAAoJELGaJ8vfEpOqve4H/0ZJCoFb0wHtArkGye6ietks 9uahdJy5ixO4J+AZsf2mVxV/DZK7dhK8rWIXt6yS3kfYfPDB79cRFWU5EgjEGAHB qcB7wXCa5HibqLySgQct3zhVDj3CN3ucKT3LVp9OC9mrH2mnGtAp7PYkjd/HqBwd AzW05pu21Ivi/z2gUBOdxNEEDxCLu8T1OtYq3WY9l7Yc4HfLG3huYLQD2LoRFRFn lWwhQifML6uKzz7w3MfZrK4i2ffGGv/r1afHcpZvN3UsB5te1fSzr8KcUeJL7+Zg xJTKwppiEMHpxokn5lw4LzYkjYD7W1fvwLnJhzRrs7dXGPl6rZtLmasyCld4FVk= =r2jE -----END PGP SIGNATURE-----
On 02/09/2015 15:53, Andres Freund wrote:
> On 2015-07-18 12:17:39 +0200, Julien Rouhaud wrote:
>
> You also didn't touch
> /*
>  * How many buffers PrefetchBuffer callers should try to stay ahead of their
>  * ReadBuffer calls by. This is maintained by the assign hook for
>  * effective_io_concurrency. Zero means "never prefetch".
>  */
> int			target_prefetch_pages = 0;
> which surely doesn't make sense anymore after these changes.
>
> But do we even need that variable now?
>

I thought this was related to the effective_io_concurrency GUC (possibly
overloaded by the per-tablespace setting), so I didn't make any change on
that.

I also just found an issue with my previous patch: the global
effective_io_concurrency GUC was ignored if the tablespace had a specific
seq_page_cost or random_page_cost setting. I just fixed that in my local
branch.

>> diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
>> index dc167f9..57008fc 100644
>> --- a/src/include/utils/guc.h
>> +++ b/src/include/utils/guc.h
>> @@ -26,6 +26,9 @@
>>  #define MAX_KILOBYTES	(INT_MAX / 1024)
>>  #endif
>>
>> +/* upper limit for effective_io_concurrency */
>> +#define MAX_IO_CONCURRENCY 1000
>> +
>>  /*
>>   * Automatic configuration file name for ALTER SYSTEM.
>>   * This file will be used to store values of configuration parameters
>> @@ -256,6 +259,8 @@ extern int	temp_file_limit;
>>
>>  extern int	num_temp_buffers;
>>
>> +extern int	effective_io_concurrency;
>> +
>
> target_prefetch_pages is declared in bufmgr.h - that seems like a
> better place for these.
>

I was rather sceptical about that too. I'll move these to bufmgr.h.

Regards.

--
Julien Rouhaud
http://dalibo.com - http://dalibo.org
Hi,

On 09/02/2015 08:49 PM, Greg Stark wrote:
> On 2 Sep 2015 14:54, "Andres Freund" <andres@anarazel.de> wrote:
>>
>>> +	/*----------
>>> +	 * The user-visible GUC parameter is the number of drives (spindles),
>>> +	 * which we need to translate to a number-of-pages-to-prefetch target.
>>> +	 * The target value is stashed in *extra and then assigned to the actual
>>> +	 * variable by assign_effective_io_concurrency.
>>> +	 *
>>> +	 * The expected number of prefetch pages needed to keep N drives busy is:
>>> +	 *
>>> +	 * drives |   I/O requests
>>> +	 * -------+----------------
>>> +	 *		1 |   1
>>> +	 *		2 |   2/1 + 2/2 = 3
>>> +	 *		3 |   3/1 + 3/2 + 3/3 = 5 1/2
>>> +	 *		4 |   4/1 + 4/2 + 4/3 + 4/4 = 8 1/3
>>> +	 *		n |   n * H(n)
>>
>> I know you just moved this code. But: I don't buy this formula. Like at
>> all. Doesn't queuing and reordering entirely invalidate the logic here?
>
> I can take the blame for this formula.
>
> It's called the "Coupon Collector Problem". If you get a random
> coupon from a set of n possible coupons, how many random coupons would
> you have to collect before you expect to have at least one of each.
>
> This computation model assumes we have no information about which
> spindle each block will hit. That's basically true for the case of
> bitmapheapscan for most cases because the idea of bitmapheapscan is to
> be picking a sparse set of blocks and there's no reason the blocks
> being read will have any regularity that causes them all to fall on
> the same spindles. If in fact you're reading a fairly dense set then
> bitmapheapscan probably is a waste of time and simply reading
> sequentially would be exactly as fast or even faster.

There are different meanings of "busy". If I get the coupon collector
problem right (after quickly skimming the wikipedia article today), it
effectively makes sure that each "spindle" has at least 1 request in the
queue. Which sucks in practice, because on spinning rust it makes queuing
(TCQ/NCQ) totally inefficient, and on SSDs it only saturates one of the
multiple channels.

On spinning drives, it's usually good to keep the iodepth>=4. For example
this 10k Seagate drive [1] can do ~450 random IOPS with iodepth=16, while
a 10k drive should be able to do ~150 IOPS (with iodepth=1). The other
SAS drives behave quite similarly.

[1] http://www.storagereview.com/seagate_enterprise_performance_10k_hdd_savvio_10k6_review

On SSDs the good values usually start at 16, depending on the model (and
controller), and size (large SSDs are basically multiple small ones glued
together, thus have more channels).

This is why the numbers from coupon collector are way too low in many
cases. (OTOH this is done per backend, so if there are multiple backends
doing prefetching ...)

> We talked about this quite a bit back then and there was no dispute
> that the aim is to provide GUCs that mean something meaningful to the
> DBA who can actually measure them. They know how many spindles they
> have. They do not know what the optimal prefetch depth is and the only
> way to determine it would be to experiment with Postgres. Worse, I

As I explained, spindles have very little to do with it - you need
multiple I/O requests per device, to get the benefit. Sure, the DBAs
should know how many spindles they have and should be able to determine
optimal IO depth. But we actually say this in the docs:

   A good starting point for this setting is the number of separate
   drives comprising a RAID 0 stripe or RAID 1 mirror being used for the
   database. (For RAID 5 the parity drive should not be counted.)
   However, if the database is often busy with multiple queries issued in
   concurrent sessions, lower values may be sufficient to keep the disk
   array busy. A value higher than needed to keep the disks busy will
   only result in extra CPU overhead.

So we recommend number of drives as a good starting value, and then warn
against increasing the value further.

Moreover, ISTM it's very unclear what value to use even if you know the
number of devices and optimal iodepth. Setting (devices * iodepth)
doesn't really make much sense, because that effectively computes

   (devices*iodepth) * H(devices*iodepth)

which says "there are (devices*iodepth) devices, make sure there's at
least one request for each of them", right? I guess we actually want

   (devices*iodepth) * H(devices)

Sadly that means we'd have to introduce another GUC, because we need to
track both ndevices and iodepth. There probably is a value X so that

   X * H(X) ~= (devices*iodepth) * H(devices)

but it's far from clear that's what we need (it surely is not in the
docs).

> think the above formula works for essentially random I/O but for
> more predictable I/O it might be possible to use a different formula.
> But if we made the GUC something low level like "how many blocks to
> prefetch" then we're left in the dark about how to handle that
> different access pattern.

Maybe. We only use this in Bitmap Index Scan at this point, and I don't
see any proposals to introduce this to other places. So no opinion.

> I did speak to a dm developer and he suggested that the kernel could
> help out with an API. He suggested something of the form "how many
> blocks do I have to read before the end of the current device". I
> wasn't sure exactly what we would do with something like that but it
> would be better than just guessing how many I/O operations we need
> to issue to keep all the spindles busy.

I don't really see how that would help us?

regards

--
Tomas Vondra                   http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
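To illustrate that last point: if the target really is (devices * iodepth) * H(devices) prefetched pages, one can solve numerically for the single effective_io_concurrency value X whose X * H(X) comes closest. A hypothetical helper, just to show the arithmetic (not proposed for the tree):

    static double
    harmonic(int n)
    {
        double  h = 0.0;
        int     i;

        for (i = 1; i <= n; i++)
            h += 1.0 / i;
        return h;
    }

    static int
    equivalent_io_concurrency(int devices, int iodepth)
    {
        double  wanted = devices * iodepth * harmonic(devices);
        int     x;

        /* smallest X with X * H(X) >= wanted, capped at the current GUC maximum */
        for (x = 1; x < 1000; x++)
            if (x * harmonic(x) >= wanted)
                break;
        return x;
    }

With devices = 4 and iodepth = 16 this wants roughly 133 prefetched pages, which corresponds to an effective_io_concurrency in the low thirties - far above what the current "number of spindles" wording would lead a DBA to set.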
On 09/02/2015 02:25 PM, Tomas Vondra wrote: > > As I explained, spindles have very little to do with it - you need > multiple I/O requests per device, to get the benefit. Sure, the DBAs > should know how many spindles they have and should be able to determine > optimal IO depth. But we actually say this in the docs: My experience with performance tuning is that values above 3 have no real effect on how queries are executed. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Wed, Sep 2, 2015 at 4:31 PM, Josh Berkus <josh@agliodbs.com> wrote: > On 09/02/2015 02:25 PM, Tomas Vondra wrote: >> >> As I explained, spindles have very little to do with it - you need >> multiple I/O requests per device, to get the benefit. Sure, the DBAs >> should know how many spindles they have and should be able to determine >> optimal IO depth. But we actually say this in the docs: > > My experience with performance tuning is that values above 3 have no > real effect on how queries are executed. That's the exact opposite of my findings on intel S3500 (see: http://www.postgresql.org/message-id/CAHyXU0yiVvfQAnR9cyH=HWh1WbLRsioe=mzRJTHwtr=2azsTdQ@mail.gmail.com). merlin
On Wed, Sep 2, 2015 at 2:31 PM, Josh Berkus <josh@agliodbs.com> wrote:
On 09/02/2015 02:25 PM, Tomas Vondra wrote:
>
> As I explained, spindles have very little to do with it - you need
> multiple I/O requests per device, to get the benefit. Sure, the DBAs
> should know how many spindles they have and should be able to determine
> optimal IO depth. But we actually say this in the docs:
My experience with performance tuning is that values above 3 have no
real effect on how queries are executed.
Perhaps one reason is that the planner assumes it will get no benefit from this setting, meaning it is somewhat unlikely to choose the types of plans which would actually show a benefit from higher values.
Cheers,
Jeff
On 2015-09-02 14:31:35 -0700, Josh Berkus wrote:
> On 09/02/2015 02:25 PM, Tomas Vondra wrote:
> >
> > As I explained, spindles have very little to do with it - you need
> > multiple I/O requests per device, to get the benefit. Sure, the DBAs
> > should know how many spindles they have and should be able to determine
> > optimal IO depth. But we actually say this in the docs:
>
> My experience with performance tuning is that values above 3 have no
> real effect on how queries are executed.

I saw pretty much the opposite - the benefits seldomly were significant
below 30 or so. Even on single disks.

Which actually isn't that surprising - to be actually beneficial (that
is, turn an IO into a CPU bound workload) the prefetched buffer needs to
actually have been read in by the time it's needed. In many queries
processing a single heap page takes far shorter than prefetching the
data from storage, even if it's on good SSDs.

Therefore what you actually need is a queue of prefetches for the next
XX buffers so that between starting a prefetch and actually needing the
buffer enough time has passed that the data is completely read in. And
the point is that that's the case even for a single rotating disk!

Greetings,

Andres Freund
On 2015-09-02 19:49:13 +0100, Greg Stark wrote: > I can take the blame for this formula. > > It's called the "Coupon Collector Problem". If you hit get a random > coupon from a set of n possible coupons, how many random coupons would > you have to collect before you expect to have at least one of each. My point is that that's just the entirely wrong way to model prefetching. Prefetching can be massively beneficial even if you only have a single platter! Even if there were no queues on the hardware or OS level! Concurrency isn't the right way to look at prefetching. You need to prefetch so far ahead that you'll never block on reading heap pages - and that's only the case if processing the next N heap blocks takes longer than the prefetch of the N+1 th page. That doesn't mean there continously have to be N+1 prefetches in progress - in fact that actually often will only be the case for the first few, after that you hopefully are bottlnecked on CPU. If you additionally take into account hardware realities where you have multiple platters, multiple spindles, command queueing etc, that's even more true. A single rotation of a single platter with command queuing can often read several non consecutive blocks if they're on a similar - Andres
On 09/03/2015 12:23 AM, Andres Freund wrote: > On 2015-09-02 14:31:35 -0700, Josh Berkus wrote: >> On 09/02/2015 02:25 PM, Tomas Vondra wrote: >>> >>> As I explained, spindles have very little to do with it - you need >>> multiple I/O requests per device, to get the benefit. Sure, the DBAs >>> should know how many spindles they have and should be able to determine >>> optimal IO depth. But we actually say this in the docs: >> >> My experience with performance tuning is that values above 3 have no >> real effect on how queries are executed. > > I saw pretty much the opposite - the benefits seldomly were > significant below 30 or so. Even on single disks. That's a bit surprising, especially considering that e_i_c=30 means ~100 pages to prefetch if I'm doing the math right. AFAIK queue depth for SATA drives generally is 32 (so prefetching 100 pages should not make a difference), 256 for SAS drives and ~1000 for most current RAID controllers. I'm not entirely surprised that values beyond 30 make a difference, but that you seldomly saw significant improvements below this value. No doubt there are workloads like that, but I'd expect them to be quite rare and not prevalent as you're claiming. > Which actually isn't that surprising - to be actually beneficial > (that is, turn an IO into a CPU bound workload) the prefetched buffer > needs to actually have been read in by the time its needed. In many > queries processing a single heap page takes far shorter than > prefetching the data from storage, even if it's on good SSDs.> > Therefore what you actually need is a queue of prefetches for the > next XX buffers so that between starting a prefetch and actually > needing the buffer ienough time has passed that the data is > completely read in. And the point is that that's the case even for a > single rotating disk! So instead of "How many blocks I need to prefetch to saturate the devices?" you're asking "How many blocks I need to prefetch to never actually wait for the I/O?" I do like this view, but I'm not really sure how could we determine the right value? It seems to be very dependent on hardware and workload. For spinning drives the speedup comes from optimizing random seeks to a more optimal path (thanks to NCQ/TCQ), and on SSDs thanks to using the parallel channels (and possibly faster access to the same block). I guess the best thing we could do at this level is simply keep the on-device queues fully saturated, no? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2015-09-03 01:59:13 +0200, Tomas Vondra wrote: > That's a bit surprising, especially considering that e_i_c=30 means ~100 > pages to prefetch if I'm doing the math right. > > AFAIK queue depth for SATA drives generally is 32 (so prefetching 100 pages > should not make a difference), 256 for SAS drives and ~1000 for most current > RAID controllers. I think the point is that after an initial buildup - we'll again only prefetch pages in smaller increments because we already prefetched most pages. The actual number of concurrent reads will then be determined by how fast pages are processed. A prefetch_target of a 100 does *not* mean 100 requests are continously in flight. And even if it would, the OS won't issue many more requests than the queue depth, so they'll just sit in the OS queue. > So instead of "How many blocks I need to prefetch to saturate the devices?" > you're asking "How many blocks I need to prefetch to never actually wait for > the I/O?" Yes, pretty much. > I do like this view, but I'm not really sure how could we determine the > right value? It seems to be very dependent on hardware and workload. Right. > For spinning drives the speedup comes from optimizing random seeks to a more > optimal path (thanks to NCQ/TCQ), and on SSDs thanks to using the parallel > channels (and possibly faster access to the same block). + OS reordering & coalescing. Don't forget that the OS processes the OS IO queues while the userland process isn't scheduled - in a concurrent workload with more processes than hardware threads that means that pending OS requests are being issued while the query isn't actively being processed. > I guess the best thing we could do at this level is simply keep the > on-device queues fully saturated, no? Well, being too aggressive can hurt throughput and latency of concurrent processes, without being beneficial. Andres Freund
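For context on the "initial buildup" mentioned above: the bitmap heap scan executor already ramps its prefetch distance up gradually rather than issuing target_prefetch_pages requests at once. Roughly, paraphrasing the logic in nodeBitmapHeapscan.c from memory (details may differ from the actual tree):

    /* grow the prefetch distance gradually, up to the configured target */
    if (node->prefetch_target >= target_prefetch_pages)
        ;                           /* don't increase any further */
    else if (node->prefetch_target >= target_prefetch_pages / 2)
        node->prefetch_target = target_prefetch_pages;
    else if (node->prefetch_target > 0)
        node->prefetch_target *= 2;
    else
        node->prefetch_target++;

So in steady state the number of reads actually in flight is driven by how quickly the scan consumes pages, not directly by the target itself.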
That doesn't match any of the empirical tests I did at the time. I posted
graphs of the throughput for varying numbers of spindles with varying
amounts of prefetch. In every case more prefetching increases throughput
up to N times the single platter throughput, where N was the number of
spindles.

There can be a small speedup from overlapping CPU with I/O but that's a
really small effect. At most that can be a single request, and only very
rarely would the amount of CPU time be even a moderate fraction of the
I/O latency. The only case where they're comparable is when you're
reading sequentially, and then hopefully you wouldn't be using Postgres
prefetching at all, which is only really intended to help random I/O.

On 2 Sep 2015 23:38, "Andres Freund" <andres@anarazel.de> wrote:
> On 2015-09-02 19:49:13 +0100, Greg Stark wrote:
> > I can take the blame for this formula.
> >
> > It's called the "Coupon Collector Problem". If you get a random
> > coupon from a set of n possible coupons, how many random coupons would
> > you have to collect before you expect to have at least one of each.
>
> My point is that that's just the entirely wrong way to model
> prefetching. Prefetching can be massively beneficial even if you only
> have a single platter! Even if there were no queues on the hardware or
> OS level! Concurrency isn't the right way to look at prefetching.
>
> You need to prefetch so far ahead that you'll never block on reading
> heap pages - and that's only the case if processing the next N heap
> blocks takes longer than the prefetch of the N+1 th page. That doesn't
> mean there continously have to be N+1 prefetches in progress - in fact
> that actually often will only be the case for the first few, after that
> you hopefully are bottlnecked on CPU.
>
> If you additionally take into account hardware realities where you have
> multiple platters, multiple spindles, command queueing etc, that's even
> more true. A single rotation of a single platter with command queuing
> can often read several non consecutive blocks if they're on a similar
>
> - Andres
On Wed, Sep 2, 2015 at 5:38 PM, Andres Freund <andres@anarazel.de> wrote: > If you additionally take into account hardware realities where you have > multiple platters, multiple spindles, command queueing etc, that's even > more true. A single rotation of a single platter with command queuing > can often read several non consecutive blocks if they're on a similar Yeah. And in the case of solid state disks, it's really a much more simple case of, "synchronously reading from the disk block by block does not fully utilize the drive because of various introduced latencies". I find this talk of platters and spindles to be somewhat baroque; for a 200$ part I have to work pretty hard to max out the drive when reading and I'm still not completely sure if it's the drive itself, postgres, cpu, or sata interface bottlenecking me. This will require a rethink of e_i_o configuration; in the old days there were physical limitations of the drives that were in the way regardless of the software stack but we are in a new era, I think. I'm convinced prefetching works and we're going to want to aggressively prefetch anything and everything possible. SSD controllers (at least the intel ones) are very smart. merlin
On Thu, Sep 3, 2015 at 2:13 PM, Merlin Moncure <mmoncure@gmail.com> wrote: > I find this talk of platters and spindles to be somewhat > baroque; for a 200$ part I have to work pretty hard to max out the > drive when reading and I'm still not completely sure if it's the drive > itself, postgres, cpu, or sata interface bottlenecking me. This will > require a rethink of e_i_o configuration; in the old days there were > physical limitations of the drives that were in the way regardless of > the software stack but we are in a new era, I think. I'm convinced > prefetching works and we're going to want to aggressively prefetch > anything and everything possible. SSD controllers (at least the intel > ones) are very smart. Wouldn't SSDs need much *less* aggressive prefetching? There's still latency and there are multiple I/O channels so they will still need some. But spinning media gives latencies measured in milliseconds. You can process a lot of tuples in milliseconds. If you have a hundred spindles you want them all busy doing seeks because in the 5ms it takes them to do that you can proess all the results on a single cpu and the rest of time is spend waiting. When your media has latency on the order of microseconds then you only need to have a small handful of I/O requests in flight to keep your processor busy. -- greg
On Fri, Sep 4, 2015 at 05:21:38PM +0100, Greg Stark wrote: > Wouldn't SSDs need much *less* aggressive prefetching? There's still > latency and there are multiple I/O channels so they will still need > some. But spinning media gives latencies measured in milliseconds. You > can process a lot of tuples in milliseconds. If you have a hundred > spindles you want them all busy doing seeks because in the 5ms it > takes them to do that you can proess all the results on a single cpu > and the rest of time is spend waiting. > > When your media has latency on the order of microseconds then you only > need to have a small handful of I/O requests in flight to keep your > processor busy. Well, there is still the processing time of getting that data ready. All I know is that people have reported that prefetching is even more useful for SSDs. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
On 2015-09-04 17:21:38 +0100, Greg Stark wrote: > Wouldn't SSDs need much *less* aggressive prefetching? There's still > latency and there are multiple I/O channels so they will still need > some. But spinning media gives latencies measured in milliseconds. You > can process a lot of tuples in milliseconds. If you have a hundred > spindles you want them all busy doing seeks because in the 5ms it > takes them to do that you can proess all the results on a single cpu > and the rest of time is spend waiting. > > When your media has latency on the order of microseconds then you only > need to have a small handful of I/O requests in flight to keep your > processor busy. Most(?) SSDs have latencies between 0.1 and 0.4 ms for individual random reads, often significantly more when a lot of IO is going on. In that time you can still process a good number of pages on the CPU level. Sure, there's few workloads where you need dozens of SSDs to parallelize random reads like in the rotating media days. But that doesn't mean you don't need prefetching? Greetings, Andres Freund
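Putting rough numbers on that argument: the prefetch distance needed to avoid blocking is roughly the device read latency divided by the per-page processing time. The constants below are assumptions for illustration, not measurements:

    #include <math.h>

    static int
    needed_prefetch_distance(double read_latency_us, double cpu_per_page_us)
    {
        /* pages that must already be in flight when we start consuming them */
        return (int) ceil(read_latency_us / cpu_per_page_us);
    }

    /* e.g. 200us of SSD random-read latency and 10us of CPU per heap page
     * suggests keeping ~20 pages in flight; a 5ms seek on spinning rust
     * suggests hundreds. */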
On Fri, Sep 4, 2015 at 11:21 AM, Greg Stark <stark@mit.edu> wrote: > On Thu, Sep 3, 2015 at 2:13 PM, Merlin Moncure <mmoncure@gmail.com> wrote: >> I find this talk of platters and spindles to be somewhat >> baroque; for a 200$ part I have to work pretty hard to max out the >> drive when reading and I'm still not completely sure if it's the drive >> itself, postgres, cpu, or sata interface bottlenecking me. This will >> require a rethink of e_i_o configuration; in the old days there were >> physical limitations of the drives that were in the way regardless of >> the software stack but we are in a new era, I think. I'm convinced >> prefetching works and we're going to want to aggressively prefetch >> anything and everything possible. SSD controllers (at least the intel >> ones) are very smart. > > > Wouldn't SSDs need much *less* aggressive prefetching? There's still > latency and there are multiple I/O channels so they will still need > some. But spinning media gives latencies measured in milliseconds. You > can process a lot of tuples in milliseconds. If you have a hundred > spindles you want them all busy doing seeks because in the 5ms it > takes them to do that you can proess all the results on a single cpu > and the rest of time is spend waiting. > > When your media has latency on the order of microseconds then you only > need to have a small handful of I/O requests in flight to keep your > processor busy. I'm not sure I agree with that. 100 micosecond latency is still a pretty big deal when memory latency is measured in nanoseconds. This is why we have pcie storage devices. (if anyone has a fast pice flash device I'd be interested in seing some e_i_o tests...I bet they'd still show some impact but it'd be less). For spinning media, any random i/o workload on large datasets is a freight train to 100% iowait. As each device has to process each data set synchronously minus what small gains can be eeked out by reordering and combining writes. This drives data transfer rates under a megabyte per second. Flash devices OTOH are essentially giant raid 0 devices of electronic media underneath the flash controller. Each device can functionally do many read operations in parallel so it makes perfect sense that they perform better when given multiple overlapping requests; unlike spinning media they can actually make use of that information and those assumptions are well supported by measurement. Flash is getting better all the time and storage technologies that remove its one weakness, random writes, appear to be right around the corner. The age of storage being a bottleneck for database systems without massive high cost engineering and optimization has come to an end. This is good for software developers and especially good for databases because most of the perceived mythology of databases being 'slow' are in fact storage issues. With that problem gone focus will return to clean data processing models which is where database systems excel with their many decades of refinement. merlin
Hi, Please find attached a v2 of the patch. See below for changes. On 02/09/2015 15:53, Andres Freund wrote: > > Hi, > > On 2015-07-18 12:17:39 +0200, Julien Rouhaud wrote: >> I didn't know that the thread must exists on -hackers to be able to add >> a commitfest entry, so I transfer the thread here. > > Please, in the future, also update the title of the thread to something > fitting. > >> @@ -539,6 +541,9 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags) >> { >> BitmapHeapScanState *scanstate; >> Relation currentRelation; >> +#ifdef USE_PREFETCH >> + int new_io_concurrency; >> +#endif >> >> /* check for unsupported flags */ >> Assert(!(eflags & (EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK))); >> @@ -598,6 +603,25 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags) >> */ >> currentRelation = ExecOpenScanRelation(estate, node->scan.scanrelid, eflags); >> >> +#ifdef USE_PREFETCH >> + /* check if the effective_io_concurrency has been overloaded for the >> + * tablespace storing the relation and compute the target_prefetch_pages, >> + * or just get the current target_prefetch_pages >> + */ >> + new_io_concurrency = get_tablespace_io_concurrency( >> + currentRelation->rd_rel->reltablespace); >> + >> + >> + scanstate->target_prefetch_pages = target_prefetch_pages; >> + >> + if (new_io_concurrency != effective_io_concurrency) >> + { >> + double prefetch_pages; >> + if (compute_io_concurrency(new_io_concurrency, &prefetch_pages)) >> + scanstate->target_prefetch_pages = rint(prefetch_pages); >> + } >> +#endif > > Maybe it's just me - but imo there should be as few USE_PREFETCH > dependant places in the code as possible. It'll just be 0 when not > supported, that's fine? Especially changing the size of externally > visible structs depending on a configure detected ifdef seems wrong to > me. > I removed these ifdefs, and the more problematic one in the struct. >> +bool >> +compute_io_concurrency(int io_concurrency, double *target_prefetch_pages) >> +{ >> + double new_prefetch_pages = 0.0; >> + int i; >> + >> + /* make sure the io_concurrency value is correct, it may have been forced >> + * with a pg_tablespace UPDATE >> + */ > > Nitpick: Wrong comment style (/* stands on its own line). > >> + if (io_concurrency > MAX_IO_CONCURRENCY) >> + io_concurrency = MAX_IO_CONCURRENCY; >> + >> + /*---------- >> + * The user-visible GUC parameter is the number of drives (spindles), >> + * which we need to translate to a number-of-pages-to-prefetch target. >> + * The target value is stashed in *extra and then assigned to the actual >> + * variable by assign_effective_io_concurrency. >> + * >> + * The expected number of prefetch pages needed to keep N drives busy is: >> + * >> + * drives | I/O requests >> + * -------+---------------- >> + * 1 | 1 >> + * 2 | 2/1 + 2/2 = 3 >> + * 3 | 3/1 + 3/2 + 3/3 = 5 1/2 >> + * 4 | 4/1 + 4/2 + 4/3 + 4/4 = 8 1/3 >> + * n | n * H(n) > > I know you just moved this code. But: I don't buy this formula. Like at > all. Doesn't queuing and reordering entirely invalidate the logic here? > Changing the formula, or changing the GUC to a number of pages to prefetch is still discussed, so no change here. > Perhaps more relevantly: Imo nodeBitmapHeapscan.c is the wrong place for > this. bufmgr.c maybe? > Moved to bufmgr.c > You also didn't touch > /* > * How many buffers PrefetchBuffer callers should try to stay ahead of their > * ReadBuffer calls by. This is maintained by the assign hook for > * effective_io_concurrency. Zero means "never prefetch". 
> */ > int target_prefetch_pages = 0; > which surely doesn't make sense anymore after these changes. > > But do we even need that variable now? I slighty updated the comment. If the table doesn't belong to a tablespace with an overloaded effective_io_concurrency, keeping this pre-computed target_prefetch_pages can save a few cycles on each execution, so I think it's better to keep it. > >> diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h >> index dc167f9..57008fc 100644 >> --- a/src/include/utils/guc.h >> +++ b/src/include/utils/guc.h >> @@ -26,6 +26,9 @@ >> #define MAX_KILOBYTES (INT_MAX / 1024) >> #endif >> >> +/* upper limit for effective_io_concurrency */ >> +#define MAX_IO_CONCURRENCY 1000 >> + >> /* >> * Automatic configuration file name for ALTER SYSTEM. >> * This file will be used to store values of configuration parameters >> @@ -256,6 +259,8 @@ extern int temp_file_limit; >> >> extern int num_temp_buffers; >> >> +extern int effective_io_concurrency; >> + > > target_prefetch_pages is declared in bufmgr.h - that seems like a better > place for these. > Moved to bufmgr.h As said in a previous mail, I also fixed a problem when having settings other than effective_io_concurrency for a tablespace lead to ignore the regular effective_io_concurrency. I also added the forgotten lock level (AccessExclusiveLock) for this tablespace setting, which was leading to a failed assert during initdb. Regards. -- Julien Rouhaud http://dalibo.com - http://dalibo.org
Julien Rouhaud wrote:
> Hi,
>
> Please find attached a v2 of the patch. See below for changes.

Pushed after smallish tweaks. Please test to verify I didn't break
anything.

(It's a pity that we can't add a regression test with a value other than
0.)

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 08/09/2015 18:00, Alvaro Herrera wrote:
> Julien Rouhaud wrote:
>> Hi,
>>
>> Please find attached a v2 of the patch. See below for changes.
>
> Pushed after smallish tweaks. Please test to verify I didn't break
> anything.
>

I just tried with all the cases I could think of, everything works fine.
Thanks!

> (It's a pity that we can't add a regression test with a value other than
> 0.)
>

--
Julien Rouhaud
http://dalibo.com - http://dalibo.org