Thread: maintenance_work_mem and CREATE INDEX time
Hello,

While studying the effect of maintenance_work_mem on the time taken by CREATE INDEX, I observed that for values of maintenance_work_mem below the threshold at which an internal sort is performed, the time taken by CREATE INDEX increases as maintenance_work_mem increases.

My guess is that for all those values an external sort is chosen at some point, and the larger the value of maintenance_work_mem, the later the switch to external sort is made, causing CREATE INDEX to take longer. That is, a smaller value of maintenance_work_mem would be preferable when an external sort is going to be performed anyway. Does that make sense?

-- Amit Langote
On Tue, Jul 23, 2013 at 1:11 PM, Amit Langote <amitlangote09@gmail.com> wrote:
> Hello,
>
> While studying the effect of maintenance_work_mem on the time taken
> by CREATE INDEX, I observed that for values of maintenance_work_mem
> below the threshold at which an internal sort is performed, the time
> taken by CREATE INDEX increases as maintenance_work_mem increases.
>
> My guess is that for all those values an external sort is chosen at
> some point, and the larger the value of maintenance_work_mem, the
> later the switch to external sort is made, causing CREATE INDEX to
> take longer. That is, a smaller value of maintenance_work_mem would
> be preferable when an external sort is going to be performed anyway.
> Does that make sense?

Upon further investigation, I found that the delay in switching to external sort caused by a larger value of maintenance_work_mem is small compared to the total time of CREATE INDEX. So, plotting the results for a number of maintenance_work_mem values shows that this effect is negligible.

Are there any other parts of the external sort logic that might make it slower with increasing values of maintenance_work_mem? It seems the merge order and the number of tapes are related to state->allowedMem. Does that mean external sort is affected by the value of maintenance_work_mem in a way roughly similar to the above?

-- Amit Langote
On Mon, Jul 22, 2013 at 9:11 PM, Amit Langote <amitlangote09@gmail.com> wrote:
> Hello,
>
> While studying the effect of maintenance_work_mem on the time taken
> by CREATE INDEX, I observed that for values of maintenance_work_mem
> below the threshold at which an internal sort is performed, the time
> taken by CREATE INDEX increases as maintenance_work_mem increases.
>
> My guess is that for all those values an external sort is chosen at
> some point, and the larger the value of maintenance_work_mem, the
> later the switch to external sort is made, causing CREATE INDEX to
> take longer. That is, a smaller value of maintenance_work_mem would
> be preferable when an external sort is going to be performed anyway.
> Does that make sense?

The heap structure used in external sorts is cache-unfriendly. The bigger the heap used, the more this unfriendliness becomes apparent. And the bigger maintenance_work_mem, the bigger the heap used.

The bigger heap also means you have fewer "runs" to merge in the external sort. However, as long as the number of runs still fits in the same number of merge passes, this is generally not a meaningful difference.

Ideally the planner (or something) would figure out how much memory would be needed to complete an external sort in just one external pass, and then lower the effective maintenance_work_mem to that amount. But that would be hard to do with complete precision, and the consequences of getting it wrong are asymmetric.

(More thoroughly, it would figure out the number of passes needed for the given maintenance_work_mem, and then lower the effective maintenance_work_mem to the smallest value that gives the same number of passes. But for nearly all practical situations, I think the number of passes for an index build is going to be 0 or 1.)

Cheers, Jeff
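[The merge-pass argument can be illustrated with a rough model. This is only a sketch, not PostgreSQL's actual tuplesort logic: the fixed merge_order here is a hypothetical constant, whereas in PostgreSQL the number of tapes itself depends on the memory budget.]

```python
import math

def estimated_merge_passes(data_mb, work_mem_mb, merge_order=6):
    """Rough model of an external sort: the number of initial runs is
    about data_size / work_mem (replacement selection often roughly
    doubles run length, ignored here), and each merge pass reduces
    the run count by a factor of merge_order."""
    runs = max(1, math.ceil(data_mb / work_mem_mb))
    if runs == 1:
        return 0  # everything fits in memory: internal sort, no merging
    return math.ceil(math.log(runs, merge_order))

# Doubling work_mem halves the run count, yet usually leaves the
# number of merge passes unchanged -- hence "not a meaningful
# difference" as long as the pass count stays the same.
for mem in (16, 32, 64, 128):
    print(mem, "MB ->", estimated_merge_passes(4096, mem), "merge passes")
```

In this model a 4 GB sort needs the same number of passes at 32 MB and 64 MB of sort memory, which is the regime where a bigger setting buys nothing in merging but still enlarges the cache-hostile heap.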
On Tue, Jul 23, 2013 at 1:23 AM, Amit Langote <amitlangote09@gmail.com> wrote:
> On Tue, Jul 23, 2013 at 1:11 PM, Amit Langote <amitlangote09@gmail.com> wrote:
>> Hello,
>>
>> While studying the effect of maintenance_work_mem on the time taken
>> by CREATE INDEX, I observed that for values of maintenance_work_mem
>> below the threshold at which an internal sort is performed, the time
>> taken by CREATE INDEX increases as maintenance_work_mem increases.
>>
>> My guess is that for all those values an external sort is chosen at
>> some point, and the larger the value of maintenance_work_mem, the
>> later the switch to external sort is made, causing CREATE INDEX to
>> take longer. That is, a smaller value of maintenance_work_mem would
>> be preferable when an external sort is going to be performed anyway.
>> Does that make sense?
>
> Upon further investigation, I found that the delay in switching to
> external sort caused by a larger value of maintenance_work_mem is
> small compared to the total time of CREATE INDEX.

If you are using trace_sort to report that, it reports the switch as happening as soon as it runs out of memory.

At that point, all we have been doing is reading tuples into memory. The time it takes to do that will depend on maintenance_work_mem, because that affects how many tuples fit in memory. But all the rest of the tuples need to be read sooner or later anyway, so pushing more of them to later doesn't improve things overall; it just shifts the timing around.

After it reports the switch, it still needs to heapify the existing in-memory tuples before the tape sort proper can begin. This is where the true lost opportunities start to arise, as the large heap starts driving cache misses which would not happen at all under different settings.

Once the existing tuples are heapified, it then continues to use the heap to pop tuples from it to write out to "tape", and to push newly read tuples onto it. This also suffers lost opportunities.

Once all the tuples are written out and it starts merging, the large maintenance_work_mem is no longer a penalty, as the new heap is limited by the number of tapes, which is almost always much smaller. In fact this stage will actually be faster, but not by enough to make up for the earlier slowdown.

So it is not surprising that the time before the switch is reported is a small part of the overall time difference.

Cheers, Jeff
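[The run-forming phase described above -- pop the smallest tuple from the heap out to "tape", push the next input tuple onto it -- is classic replacement selection. A minimal sketch, with simplifications that are mine (single comparable values instead of tuples, in-memory lists instead of tapes):]

```python
import heapq
import itertools
import random

def replacement_selection(values, heap_size):
    """Classic replacement selection: keep a bounded heap of
    (run, value) pairs, pop the smallest value into the current run,
    and refill from the input.  An incoming value smaller than the
    value just written cannot join the current (sorted) run, so it
    is tagged for the next run."""
    source = iter(values)
    heap = [(0, v) for v in itertools.islice(source, heap_size)]
    heapq.heapify(heap)
    runs = []
    while heap:
        run, v = heapq.heappop(heap)
        if run == len(runs):
            runs.append([])      # first value of a new run
        runs[run].append(v)
        nxt = next(source, None)
        if nxt is not None:
            heapq.heappush(heap, (run if nxt >= v else run + 1, nxt))
    return runs

random.seed(1)
data = [random.random() for _ in range(10000)]
for size in (10, 100, 1000):
    print(size, "->", len(replacement_selection(data, size)), "runs")
```

On random input each run averages about twice the heap size, so a bigger heap means fewer, longer runs; the cost is that every one of the N tuples is pushed through a deeper, less cache-friendly heap.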
On Wed, Jul 24, 2013 at 6:02 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Tue, Jul 23, 2013 at 1:23 AM, Amit Langote <amitlangote09@gmail.com> wrote:
>> On Tue, Jul 23, 2013 at 1:11 PM, Amit Langote <amitlangote09@gmail.com> wrote:
>>> Hello,
>>>
>>> While studying the effect of maintenance_work_mem on the time taken
>>> by CREATE INDEX, I observed that for values of maintenance_work_mem
>>> below the threshold at which an internal sort is performed, the time
>>> taken by CREATE INDEX increases as maintenance_work_mem increases.
>>>
>>> My guess is that for all those values an external sort is chosen at
>>> some point, and the larger the value of maintenance_work_mem, the
>>> later the switch to external sort is made, causing CREATE INDEX to
>>> take longer. That is, a smaller value of maintenance_work_mem would
>>> be preferable when an external sort is going to be performed anyway.
>>> Does that make sense?
>>
>> Upon further investigation, I found that the delay in switching to
>> external sort caused by a larger value of maintenance_work_mem is
>> small compared to the total time of CREATE INDEX.
>
> If you are using trace_sort to report that, it reports the switch as
> happening as soon as it runs out of memory.
>
> At that point, all we have been doing is reading tuples into memory.
> The time it takes to do that will depend on maintenance_work_mem,
> because that affects how many tuples fit in memory. But all the rest
> of the tuples need to be read sooner or later anyway, so pushing more
> of them to later doesn't improve things overall; it just shifts the
> timing around.
>
> After it reports the switch, it still needs to heapify the existing
> in-memory tuples before the tape sort proper can begin. This is where
> the true lost opportunities start to arise, as the large heap starts
> driving cache misses which would not happen at all under different
> settings.
>
> Once the existing tuples are heapified, it then continues to use the
> heap to pop tuples from it to write out to "tape", and to push newly
> read tuples onto it. This also suffers lost opportunities.
>
> Once all the tuples are written out and it starts merging, the large
> maintenance_work_mem is no longer a penalty, as the new heap is
> limited by the number of tapes, which is almost always much smaller.
> In fact this stage will actually be faster, but not by enough to make
> up for the earlier slowdown.
>
> So it is not surprising that the time before the switch is reported
> is a small part of the overall time difference.

So, is it the actual sorting (before merging) that suffers with a larger maintenance_work_mem? I am sorry, but I cannot yet grasp the complexity of the external sort code, so all I can say is that during an external sort a smaller value of maintenance_work_mem is beneficial (based on my observations in tests). But how that follows from what is going on in the implementation of external sort is still something I am working on understanding.

-- Amit Langote
On Wed, Jul 24, 2013 at 11:30 AM, Amit Langote <amitlangote09@gmail.com> wrote:
> On Wed, Jul 24, 2013 at 6:02 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> If you are using trace_sort to report that, it reports the switch as
>> happening as soon as it runs out of memory.
>>
>> At that point, all we have been doing is reading tuples into memory.
>> The time it takes to do that will depend on maintenance_work_mem,
>> because that affects how many tuples fit in memory. But all the rest
>> of the tuples need to be read sooner or later anyway, so pushing more
>> of them to later doesn't improve things overall; it just shifts the
>> timing around.
>>
>> After it reports the switch, it still needs to heapify the existing
>> in-memory tuples before the tape sort proper can begin. This is where
>> the true lost opportunities start to arise, as the large heap starts
>> driving cache misses which would not happen at all under different
>> settings.
>>
>> Once the existing tuples are heapified, it then continues to use the
>> heap to pop tuples from it to write out to "tape", and to push newly
>> read tuples onto it. This also suffers lost opportunities.
>>
>> Once all the tuples are written out and it starts merging, the large
>> maintenance_work_mem is no longer a penalty, as the new heap is
>> limited by the number of tapes, which is almost always much smaller.
>> In fact this stage will actually be faster, but not by enough to make
>> up for the earlier slowdown.
>>
>> So it is not surprising that the time before the switch is reported
>> is a small part of the overall time difference.
>
> So, is it the actual sorting (before merging) that suffers with a
> larger maintenance_work_mem? I am sorry, but I cannot yet grasp the
> complexity of the external sort code, so all I can say is that during
> an external sort a smaller value of maintenance_work_mem is beneficial
> (based on my observations in tests). But how that follows from what is
> going on in the implementation of external sort is still something I
> am working on understanding.

Or does the increased CREATE INDEX time follow from something else altogether (not any part of external sort)? That may well be another question. Since we have to relate it to maintenance_work_mem, though, the first thing I could think of was to look at the sorting part.

-- Amit Langote
On Wed, Jul 24, 2013 at 3:20 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Mon, Jul 22, 2013 at 9:11 PM, Amit Langote <amitlangote09@gmail.com> wrote:
>> Hello,
>>
>> While studying the effect of maintenance_work_mem on the time taken
>> by CREATE INDEX, I observed that for values of maintenance_work_mem
>> below the threshold at which an internal sort is performed, the time
>> taken by CREATE INDEX increases as maintenance_work_mem increases.
>>
>> My guess is that for all those values an external sort is chosen at
>> some point, and the larger the value of maintenance_work_mem, the
>> later the switch to external sort is made, causing CREATE INDEX to
>> take longer. That is, a smaller value of maintenance_work_mem would
>> be preferable when an external sort is going to be performed anyway.
>> Does that make sense?
>
> The heap structure used in external sorts is cache-unfriendly. The
> bigger the heap used, the more this unfriendliness becomes apparent.
> And the bigger maintenance_work_mem, the bigger the heap used.
>
> The bigger heap also means you have fewer "runs" to merge in the
> external sort. However, as long as the number of runs still fits in
> the same number of merge passes, this is generally not a meaningful
> difference.

Do fewer runs mean more time (in whichever phase of external sort)?

-- Amit Langote
On Tue, Jul 23, 2013 at 10:56 PM, Amit Langote <amitlangote09@gmail.com> wrote:
> On Wed, Jul 24, 2013 at 3:20 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
>>
>> The heap structure used in external sorts is cache-unfriendly. The
>> bigger the heap used, the more this unfriendliness becomes apparent.
>> And the bigger maintenance_work_mem, the bigger the heap used.
>>
>> The bigger heap also means you have fewer "runs" to merge in the
>> external sort. However, as long as the number of runs still fits in
>> the same number of merge passes, this is generally not a meaningful
>> difference.
>
> Do fewer runs mean more time (in whichever phase of external sort)?

That's complicated. In general fewer runs are faster, as the heap used at that stage is smaller, but this difference is small. If you can get the number of runs down to a level that needs fewer passes over the data, that will make things faster; but this is rare. If the sort isn't already being done in a single pass, then your sort must be huge or your working memory setting is pathologically tiny.

There is a rough conservation of total heap layers between the two phases: the initial tuple heap, and the merge-stage heap of tapes. Say, for example, that by increasing work_mem you can increase the initial heap from 25 layers to 27 layers, while decreasing the merge-phase heap from 5 layers to 3 layers. The total number of comparisons for the entire sort will be about the same, but the comparisons across the 27-layer heap are much more likely to need to go to main RAM, rather than come from L3 cache (or whatever the cache level is).

Cheers, Jeff
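[The "conservation of heap layers" point can be made concrete with a back-of-the-envelope count. This is a simplified model of my own, not a measurement: each tuple traverses roughly log2(tuples held in memory) layers in the run-forming heap and log2(number of runs) layers in the merge heap, so the sum stays near log2(total tuples) while work_mem only shifts layers between the cache-hostile big heap and the cache-friendly small one.]

```python
import math

def heap_layers(n_tuples, tuples_in_memory):
    """Return (initial, merge, total) heap layers per tuple: the
    run-forming heap holds tuples_in_memory entries, while the merge
    heap holds one entry per run (one run per memory-load of input)."""
    runs = max(1, math.ceil(n_tuples / tuples_in_memory))
    initial = math.log2(tuples_in_memory)
    merge = math.log2(runs) if runs > 1 else 0.0
    return initial, merge, initial + merge

# 100M tuples: growing work_mem moves layers from the small merge
# heap into the large initial heap; the total stays roughly constant.
for mem_tuples in (2**20, 2**24, 2**26):
    initial, merge, total = heap_layers(100_000_000, mem_tuples)
    print(f"{mem_tuples:>10} in memory: {initial:.1f} + {merge:.1f} = {total:.1f} layers")
```

This mirrors Jeff's 25-to-27 vs 5-to-3 example: the comparison count barely moves, but the share of comparisons done in a heap too large for cache grows with work_mem.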
Hi Jeff,

On Tue, Jul 30, 2013 at 3:25 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Tue, Jul 23, 2013 at 10:56 PM, Amit Langote <amitlangote09@gmail.com> wrote:
>> On Wed, Jul 24, 2013 at 3:20 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
>>>
>>> The heap structure used in external sorts is cache-unfriendly. The
>>> bigger the heap used, the more this unfriendliness becomes apparent.
>>> And the bigger maintenance_work_mem, the bigger the heap used.
>>>
>>> The bigger heap also means you have fewer "runs" to merge in the
>>> external sort. However, as long as the number of runs still fits in
>>> the same number of merge passes, this is generally not a meaningful
>>> difference.
>>
>> Do fewer runs mean more time (in whichever phase of external sort)?
>
> That's complicated. In general fewer runs are faster, as the heap
> used at that stage is smaller, but this difference is small. If you
> can get the number of runs down to a level that needs fewer passes
> over the data, that will make things faster; but this is rare. If
> the sort isn't already being done in a single pass, then your sort
> must be huge or your working memory setting is pathologically tiny.
>
> There is a rough conservation of total heap layers between the two
> phases: the initial tuple heap, and the merge-stage heap of tapes.
> Say, for example, that by increasing work_mem you can increase the
> initial heap from 25 layers to 27 layers, while decreasing the merge
> phase heap from 5 layers to 3 layers. The total number of comparisons
> for the entire sort will be about the same, but the comparisons across
> the 27-layer heap are much more likely to need to go to main RAM,
> rather than come from L3 cache (or whatever the cache level is).

If my assumption that fewer runs mean longer runs is plausible, would it be correct to think that the time for the performsort step (performsort_done - performsort_starting) increases when such longer runs are created due to a larger workMem?

-- Amit Langote