Thread: maintenance_work_mem and CREATE INDEX time
Hello,

While studying the effect of maintenance_work_mem on the time taken by CREATE INDEX, I observed that for values of maintenance_work_mem below the threshold at which an internal sort is performed, the time taken by CREATE INDEX increases as maintenance_work_mem increases.

My guess is that for all those values an external sort is chosen at some point, and the larger the value of maintenance_work_mem, the later the switch to external sort is made, causing CREATE INDEX to take longer. That is, a smaller value of maintenance_work_mem would be preferable when an external sort is going to be performed anyway. Does that make sense?

-- Amit Langote
On Tue, Jul 23, 2013 at 1:11 PM, Amit Langote <amitlangote09@gmail.com> wrote:
> Hello,
>
> While studying the effect of maintenance_work_mem on the time taken
> by CREATE INDEX, I observed that for values of maintenance_work_mem
> below the threshold at which an internal sort is performed, the time
> taken by CREATE INDEX increases as maintenance_work_mem increases.
>
> My guess is that for all those values an external sort is chosen at
> some point, and the larger the value of maintenance_work_mem, the
> later the switch to external sort is made, causing CREATE INDEX to
> take longer. That is, a smaller value of maintenance_work_mem would
> be preferable when an external sort is going to be performed anyway.
> Does that make sense?

Upon further investigation, I found that the delay in switching to external sort caused by a larger value of maintenance_work_mem is small compared to the total time of CREATE INDEX. So, plotting the results for a number of maintenance_work_mem values shows that this effect is negligible.

Are there any other parts of the external sort logic that might make it slower with increasing values of maintenance_work_mem? It seems the merge order and the number of tapes are related to state->allowedMem. Does that mean external sort is affected by the value of maintenance_work_mem in a way roughly similar to the above?

-- Amit Langote
On Mon, Jul 22, 2013 at 9:11 PM, Amit Langote <amitlangote09@gmail.com> wrote:
> Hello,
>
> While studying the effect of maintenance_work_mem on the time taken
> by CREATE INDEX, I observed that for values of maintenance_work_mem
> below the threshold at which an internal sort is performed, the time
> taken by CREATE INDEX increases as maintenance_work_mem increases.
>
> My guess is that for all those values an external sort is chosen at
> some point, and the larger the value of maintenance_work_mem, the
> later the switch to external sort is made, causing CREATE INDEX to
> take longer. That is, a smaller value of maintenance_work_mem would
> be preferable when an external sort is going to be performed anyway.
> Does that make sense?

The heap structure used in external sorts is cache-unfriendly. The bigger the heap used, the more this unfriendliness becomes apparent. And the bigger maintenance_work_mem, the bigger the heap used.

The bigger heap also means you have fewer "runs" to merge in the external sort. However, as long as the number of runs still fits in the same number of merge passes, this is generally not a meaningful difference.

Ideally the planner (or something) would figure out how much memory would be needed to complete an external sort in just one external pass, and then lower the effective maintenance_work_mem to that amount. But that would be hard to do with complete precision, and the consequences of getting it wrong are asymmetric.

(More thoroughly, it would figure out the number of passes needed for the given maintenance_work_mem, and then lower the effective maintenance_work_mem to the smallest value that gives the same number of passes. But for nearly all practical situations, I think the number of passes for an index build is going to be 0 or 1.)

Cheers, Jeff
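[The merge-pass argument can be illustrated with a rough model. This is only a sketch, not PostgreSQL's actual tuplesort logic: the fixed merge_order here is a hypothetical constant, whereas in PostgreSQL the number of tapes itself depends on the memory budget.]

```python
import math

def estimated_merge_passes(data_mb, work_mem_mb, merge_order=6):
    """Rough model of an external sort: the number of initial runs is
    about data_size / work_mem (replacement selection often roughly
    doubles run length, ignored here), and each merge pass reduces
    the run count by a factor of merge_order."""
    runs = max(1, math.ceil(data_mb / work_mem_mb))
    if runs == 1:
        return 0  # everything fits in memory: internal sort, no merging
    return math.ceil(math.log(runs, merge_order))

# Doubling work_mem halves the run count, yet usually leaves the
# number of merge passes unchanged -- hence "not a meaningful
# difference" as long as the pass count stays the same.
for mem in (16, 32, 64, 128):
    print(mem, "MB ->", estimated_merge_passes(4096, mem), "merge passes")
```

In this model a 4 GB sort needs the same number of passes at 32 MB and 64 MB of sort memory, which is the regime where a bigger setting buys nothing in merging but still enlarges the cache-hostile heap.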
On Tue, Jul 23, 2013 at 1:23 AM, Amit Langote <amitlangote09@gmail.com> wrote:
> On Tue, Jul 23, 2013 at 1:11 PM, Amit Langote <amitlangote09@gmail.com> wrote:
>> Hello,
>>
>> While studying the effect of maintenance_work_mem on the time taken
>> by CREATE INDEX, I observed that for values of maintenance_work_mem
>> below the threshold at which an internal sort is performed, the time
>> taken by CREATE INDEX increases as maintenance_work_mem increases.
>>
>> My guess is that for all those values an external sort is chosen at
>> some point, and the larger the value of maintenance_work_mem, the
>> later the switch to external sort is made, causing CREATE INDEX to
>> take longer. That is, a smaller value of maintenance_work_mem would
>> be preferable when an external sort is going to be performed anyway.
>> Does that make sense?
>
> Upon further investigation, I found that the delay in switching to
> external sort caused by a larger value of maintenance_work_mem is
> small compared to the total time of CREATE INDEX.

If you are using trace_sort to report that, it reports the switch as happening as soon as it runs out of memory.

At that point, all we have been doing is reading tuples into memory. The time it takes to do that will depend on maintenance_work_mem, because that affects how many tuples fit in memory. But all the rest of the tuples need to be read sooner or later anyway, so pushing more of them to later doesn't improve things overall; it just shifts the timing around.

After it reports the switch, it still needs to heapify the existing in-memory tuples before the tape sort proper can begin. This is where the true lost opportunities start to arise, as the large heap starts driving cache misses which would not happen at all under different settings.

Once the existing tuples are heapified, it then continues to use the heap to pop tuples from it to write out to "tape", and to push newly read tuples onto it. This also suffers lost opportunities.

Once all the tuples are written out and it starts merging, the large maintenance_work_mem is no longer a penalty, as the new heap is limited by the number of tapes, which is almost always much smaller. In fact this stage will actually be faster, but not by enough to make up for the earlier slowdown.

So it is not surprising that the time before the switch is reported is a small part of the overall time difference.

Cheers, Jeff
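[The run-forming phase described above -- pop the smallest tuple from the heap out to "tape", push the next input tuple onto it -- is classic replacement selection. A minimal sketch, with simplifications that are mine (single comparable values instead of tuples, in-memory lists instead of tapes):]

```python
import heapq
import itertools
import random

def replacement_selection(values, heap_size):
    """Classic replacement selection: keep a bounded heap of
    (run, value) pairs, pop the smallest value into the current run,
    and refill from the input.  An incoming value smaller than the
    value just written cannot join the current (sorted) run, so it
    is tagged for the next run."""
    source = iter(values)
    heap = [(0, v) for v in itertools.islice(source, heap_size)]
    heapq.heapify(heap)
    runs = []
    while heap:
        run, v = heapq.heappop(heap)
        if run == len(runs):
            runs.append([])      # first value of a new run
        runs[run].append(v)
        nxt = next(source, None)
        if nxt is not None:
            heapq.heappush(heap, (run if nxt >= v else run + 1, nxt))
    return runs

random.seed(1)
data = [random.random() for _ in range(10000)]
for size in (10, 100, 1000):
    print(size, "->", len(replacement_selection(data, size)), "runs")
```

On random input each run averages about twice the heap size, so a bigger heap means fewer, longer runs; the cost is that every one of the N tuples is pushed through a deeper, less cache-friendly heap.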
On Wed, Jul 24, 2013 at 6:02 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Tue, Jul 23, 2013 at 1:23 AM, Amit Langote <amitlangote09@gmail.com> wrote:
>> On Tue, Jul 23, 2013 at 1:11 PM, Amit Langote <amitlangote09@gmail.com> wrote:
>>> Hello,
>>>
>>> While studying the effect of maintenance_work_mem on the time taken
>>> by CREATE INDEX, I observed that for values of maintenance_work_mem
>>> below the threshold at which an internal sort is performed, the time
>>> taken by CREATE INDEX increases as maintenance_work_mem increases.
>>>
>>> My guess is that for all those values an external sort is chosen at
>>> some point, and the larger the value of maintenance_work_mem, the
>>> later the switch to external sort is made, causing CREATE INDEX to
>>> take longer. That is, a smaller value of maintenance_work_mem would
>>> be preferable when an external sort is going to be performed anyway.
>>> Does that make sense?
>>
>> Upon further investigation, I found that the delay in switching to
>> external sort caused by a larger value of maintenance_work_mem is
>> small compared to the total time of CREATE INDEX.
>
> If you are using trace_sort to report that, it reports the switch as
> happening as soon as it runs out of memory.
>
> At that point, all we have been doing is reading tuples into memory.
> The time it takes to do that will depend on maintenance_work_mem,
> because that affects how many tuples fit in memory. But all the rest
> of the tuples need to be read sooner or later anyway, so pushing more
> of them to later doesn't improve things overall; it just shifts the
> timing around.
>
> After it reports the switch, it still needs to heapify the existing
> in-memory tuples before the tape sort proper can begin. This is where
> the true lost opportunities start to arise, as the large heap starts
> driving cache misses which would not happen at all under different
> settings.
>
> Once the existing tuples are heapified, it then continues to use the
> heap to pop tuples from it to write out to "tape", and to push newly
> read tuples onto it. This also suffers lost opportunities.
>
> Once all the tuples are written out and it starts merging, the large
> maintenance_work_mem is no longer a penalty, as the new heap is
> limited by the number of tapes, which is almost always much smaller.
> In fact this stage will actually be faster, but not by enough to make
> up for the earlier slowdown.
>
> So it is not surprising that the time before the switch is reported
> is a small part of the overall time difference.

So, is it the actual sorting (before merging) that suffers with a larger maintenance_work_mem? I am sorry, but I cannot yet grasp the complexity of the external sort code, so all I can say is that during an external sort a smaller value of maintenance_work_mem is beneficial (based on my observations in tests). But how that follows from what is going on in the implementation of external sort is still something I am working on understanding.

-- Amit Langote
On Wed, Jul 24, 2013 at 11:30 AM, Amit Langote <amitlangote09@gmail.com> wrote:
> On Wed, Jul 24, 2013 at 6:02 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> If you are using trace_sort to report that, it reports the switch as
>> happening as soon as it runs out of memory.
>>
>> At that point, all we have been doing is reading tuples into memory.
>> The time it takes to do that will depend on maintenance_work_mem,
>> because that affects how many tuples fit in memory. But all the rest
>> of the tuples need to be read sooner or later anyway, so pushing more
>> of them to later doesn't improve things overall; it just shifts the
>> timing around.
>>
>> After it reports the switch, it still needs to heapify the existing
>> in-memory tuples before the tape sort proper can begin. This is where
>> the true lost opportunities start to arise, as the large heap starts
>> driving cache misses which would not happen at all under different
>> settings.
>>
>> Once the existing tuples are heapified, it then continues to use the
>> heap to pop tuples from it to write out to "tape", and to push newly
>> read tuples onto it. This also suffers lost opportunities.
>>
>> Once all the tuples are written out and it starts merging, the large
>> maintenance_work_mem is no longer a penalty, as the new heap is
>> limited by the number of tapes, which is almost always much smaller.
>> In fact this stage will actually be faster, but not by enough to make
>> up for the earlier slowdown.
>>
>> So it is not surprising that the time before the switch is reported
>> is a small part of the overall time difference.
>
> So, is it the actual sorting (before merging) that suffers with a
> larger maintenance_work_mem? I am sorry, but I cannot yet grasp the
> complexity of the external sort code, so all I can say is that during
> an external sort a smaller value of maintenance_work_mem is beneficial
> (based on my observations in tests). But how that follows from what is
> going on in the implementation of external sort is still something I
> am working on understanding.

Or does the increased CREATE INDEX time follow from something else altogether (not any part of external sort)? That may well be another question. Since we have to relate it to maintenance_work_mem, though, the first thing I could think of was to look at the sorting part.

-- Amit Langote
On Wed, Jul 24, 2013 at 3:20 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Mon, Jul 22, 2013 at 9:11 PM, Amit Langote <amitlangote09@gmail.com> wrote:
>> Hello,
>>
>> While studying the effect of maintenance_work_mem on the time taken
>> by CREATE INDEX, I observed that for values of maintenance_work_mem
>> below the threshold at which an internal sort is performed, the time
>> taken by CREATE INDEX increases as maintenance_work_mem increases.
>>
>> My guess is that for all those values an external sort is chosen at
>> some point, and the larger the value of maintenance_work_mem, the
>> later the switch to external sort is made, causing CREATE INDEX to
>> take longer. That is, a smaller value of maintenance_work_mem would
>> be preferable when an external sort is going to be performed anyway.
>> Does that make sense?
>
> The heap structure used in external sorts is cache-unfriendly. The
> bigger the heap used, the more this unfriendliness becomes apparent.
> And the bigger maintenance_work_mem, the bigger the heap used.
>
> The bigger heap also means you have fewer "runs" to merge in the
> external sort. However, as long as the number of runs still fits in
> the same number of merge passes, this is generally not a meaningful
> difference.

Do fewer runs mean more time (in whichever phase of external sort)?

-- Amit Langote
On Tue, Jul 23, 2013 at 10:56 PM, Amit Langote <amitlangote09@gmail.com> wrote:
> On Wed, Jul 24, 2013 at 3:20 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
>>
>> The heap structure used in external sorts is cache-unfriendly. The
>> bigger the heap used, the more this unfriendliness becomes apparent.
>> And the bigger maintenance_work_mem, the bigger the heap used.
>>
>> The bigger heap also means you have fewer "runs" to merge in the
>> external sort. However, as long as the number of runs still fits in
>> the same number of merge passes, this is generally not a meaningful
>> difference.
>
> Do fewer runs mean more time (in whichever phase of external sort)?

That's complicated. In general fewer runs are faster, as the heap used at that stage is smaller, but this difference is small. If you can get the number of runs down to a level that needs fewer passes over the data, that will make things faster; but this is rare. If the sort isn't already being done in a single pass, then your sort must be huge or your working memory setting is pathologically tiny.

There is a rough conservation of total heap layers between the two phases: the initial tuple heap, and the merge-stage heap of tapes. Say, for example, that by increasing work_mem you can increase the initial heap from 25 layers to 27 layers, while decreasing the merge-phase heap from 5 layers to 3 layers. The total number of comparisons for the entire sort will be about the same, but the comparisons across the 27-layer heap are much more likely to need to go to main RAM, rather than come from L3 cache (or whatever the cache level is).

Cheers, Jeff
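[The "conservation of heap layers" point can be made concrete with a back-of-the-envelope count. This is a simplified model of my own, not a measurement: each tuple traverses roughly log2(tuples held in memory) layers in the run-forming heap and log2(number of runs) layers in the merge heap, so the sum stays near log2(total tuples) while work_mem only shifts layers between the cache-hostile big heap and the cache-friendly small one.]

```python
import math

def heap_layers(n_tuples, tuples_in_memory):
    """Return (initial, merge, total) heap layers per tuple: the
    run-forming heap holds tuples_in_memory entries, while the merge
    heap holds one entry per run (one run per memory-load of input)."""
    runs = max(1, math.ceil(n_tuples / tuples_in_memory))
    initial = math.log2(tuples_in_memory)
    merge = math.log2(runs) if runs > 1 else 0.0
    return initial, merge, initial + merge

# 100M tuples: growing work_mem moves layers from the small merge
# heap into the large initial heap; the total stays roughly constant.
for mem_tuples in (2**20, 2**24, 2**26):
    initial, merge, total = heap_layers(100_000_000, mem_tuples)
    print(f"{mem_tuples:>10} in memory: {initial:.1f} + {merge:.1f} = {total:.1f} layers")
```

This mirrors Jeff's 25-to-27 vs 5-to-3 example: the comparison count barely moves, but the share of comparisons done in a heap too large for cache grows with work_mem.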
Hi Jeff,

On Tue, Jul 30, 2013 at 3:25 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Tue, Jul 23, 2013 at 10:56 PM, Amit Langote <amitlangote09@gmail.com> wrote:
>> On Wed, Jul 24, 2013 at 3:20 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
>>>
>>> The heap structure used in external sorts is cache-unfriendly. The
>>> bigger the heap used, the more this unfriendliness becomes apparent.
>>> And the bigger maintenance_work_mem, the bigger the heap used.
>>>
>>> The bigger heap also means you have fewer "runs" to merge in the
>>> external sort. However, as long as the number of runs still fits in
>>> the same number of merge passes, this is generally not a meaningful
>>> difference.
>>
>> Do fewer runs mean more time (in whichever phase of external sort)?
>
> That's complicated. In general fewer runs are faster, as the heap
> used at that stage is smaller, but this difference is small. If you
> can get the number of runs down to a level that needs fewer passes
> over the data, that will make things faster; but this is rare. If
> the sort isn't already being done in a single pass, then your sort
> must be huge or your working memory setting is pathologically tiny.
>
> There is a rough conservation of total heap layers between the two
> phases: the initial tuple heap, and the merge-stage heap of tapes.
> Say, for example, that by increasing work_mem you can increase the
> initial heap from 25 layers to 27 layers, while decreasing the merge
> phase heap from 5 layers to 3 layers. The total number of comparisons
> for the entire sort will be about the same, but the comparisons across
> the 27-layer heap are much more likely to need to go to main RAM,
> rather than come from L3 cache (or whatever the cache level is).

If my assumption that fewer runs mean longer runs is plausible, would it be correct to think that the time for the performsort step (performsort_done - performsort_starting) increases when such longer runs are created due to a larger workMem?

-- Amit Langote