Thread: Sort performance cliff with small work_mem

Sort performance cliff with small work_mem

From: Heikki Linnakangas
Hi,

I spent some time performance testing sorting, and spotted a funny 
phenomenon with very small work_mem settings. This example demonstrates 
it well:

I sorted about 1 GB worth of pre-sorted integers, with different 
settings of work_mem between 64 kB (the minimum) and 100 kB. I also 
added some logging to print out the number of "runs" performed:


work_mem   elapsed time      runs
--------   ------------      ----
    64 kB       20167 ms   35301
    65 kB       19769 ms   34454
    66 kB       19844 ms   33646
    67 kB       20318 ms   32840
    68 kB       20022 ms   32105
    69 kB       19455 ms   31403
    70 kB       19509 ms   30700
    71 kB       19200 ms   30057
    72 kB       19248 ms   29441
    73 kB       19809 ms   29071
    74 kB       19840 ms   28657
    75 kB       19864 ms   28281
    76 kB       19634 ms   27914
    77 kB       19599 ms   27557
    78 kB       19429 ms   27184
    79 kB       19425 ms   26845
    80 kB       20196 ms   26515
    81 kB       21781 ms   26192
    82 kB       18883 ms   25877
    83 kB       20460 ms   25548
    84 kB       18910 ms   25249
    85 kB       29245 ms   2009692
    86 kB       26532 ms   1039496
    87 kB       25126 ms   701056
    88 kB       25241 ms   528867
    89 kB       24235 ms   418686
    90 kB       24038 ms   350528
    91 kB       24803 ms   301454
    92 kB       23192 ms   264434
    93 kB       23111 ms   233686
    94 kB       23295 ms   210807
    95 kB       26602 ms   192009
    96 kB       22990 ms   176289
    97 kB       22700 ms   162948
    98 kB       22370 ms   150727
    99 kB       22686 ms   140867
   100 kB       22413 ms   132217

Huh, something funny happened going from 84 kB to 85 kB. I traced it to 
this piece of code in inittapes():

>     /*
>      * Decrease availMem to reflect the space needed for tape buffers; but
>      * don't decrease it to the point that we have no room for tuples. (That
>      * case is only likely to occur if sorting pass-by-value Datums; in all
>      * other scenarios the memtuples[] array is unlikely to occupy more than
>      * half of allowedMem.  In the pass-by-value case it's not important to
>      * account for tuple space, so we don't care if LACKMEM becomes
>      * inaccurate.)
>      */
>     tapeSpace = (int64) maxTapes * TAPE_BUFFER_OVERHEAD;
> 
>     if (tapeSpace + GetMemoryChunkSpace(state->memtuples) < state->allowedMem)
>         USEMEM(state, tapeSpace);

With small work_mem values, maxTapes is always 6, so tapeSpace is 48 
kB. With a small enough work_mem, 84 kB or below in this test case, 
there is not enough memory left at this point, so we don't subtract 
tapeSpace. However, with a suitably evil test case, you can get 
arbitrarily close to the edge, so that we will subtract away almost all 
the remaining memory, leaving only a few bytes for the tuples. In this 
example, with work_mem='85 kB', each quicksorted run consists of only 
15 tuples on average.
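
To spell out the arithmetic (approximate, and assuming the default 8 kB 
BLCKSZ; the exact memtuples size at the moment we spill varies):

/*
 * tapeSpace = 6 tapes * TAPE_BUFFER_OVERHEAD (8 kB) = 48 kB
 *
 * work_mem <= 84 kB: tapeSpace + memtuples space >= allowedMem, so the
 * condition above fails and tapeSpace is never subtracted.  The
 * accounting is lax, but the runs stay reasonably long.
 *
 * work_mem = 85 kB: the condition just barely passes, tapeSpace is
 * subtracted, and almost nothing remains for the tuples themselves --
 * hence runs of ~15 tuples, and ~2 million runs in total.
 */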

To fix, I propose that we change the above so that we always subtract 
tapeSpace, but if less than e.g. 32 kB of memory is left after that 
(including the case where availMem goes below zero), we bump availMem 
back up to 32 kB. So we'd always reserve 32 kB to hold the tuples, even 
if that means that we exceed 'work_mem' slightly.
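
In code, what I have in mind is roughly this (untested sketch; 
MIN_TUPLE_MEMORY is a placeholder name):

tapeSpace = (int64) maxTapes * TAPE_BUFFER_OVERHEAD;
USEMEM(state, tapeSpace);	/* always charge the tape buffers */

/* ... but keep a floor under the memory reserved for the tuples */
if (state->availMem < MIN_TUPLE_MEMORY)
	state->availMem = MIN_TUPLE_MEMORY;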

Memory accounting isn't very accurate in general. Even as the code 
stands, we will exceed the requested work_mem if it's small enough, and 
we don't account for every small allocation and overhead anyway. So I 
think that would be acceptable, even though it's kind of cheating.

Of course, sorting large sets with tiny work_mem settings isn't a very 
good idea to begin with. But I'd nevertheless like to avoid 
non-intuitive performance cliffs like this. It threw me off while 
performance testing another patch.

- Heikki


Re: Sort performance cliff with small work_mem

From: Robert Haas
On Wed, May 2, 2018 at 11:38 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> To fix, I propose that we change the above so that we always subtract
> tapeSpace, but if less than e.g. 32 kB of memory is left after that
> (including the case where availMem goes below zero), we bump availMem back
> up to 32 kB. So we'd always reserve 32 kB to hold the tuples, even if that
> means that we exceed 'work_mem' slightly.

Sounds very reasonable.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Sort performance cliff with small work_mem

From: Peter Geoghegan
On Wed, May 2, 2018 at 8:46 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, May 2, 2018 at 11:38 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> To fix, I propose that we change the above so that we always subtract
>> tapeSpace, but if less than e.g. 32 kB of memory is left after that
>> (including the case where availMem goes below zero), we bump availMem back
>> up to 32 kB. So we'd always reserve 32 kB to hold the tuples, even if that
>> means that we exceed 'work_mem' slightly.
>
> Sounds very reasonable.

+1

-- 
Peter Geoghegan


Re: Sort performance cliff with small work_mem

From: Tom Lane
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, May 2, 2018 at 11:38 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> To fix, I propose that we change the above so that we always subtract
>> tapeSpace, but if less than e.g. 32 kB of memory is left after that
>> (including the case where availMem goes below zero), we bump availMem back
>> up to 32 kB. So we'd always reserve 32 kB to hold the tuples, even if that
>> means that we exceed 'work_mem' slightly.

> Sounds very reasonable.

Agreed.  I think that was my code to start with, and the issue certainly
didn't occur to me at the time.

I don't like the idea of using hardwired "32kB" though; some multiple
of TAPE_BUFFER_OVERHEAD seems more plausible.  Perhaps MINORDER times
TAPE_BUFFER_OVERHEAD would be good?
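
That is, something like this ('reserve' is just a stand-in name; with 
the default constants it works out to 6 * 8 kB = 48 kB):

int64		reserve = MINORDER * TAPE_BUFFER_OVERHEAD;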

            regards, tom lane


Re: Sort performance cliff with small work_mem

From: Peter Geoghegan
On Wed, May 2, 2018 at 8:38 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> With small work_mem values, maxTapes is always 6, so tapeSpace is 48 kB.
> With a small enough work_mem, 84 kB or below in this test case, there is
> not enough memory left at this point, so we don't subtract tapeSpace.
> However, with a suitably evil test case, you can get arbitrarily close to
> the edge, so that we will subtract away almost all the remaining memory,
> leaving only a few bytes for the tuples. In this example, with
> work_mem='85 kB', each quicksorted run consists of only 15 tuples on
> average.

This is an extreme example of the memtuples array and memory for
tuples becoming unbalanced. You'll have 1024 memtuples (24KiB), which
is rather a lot more than the 15 tuples that you can fit in this
example case.

I don't feel strongly about it, but arguably the minimum additional
amount of memory should be big enough that you have some chance of
using all 1024 SortTuples (the whole memtuples array).
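
For instance (purely illustrative; 'min_tuple_mem' is a stand-in and 
the per-tuple size is a guess):

/* enough tuple memory to have a chance of filling all 1024 slots */
min_tuple_mem = 1024 * 32;	/* assumes ~32 bytes per tuple */

With a ~32-byte average, that happens to work out to the same 32 kB 
anyway.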

-- 
Peter Geoghegan


Re: Sort performance cliff with small work_mem

From: Heikki Linnakangas
On 02/05/18 19:41, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Wed, May 2, 2018 at 11:38 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>> To fix, I propose that we change the above so that we always subtract
>>> tapeSpace, but if less than e.g. 32 kB of memory is left after that
>>> (including the case where availMem goes below zero), we bump availMem back
>>> up to 32 kB. So we'd always reserve 32 kB to hold the tuples, even if that
>>> means that we exceed 'work_mem' slightly.
> 
>> Sounds very reasonable.
> 
> Agreed.  I think that was my code to start with, and the issue certainly
> didn't occur to me at the time.
> 
> I don't like the idea of using hardwired "32kB" though; some multiple
> of TAPE_BUFFER_OVERHEAD seems more plausible.  Perhaps MINORDER times
> TAPE_BUFFER_OVERHEAD would be good?

I don't think the amount that we reserve to hold the tuples should 
depend on those things. The function is "allocating" memory for the 
tape buffers, yes, but now we're talking about keeping some memory for 
the tuples while quicksorting the initial runs. That seems independent 
of the number of tapes and the tape buffer size.

I'm not sure what you could derive that from, to make it less arbitrary. 
At the moment, I'm thinking of just doing something like this:

/*
 * Minimum amount of memory reserved to hold the sorted tuples in
 * TSS_BUILDRUNS phase.  This specifies a minimum size for the merge
 * runs, when work_mem is very small.
 */
#define MIN_TUPLE_MEMORY    (32 * 1024)

Independently of this, perhaps we should put in a special case in 
dumptuples(), so that it would never create a run with fewer than 
maxTapes tuples. The rationale is that you'll need to hold that many 
tuples in memory during the merge phase anyway, so it seems silly to 
bail out before that while building the initial runs. You're going to 
exceed work_mem by roughly the same amount anyway, just in a different 
phase. That's not the case in this example, but it might happen when 
sorting extremely wide tuples.
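
Something like this at the top of dumptuples(), perhaps (sketch, 
untested):

/* Don't cut a run shorter than maxTapes tuples; keep accumulating
 * instead, even though that exceeds work_mem for a while. */
if (!alltuples && state->memtupcount < state->maxTapes)
	return;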

- Heikki


Re: Sort performance cliff with small work_mem

From: Peter Geoghegan
On Wed, May 2, 2018 at 10:43 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Independently of this, perhaps we should put in a special case in
> dumptuples(), so that it would never create a run with fewer than maxTapes
> tuples. The rationale is that you'll need to hold that many tuples in
> memory during the merge phase anyway, so it seems silly to bail out before
> that while building the initial runs. You're going to exceed work_mem by
> roughly the same amount anyway, just in a different phase. That's not the
> case in this example, but it might happen when sorting extremely wide
> tuples.

-1 from me. What about the case where only some tuples are massive?

-- 
Peter Geoghegan


Re: Sort performance cliff with small work_mem

From: Peter Geoghegan
On Wed, May 2, 2018 at 10:43 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I'm not sure what you could derive that from, to make it less arbitrary. At
> the moment, I'm thinking of just doing something like this:
>
> /*
>  * Minimum amount of memory reserved to hold the sorted tuples in
>  * TSS_BUILDRUNS phase.  This specifies a minimum size for the merge runs,
>  * when work_mem is very small.
>  */
> #define MIN_TUPLE_MEMORY        (32 * 1024)

If you end up doing something like this, I suggest that you also
change this code to simply assign 1024 (or maybe a new preprocessor
constant):

state->memtupsize = Max(1024,
                        ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1);

The ALLOCSET_SEPARATE_THRESHOLD part can become a static assertion.
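
Something like this, perhaps (sketch; INITIAL_MEMTUPSIZE is a made-up 
name):

#define INITIAL_MEMTUPSIZE	1024

/* the initial array must still exceed ALLOCSET_SEPARATE_THRESHOLD */
StaticAssertStmt(INITIAL_MEMTUPSIZE * sizeof(SortTuple) >
				 ALLOCSET_SEPARATE_THRESHOLD,
				 "initial memtuples array is too small");

state->memtupsize = INITIAL_MEMTUPSIZE;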

-- 
Peter Geoghegan


Re: Sort performance cliff with small work_mem

From: Tom Lane
Peter Geoghegan <pg@bowt.ie> writes:
> On Wed, May 2, 2018 at 10:43 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> Independently of this, perhaps we should put in a special case in
>> dumptuples(), so that it would never create a run with fewer than maxTapes
>> tuples. The rationale is that you'll need to hold that many tuples in
>> memory during the merge phase anyway, so it seems silly to bail out before
>> that while building the initial runs. You're going to exceed work_mem by
>> roughly the same amount anyway, just in a different phase. That's not the
>> case in this example, but it might happen when sorting extremely wide
>> tuples.

> -1 from me. What about the case where only some tuples are massive?

Well, what about it?  If there are just a few wide tuples, then the peak
memory consumption will depend on how many of those happen to be in memory
at the same time ... but we have zero control over that in the merge
phase, so why sweat about it here?  I think Heikki's got a good idea about
setting a lower bound on the number of tuples we'll hold in memory during
run creation.

            regards, tom lane


Re: Sort performance cliff with small work_mem

From: Peter Geoghegan
On Wed, May 2, 2018 at 11:06 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> -1 from me. What about the case where only some tuples are massive?
>
> Well, what about it?  If there are just a few wide tuples, then the peak
> memory consumption will depend on how many of those happen to be in memory
> at the same time ... but we have zero control over that in the merge
> phase, so why sweat about it here?  I think Heikki's got a good idea about
> setting a lower bound on the number of tuples we'll hold in memory during
> run creation.

We don't have control over it, but I'm not excited about specifically 
going out of our way to always use more memory in dumptuples() just 
because it's no worse than the worst case for merging. I am supportive 
of the idea of making sure that the amount of memory left over for 
tuples is reasonably in line with memtupsize at the point that the sort 
starts, though.

-- 
Peter Geoghegan