Thread: memory leak in e94568ecc10 (pre-reading in external sort)

memory leak in e94568ecc10 (pre-reading in external sort)

From
Tomas Vondra
Date:
Hi,

it seems e94568ecc10 has a pretty bad memory leak. A simple
  pgbench -i -s 300

allocates ~32GB of memory before it fails
  vacuum...  set primary keys...  ERROR:  out of memory  DETAIL:  Failed on request of size 134184960.

The relevant bit from the memory context stats:

TupleSort main: 33278738504 total in 263 blocks; 78848 free (23 chunks);
33278659656 used


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: memory leak in e94568ecc10 (pre-reading in external sort)

From
Heikki Linnakangas
Date:
On 10/06/2016 07:50 AM, Tomas Vondra wrote:
> it seems e94568ecc10 has a pretty bad memory leak. A simple

Oops, fixed, thanks for the report!

To be precise, this wasn't a memory leak, just a gross overallocation of 
memory. The new code in tuplesort.c assumes that it's harmless to  call 
LogicalTapeRewind() on never-used tapes, but in fact, it allocated the 
read buffer for the tape. I fixed that by changing LogicalTapeRewind() 
to not allocate it, if the tape was completely empty.

This is related to earlier the discussion with Peter G, on whether we 
should change state->maxTapes to reflect the actual number of tape that 
were used, when that's less than maxTapes. I think his confusion about 
the loop in init_tape_buffers(), in 
CAM3SWZQ7=FCy1iUZ6jNzUUNnktAG6uitC1i-DoNxScP-9ZsHwQ@mail.gmail.com would 
also have been avoided, if we had done that. So I think we should 
reconsider that.

- Heikki




Re: memory leak in e94568ecc10 (pre-reading in external sort)

From
Peter Geoghegan
Date:
On Thu, Oct 6, 2016 at 12:00 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> This is related to earlier the discussion with Peter G, on whether we should
> change state->maxTapes to reflect the actual number of tape that were used,
> when that's less than maxTapes. I think his confusion about the loop in
> init_tape_buffers(), in
> CAM3SWZQ7=FCy1iUZ6jNzUUNnktAG6uitC1i-DoNxScP-9ZsHwQ@mail.gmail.com would
> also have been avoided, if we had done that. So I think we should reconsider
> that.

-1 on that from me. I don't think that you should modify a variable
that is directly linkable to Knuth's description of polyphase merge --
doing so seems like a bad idea. state->maxTapes (Knuth's T) really is
something that is established pretty early on, and doesn't change.

While the fix you pushed was probably a good idea anyway, I still
think you should not use state->maxTapes to exhaustively call
LogicalTapeAssignReadBufferSize() on every tape, even non-active
tapes. That's the confusing part. It's not as if your need for the
number of input tapes (the number of maybe-active tapes) is long
lived; you just need to instruct logtape.c on buffer sizing once, at
the start of mergeruns().

Besides, what I propose to do is really exactly the same as what you
also want to do, except it avoids actually changing state->maxTapes.
We'd just pass down what you propose to assign to state->maxTapes
directly, which differs (and not just in the common case where there
are inactive tapes -- it's always at least off-by-one). Right?

-- 
Peter Geoghegan



Re: memory leak in e94568ecc10 (pre-reading in external sort)

From
Peter Geoghegan
Date:
On Thu, Oct 6, 2016 at 8:44 AM, Peter Geoghegan <pg@heroku.com> wrote:
> Besides, what I propose to do is really exactly the same as what you
> also want to do, except it avoids actually changing state->maxTapes.
> We'd just pass down what you propose to assign to state->maxTapes
> directly, which differs (and not just in the common case where there
> are inactive tapes -- it's always at least off-by-one). Right?

What I mean is that you should pass down numTapes alongside
numInputTapes. The function init_tape_buffers() could either have an
additional argument (numTapes), or derive numTapes itself.


-- 
Peter Geoghegan



Re: memory leak in e94568ecc10 (pre-reading in external sort)

From
Heikki Linnakangas
Date:
On 10/06/2016 06:44 PM, Peter Geoghegan wrote:
> While the fix you pushed was probably a good idea anyway, I still
> think you should not use swhichtate->maxTapes to exhaustively call
> LogicalTapeAssignReadBufferSize() on every tape, even non-active
> tapes. That's the confusing part.

Admittedly that's confusing. Thinking about this some more, I came up
with the attached. I removed the separate
LogicalTapeAssignReadBufferSize() call altogether - the read buffer size
is now passed as argument to the LogicalTapeRewindForRead() call.

I split LogicalTapeRewind() into two: LogicalTapeRewindForRead() and
LogicalTapeRewindForWrite(). A boolean argument like 'forWrite' is
mentally challenging, because when reading a call site, you have to
remember which way round the boolean is. And now you also pass the extra
buffer_size argument when rewinding for reading, but not when writing.

I gave up on the logic to calculate the quotient and reminder of the
available memory, and assigning one extra buffer to some of the tapes.
I'm now just rounding it down to the nearest BLCKSZ boundary. That
simplifies the code further, although we're now not using all the memory
we could. I'm pretty sure that's OK, although I haven't done any
performance testing of this.

- Heikki


Attachment

Re: memory leak in e94568ecc10 (pre-reading in external sort)

From
Peter Geoghegan
Date:
On Mon, Oct 10, 2016 at 5:21 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Admittedly that's confusing. Thinking about this some more, I came up with
> the attached. I removed the separate LogicalTapeAssignReadBufferSize() call
> altogether - the read buffer size is now passed as argument to the
> LogicalTapeRewindForRead() call.
>
> I split LogicalTapeRewind() into two: LogicalTapeRewindForRead() and
> LogicalTapeRewindForWrite(). A boolean argument like 'forWrite' is mentally
> challenging, because when reading a call site, you have to remember which
> way round the boolean is. And now you also pass the extra buffer_size
> argument when rewinding for reading, but not when writing.

I like the general idea here.

> I gave up on the logic to calculate the quotient and reminder of the
> available memory, and assigning one extra buffer to some of the tapes. I'm
> now just rounding it down to the nearest BLCKSZ boundary. That simplifies
> the code further, although we're now not using all the memory we could. I'm
> pretty sure that's OK, although I haven't done any performance testing of
> this.

The only thing I notice is that this new struct field could use a comment:

> --- a/src/backend/utils/sort/tuplesort.c
> +++ b/src/backend/utils/sort/tuplesort.c
> @@ -366,6 +366,8 @@ struct Tuplesortstate
>     char       *slabMemoryEnd;  /* end of slab memory arena */
>     SlabSlot   *slabFreeHead;   /* head of free list */
>
> +   size_t      read_buffer_size;
> +

Also, something about trace_sort here:

> +#ifdef TRACE_SORT
> +   if (trace_sort)
> +       elog(LOG, "using " INT64_FORMAT " KB of memory for read buffers among %d input tapes",
> +            (state->availMem) / 1024, numInputTapes);
> +#endif
> +
> +   state->read_buffer_size = state->availMem / numInputTapes;
> +   USEMEM(state, state->availMem);

Maybe you should just make the trace_sort output talk about blocks at
this point?

-- 
Peter Geoghegan



Re: memory leak in e94568ecc10 (pre-reading in external sort)

From
Heikki Linnakangas
Date:
On 10/11/2016 12:56 AM, Peter Geoghegan wrote:
> Also, something about trace_sort here:
>
>> +#ifdef TRACE_SORT
>> +   if (trace_sort)
>> +       elog(LOG, "using " INT64_FORMAT " KB of memory for read buffers among %d input tapes",
>> +            (state->availMem) / 1024, numInputTapes);
>> +#endif
>> +
>> +   state->read_buffer_size = state->availMem / numInputTapes;
>> +   USEMEM(state, state->availMem);
>
> Maybe you should just make the trace_sort output talk about blocks at
> this point?

With # of blocks, you then have to mentally divide by 8 to get the 
actual memory used. I think kB is nicer in messages that are meant to be 
read by humans.

The bigger problem with this message is that it's not very accurate 
anymore. The actual amount of memory used is rounded down, and capped by 
MaxAllocSize*numInputTapes. And would it be better to print the per-tape 
buffer size, rather than the total?

- Heikki