Thread: Interesting misbehavior of repalloc()

Interesting misbehavior of repalloc()

From
Tom Lane
Date:
I was just tracing through a memory leak occurring when regexp_matches()
is executed a lot of times within one query, for instance this rather
stupid test case:

select count(*) from (select regexp_matches(repeat('xyxxy',100), '[xy]', 'g')  from generate_series(1,100000)) ss;

I thought it was because regexp_matches() wasn't bothering to clean up
the stuff it allocated in the per-query context, but after I fixed that
there was still a leak :-(.  It took me a while to realize what was
going on.  The problem stems from an improvement(?) I put into
AllocSetRealloc() awhile ago, which is to try to save memcpy cycles by
enlarging a chunk in-place if possible, that is, if it's the last chunk
in the containing memory block and there's enough room in the block.
This results in the following cycle:

1. regexp_matches asks for a 2K chunk.  There's nothing in the 2K chunk
freelist, so AllocSetAlloc allocates a new chunk from the end of the
context's current memory block.

2. regexp_matches needs more space and repalloc's the chunk to 4K.
AllocSetRealloc notices it can realloc in place, and does so.

3. When regexp_matches is done with the current call, it politely
releases the chunk, and AllocSetFree sticks it into the freelist for
4K chunks.

4. The next call of regexp_matches asks for a 2K chunk.  There's nothing
in the 2K chunk freelist, so AllocSetAlloc allocates a new chunk from
the end of the context's current memory block.

Lather, rinse, repeat --- each cycle adds another entry to the 4K-chunk
freelist, which we'll never use.

Obviously this is not regexp_matches' fault; it's doing everything by
the book.  There are similar usage patterns elsewhere (particularly
StringInfo) so I'm surprised we've not recognized the problem before.

Perhaps we should just remove lines 934-982 of aset.c, and always handle
small-chunk reallocs with the "brute force" case.  Can anyone see a way
to salvage something from the "realloc-in-place" idea?

One thought that comes to mind is to try to make AllocSetFree recognize
when it's pfree'ing the last chunk in a memory block, and to handle that
by decrementing the freeptr instead of putting the chunk into any
freelist.  This isn't a 100% solution because it only works if no new
chunk has been made since the repalloc enlargement.  It would handle the
sort of repetitive cycle the test case exposes, since after the first
cycle or two everything fixed-size is going into chunks that are just
repeatedly obtained from the freelists and put back again.  But I'm
afraid that this would just slow down most cases, by adding cycles to
AllocSetFree (plus the subsequent AllocSetAlloc takes longer, since the
freeptr-increment path is a bit slower than just taking an extant chunk
off the freelist).  The savings in memcpy work when AllocSetRealloc gets
to win probably don't justify adding any overhead to paths that don't
involve realloc.

Comments, better ideas?
        regards, tom lane


Re: Interesting misbehavior of repalloc()

From
Stephen Frost
Date:
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
[...]
> 3. When regexp_matches is done with the current call, it politely
> releases the chunk, and AllocSetFree sticks it into the freelist for
> 4K chunks.
>
> 4. The next call of regexp_matches asks for a 2K chunk.  There's nothing
> in the 2K chunk freelist, so AllocSetAlloc allocates a new chunk from
> the end of the context's current memory block.
>
> Lather, rinse, repeat --- each cycle adds another entry to the 4K-chunk
> freelist, which we'll never use.

This is likely to be naive, but perhaps it'll help others understand
too.  Would it be sensible to look at trying to fill a 2K request from
the next-larger (4K-chunk) freelist before allocating a new chunk?
Could it, essentially, "downgrade" the 4K chunk into 2 2K chunks when
that's what is being asked for (and the 2K freelist is empty, and the 4K
freelist isn't, etc)?

The realloc-in-place seems like a good idea in general, but this test
case does need to be handled in some clean way.  Perhaps allowing a
"downgrade" path along with the "upgrade" path would work well.

(If that's what you were suggesting already, then sorry for the noise)
Thanks,
    Stephen

Re: Interesting misbehavior of repalloc()

From
Tom Lane
Date:
Stephen Frost <sfrost@snowman.net> writes:
> This is likely to be naive, but perhaps it'll help others understand
> too.  Would it be sensible to look at trying to fill a 2K request from
> the next-larger (4K-chunk) freelist before allocating a new chunk?

That doesn't sound like a very good idea --- in the worst case it could
lead to giving out 8K chunks to satisfy very small requests.  I realize
that you're thinking of looking at only the next-larger freelist, but
that would not be enough to solve the problem, because the repalloc'd
chunk might've been enlarged more than 2X before it finally got returned
to the freelists.  (I did not mention it before because it didn't seem
relevant, but the test case I was looking at actually does go through
2 rounds of doubling of the regexp_matches workspace before it returns
it to the freelists.)  So to solve the problem that way you'd have to be
willing to use *any* larger chunk to satisfy a request, and that seems
to me to be way too inefficient storage-wise.

Also, if you're thinking of dividing an existing chunk into two
half-size chunks, that doesn't work because of the need for header space
for each chunk --- you'd end up with some chunks slightly smaller than
standard, which would add its own inefficiencies.
        regards, tom lane


Re: Interesting misbehavior of repalloc()

From
Gregory Stark
Date:
"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> Perhaps we should just remove lines 934-982 of aset.c, and always handle
> small-chunk reallocs with the "brute force" case.  Can anyone see a way
> to salvage something from the "realloc-in-place" idea?
>
> One thought that comes to mind is to try to make AllocSetFree recognize
> when it's pfree'ing the last chunk in a memory block, and to handle that
> by decrementing the freeptr instead of putting the chunk into any
> freelist.  

We could also only do the realloc-in-place only if there isn't a 4k chunk in
the 4k freelist. I'm imagining that usually there wouldn't be.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com



Re: Interesting misbehavior of repalloc()

From
Tom Lane
Date:
Gregory Stark <stark@enterprisedb.com> writes:
> We could also only do the realloc-in-place only if there isn't a 4k chunk in
> the 4k freelist. I'm imagining that usually there wouldn't be.

Or in general, if there's a free chunk of the right size then copy to
it, else consider realloc-in-place.  Counterintuitive but it might work.
I'm not sure how often there wouldn't be a free chunk though ...
        regards, tom lane


Re: Interesting misbehavior of repalloc()

From
Tom Lane
Date:
I wrote:
> Gregory Stark <stark@enterprisedb.com> writes:
>> We could also only do the realloc-in-place only if there isn't a 4k chunk in
>> the 4k freelist. I'm imagining that usually there wouldn't be.

> Or in general, if there's a free chunk of the right size then copy to
> it, else consider realloc-in-place.  Counterintuitive but it might work.
> I'm not sure how often there wouldn't be a free chunk though ...

I experimented with this a bit.  Not doing enlarge-in-place when there's
a suitable free chunk turns out to be practically a one-line addition to
AllocSetRealloc, but the question is whether that forty-line block of
code is pulling its weight at all.  I added some debug code to log when
the different cases happen, and ran the regression tests.  (Which maybe
aren't very representative of real-world usage, but it's the best easy
test I can think of.)  What I got was
380 successful enlarge in place438 blocked by new rule about available chunk6078 other reallocs of small chunks

The "other reallocs" are ones where one of the existing limitations
prevent us from using realloc-in-place.

The successful enlargements broke down like this:
 12 realloc enlarge 16 -> 24  1 realloc enlarge 16 -> 32  1 realloc enlarge 16 -> 40  1 realloc enlarge 16 -> 64  1
reallocenlarge 16 -> 80139 realloc enlarge 256 -> 512119 realloc enlarge 512 -> 1024 80 realloc enlarge 1024 -> 2048 26
reallocenlarge 2048 -> 4096
 

Bearing in mind that the first number is the number of bytes of data
we'd have to copy if we don't enlarge-in-place, we're not saving that
much work.  (Cases involving larger chunks are passed off to libc's
realloc(), so there's never anything bigger than 2K of copying at
stake, at least when power-of-2 request sizes are used.)

I drilled down a bit deeper and found that most of the larger realloc's
are coming from just two places: enlargement of StringInfo buffers
(initially 256 bytes) and enlargement of scan.l's literalbuf (initially
128 bytes).  I changed the initial allocations to 1K for each of these,
and then the profile of successful realloc-in-place changes to
 12 realloc enlarge 16 -> 24  1 realloc enlarge 16 -> 32  1 realloc enlarge 16 -> 40  1 realloc enlarge 16 -> 64  1
reallocenlarge 16 -> 80 81 realloc enlarge 1024 -> 2048 25 realloc enlarge 2048 -> 4096
 

Here, all of the remaining larger realloc's are happening during CREATE
VIEW operations (while constructing the pg_rewrite rule text), which
probably need not be considered a performance-critical path.

Based on this, I conclude that the realloc-in-place code doesn't pull
its weight.  We should just remove it, and increase those penurious
initial allocations in stringinfo.c and scan.l to avoid most of the
use-cases for repalloc in the first place.

Does anyone have any other test cases to suggest?  Stuff like pgbench
isn't interesting --- it doesn't cause repalloc to be invoked at all.
        regards, tom lane