Thread: Interesting misbehavior of repalloc()
I was just tracing through a memory leak occurring when regexp_matches() is executed a lot of times within one query, for instance this rather stupid test case: select count(*) from (select regexp_matches(repeat('xyxxy',100), '[xy]', 'g') from generate_series(1,100000)) ss; I thought it was because regexp_matches() wasn't bothering to clean up the stuff it allocated in the per-query context, but after I fixed that there was still a leak :-(. It took me a while to realize what was going on. The problem stems from an improvement(?) I put into AllocSetRealloc() awhile ago, which is to try to save memcpy cycles by enlarging a chunk in-place if possible, that is, if it's the last chunk in the containing memory block and there's enough room in the block. This results in the following cycle: 1. regexp_matches asks for a 2K chunk. There's nothing in the 2K chunk freelist, so AllocSetAlloc allocates a new chunk from the end of the context's current memory block. 2. regexp_matches needs more space and repalloc's the chunk to 4K. AllocSetRealloc notices it can realloc in place, and does so. 3. When regexp_matches is done with the current call, it politely releases the chunk, and AllocSetFree sticks it into the freelist for 4K chunks. 4. The next call of regexp_matches asks for a 2K chunk. There's nothing in the 2K chunk freelist, so AllocSetAlloc allocates a new chunk from the end of the context's current memory block. Lather, rinse, repeat --- each cycle adds another entry to the 4K-chunk freelist, which we'll never use. Obviously this is not regexp_matches' fault; it's doing everything by the book. There are similar usage patterns elsewhere (particularly StringInfo) so I'm surprised we've not recognized the problem before. Perhaps we should just remove lines 934-982 of aset.c, and always handle small-chunk reallocs with the "brute force" case. Can anyone see a way to salvage something from the "realloc-in-place" idea? One thought that comes to mind is to try to make AllocSetFree recognize when it's pfree'ing the last chunk in a memory block, and to handle that by decrementing the freeptr instead of putting the chunk into any freelist. This isn't a 100% solution because it only works if no new chunk has been made since the repalloc enlargement. It would handle the sort of repetitive cycle the test case exposes, since after the first cycle or two everything fixed-size is going into chunks that are just repeatedly obtained from the freelists and put back again. But I'm afraid that this would just slow down most cases, by adding cycles to AllocSetFree (plus the subsequent AllocSetAlloc takes longer, since the freeptr-increment path is a bit slower than just taking an extant chunk off the freelist). The savings in memcpy work when AllocSetRealloc gets to win probably don't justify adding any overhead to paths that don't involve realloc. Comments, better ideas? regards, tom lane
* Tom Lane (tgl@sss.pgh.pa.us) wrote: [...] > 3. When regexp_matches is done with the current call, it politely > releases the chunk, and AllocSetFree sticks it into the freelist for > 4K chunks. > > 4. The next call of regexp_matches asks for a 2K chunk. There's nothing > in the 2K chunk freelist, so AllocSetAlloc allocates a new chunk from > the end of the context's current memory block. > > Lather, rinse, repeat --- each cycle adds another entry to the 4K-chunk > freelist, which we'll never use. This is likely to be naive, but perhaps it'll help others understand too. Would it be sensible to look at trying to fill a 2K request from the next-larger (4K-chunk) freelist before allocating a new chunk? Could it, essentially, "downgrade" the 4K chunk into 2 2K chunks when that's what is being asked for (and the 2K freelist is empty, and the 4K freelist isn't, etc)? The realloc-in-place seems like a good idea in general, but this test case does need to be handled in some clean way. Perhaps allowing a "downgrade" path along with the "upgrade" path would work well. (If that's what you were suggesting already, then sorry for the noise) Thanks, Stephen
Stephen Frost <sfrost@snowman.net> writes: > This is likely to be naive, but perhaps it'll help others understand > too. Would it be sensible to look at trying to fill a 2K request from > the next-larger (4K-chunk) freelist before allocating a new chunk? That doesn't sound like a very good idea --- in the worst case it could lead to giving out 8K chunks to satisfy very small requests. I realize that you're thinking of looking at only the next-larger freelist, but that would not be enough to solve the problem, because the repalloc'd chunk might've been enlarged more than 2X before it finally got returned to the freelists. (I did not mention it before because it didn't seem relevant, but the test case I was looking at actually does go through 2 rounds of doubling of the regexp_matches workspace before it returns it to the freelists.) So to solve the problem that way you'd have to be willing to use *any* larger chunk to satisfy a request, and that seems to me to be way too inefficient storage-wise. Also, if you're thinking of dividing an existing chunk into two half-size chunks, that doesn't work because of the need for header space for each chunk --- you'd end up with some chunks slightly smaller than standard, which would add its own inefficiencies. regards, tom lane
"Tom Lane" <tgl@sss.pgh.pa.us> writes: > Perhaps we should just remove lines 934-982 of aset.c, and always handle > small-chunk reallocs with the "brute force" case. Can anyone see a way > to salvage something from the "realloc-in-place" idea? > > One thought that comes to mind is to try to make AllocSetFree recognize > when it's pfree'ing the last chunk in a memory block, and to handle that > by decrementing the freeptr instead of putting the chunk into any > freelist. We could also only do the realloc-in-place only if there isn't a 4k chunk in the 4k freelist. I'm imagining that usually there wouldn't be. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Gregory Stark <stark@enterprisedb.com> writes: > We could also only do the realloc-in-place only if there isn't a 4k chunk in > the 4k freelist. I'm imagining that usually there wouldn't be. Or in general, if there's a free chunk of the right size then copy to it, else consider realloc-in-place. Counterintuitive but it might work. I'm not sure how often there wouldn't be a free chunk though ... regards, tom lane
I wrote: > Gregory Stark <stark@enterprisedb.com> writes: >> We could also only do the realloc-in-place only if there isn't a 4k chunk in >> the 4k freelist. I'm imagining that usually there wouldn't be. > Or in general, if there's a free chunk of the right size then copy to > it, else consider realloc-in-place. Counterintuitive but it might work. > I'm not sure how often there wouldn't be a free chunk though ... I experimented with this a bit. Not doing enlarge-in-place when there's a suitable free chunk turns out to be practically a one-line addition to AllocSetRealloc, but the question is whether that forty-line block of code is pulling its weight at all. I added some debug code to log when the different cases happen, and ran the regression tests. (Which maybe aren't very representative of real-world usage, but it's the best easy test I can think of.) What I got was 380 successful enlarge in place438 blocked by new rule about available chunk6078 other reallocs of small chunks The "other reallocs" are ones where one of the existing limitations prevent us from using realloc-in-place. The successful enlargements broke down like this: 12 realloc enlarge 16 -> 24 1 realloc enlarge 16 -> 32 1 realloc enlarge 16 -> 40 1 realloc enlarge 16 -> 64 1 reallocenlarge 16 -> 80139 realloc enlarge 256 -> 512119 realloc enlarge 512 -> 1024 80 realloc enlarge 1024 -> 2048 26 reallocenlarge 2048 -> 4096 Bearing in mind that the first number is the number of bytes of data we'd have to copy if we don't enlarge-in-place, we're not saving that much work. (Cases involving larger chunks are passed off to libc's realloc(), so there's never anything bigger than 2K of copying at stake, at least when power-of-2 request sizes are used.) I drilled down a bit deeper and found that most of the larger realloc's are coming from just two places: enlargement of StringInfo buffers (initially 256 bytes) and enlargement of scan.l's literalbuf (initially 128 bytes). I changed the initial allocations to 1K for each of these, and then the profile of successful realloc-in-place changes to 12 realloc enlarge 16 -> 24 1 realloc enlarge 16 -> 32 1 realloc enlarge 16 -> 40 1 realloc enlarge 16 -> 64 1 reallocenlarge 16 -> 80 81 realloc enlarge 1024 -> 2048 25 realloc enlarge 2048 -> 4096 Here, all of the remaining larger realloc's are happening during CREATE VIEW operations (while constructing the pg_rewrite rule text), which probably need not be considered a performance-critical path. Based on this, I conclude that the realloc-in-place code doesn't pull its weight. We should just remove it, and increase those penurious initial allocations in stringinfo.c and scan.l to avoid most of the use-cases for repalloc in the first place. Does anyone have any other test cases to suggest? Stuff like pgbench isn't interesting --- it doesn't cause repalloc to be invoked at all. regards, tom lane