Re: Out of Memory errors are frustrating as heck! - Mailing list pgsql-performance

From Jeff Janes
Subject Re: Out of Memory errors are frustrating as heck!
Date
Msg-id CAMkU=1zsfYxLQo8NR9YhyBraLh_HuwDQB4LZ1z5S0NPVthbGYw@mail.gmail.com
In response to Re: Out of Memory errors are frustrating as heck!  (Gunther <raj@gusw.net>)
Responses Re: Out of Memory errors are frustrating as heck!
List pgsql-performance
On Mon, Apr 15, 2019 at 9:49 PM Gunther <raj@gusw.net> wrote:

Jeff Janes had more

Breakpoint 2, AllocSetAlloc (context=0x1168230, size=8272) at aset.c:715
715     {
(gdb) p context->name
$8 = 0x96ce5b "ExecutorState"

I think that the above one might have been the one you wanted.
Not sure how you could tell that? It's the same place as everything else. If we can find out what you're looking for, maybe we can set a breakpoint earlier up the call chain?

It is just a hunch.  That size is similar to one you reported for the last-straw allocation on the case with the nested loops rather than hash joins. So it is reasonable (but surely not guaranteed) that they are coming from the same place in the code.  And since that other one occurred as the last straw, it must not have been part of the start-up allocations, but rather part of the stage where we should be at steady state, but are not.  So maybe this one is too.
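If you want to try to catch that specific allocation rather than the innocent bystanders, one option (just a sketch, and it only helps if the suspect request really does repeat at that size) is to make the breakpoint conditional on the size you saw above:

(gdb) break AllocSetAlloc if size == 8272
(gdb) cont

That way the debugger only stops on candidates worth backtracing.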
 

I guess I should run this for a little longer. So I disable my breakpoints 

it went up pretty quick from 1.2 GB to 1.5 GB, but then it stopped growing fast, so now back to gdb and break:

Unfortunately, I think this means you missed your opportunity and are now getting backtraces of the innocent bystanders.
But why? If I see the memory still go up insanely fast, isn't that a sign for the leak?

You said it was no longer going up insanely fast by the time you got your breakpoints re-activated.   
Particularly since you report that the version using nested loops rather than hash joins also leaked, so it is probably not the hash-join specific code that is doing it.
How about it's in the DISTINCT? I noticed while peeking up the call chain that it was already in the UNIQUE sort thing also.  I guess it's streaming the results from the hash join right into the unique sort step.

Isn't it crashing before any rows are emitted from the hash join?  I thought it was, but I am not very confident on that.
 
What I've done before is compile with the comments removed from 
src/backend/utils/mmgr/aset.c:/* #define HAVE_ALLOCINFO */
I have just done that and it creates an insane amount of output from all the processes; I'm afraid there will be no way to keep that stuff separated. If there was a way of turning that on and off for one process only, then we could probably get more info...

Are you doing all this stuff on a production server in use by other people? 

Everything is also extremely slow that way. Like in a half hour the memory didn't even reach 100 MB.

and then look for allocation sizes which are getting allocated but not freed, and then you can go back to gdb to look for allocations of those specific sizes.
I guess I should look for both, address and size to match it better.

You can analyze the log for specific addresses which are allocated but not freed, but once you find them you can't do much with them.  Those specific addresses probably won't repeat on the next run, so you can't do anything with the knowledge.  If you find a request size that is systematically allocated but not freed, you can condition the logging (or gdb) on that size.
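To do that analysis, a rough sketch like the following (untested; it assumes the log lines have exactly the format the AllocAllocInfo/AllocFreeInfo macros print, and a reused chunk address simply overwrites the earlier entry, which is fine since we only care about what is still live at the end) would pair alloc and free by chunk address and report what is still outstanding, grouped by context and request size:

#!/usr/bin/env python3
# Pair AllocAlloc/AllocFree lines from the backend's stderr log and report,
# per (context, request size), how many chunks were allocated but never freed.
import re
import sys
from collections import Counter

pat = re.compile(r'^Alloc(Alloc|Free): (.+): (0x[0-9a-f]+), (\d+)')
live = {}  # chunk address -> (context name, size) of the outstanding allocation

with open(sys.argv[1]) as f:
    for line in f:
        m = pat.match(line)
        if not m:
            continue
        kind, ctx, addr, size = m.groups()
        if kind == 'Alloc':
            live[addr] = (ctx, int(size))
        else:
            live.pop(addr, None)  # freed; drop it if we saw the allocation

counts = Counter(live.values())
for (ctx, size), n in counts.most_common(20):
    print(f'{n:8} outstanding  {ctx:24} size {size}')

Whatever sizes dominate that report are the ones worth conditioning the logging or the gdb breakpoint on.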
 
This generates a massive amount of output, and it bypasses the logging configuration and goes directly to stderr--so it might not end up where you expect.
Yes, massive, like I said. Impossible to use. File system fills up rapidly. I made it so that it can be turned on and off, with the debugger.

int _alloc_info = 0;    /* toggled from gdb: set _alloc_info=1 */
#ifdef HAVE_ALLOCINFO
#define AllocFreeInfo(_cxt, _chunk) \
    if (_alloc_info) \
        fprintf(stderr, "AllocFree: %s: %p, %zu\n", \
            (_cxt)->header.name, (_chunk), (_chunk)->size)
#define AllocAllocInfo(_cxt, _chunk) \
    if (_alloc_info) \
        fprintf(stderr, "AllocAlloc: %s: %p, %zu\n", \
            (_cxt)->header.name, (_chunk), (_chunk)->size)
#else
#define AllocFreeInfo(_cxt, _chunk)
#define AllocAllocInfo(_cxt, _chunk)
#endif

so with this I do

(gdb) b AllocSetAlloc
(gdb) cont
(gdb) set _alloc_info=1
(gdb) disable
(gdb) cont

Thanks for this trick, I'll save this so I can refer back to it if I need to do this again some time.
 

then I wait, ... until it crashes again ... no, it's too much. It fills up my filesystem in no time with the logs.  It produced 3 GB in just a minute of run time.


You don't need to run it until it crashes, only until the preliminaries are done and the leak has started to happen.  I would think a minute would be more than enough, unless the leak doesn't start until some other join runs to completion or something.  You can turn on the logging only after ExecutorState has obtained a size which is larger than we think it has any reason to be.  If you still have the log around, you can still analyze it.
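One way to see how big ExecutorState has gotten before flipping _alloc_info on is to call MemoryContextStats from the debugger while you are stopped at the breakpoint (just a sketch; note the stats output goes to the backend's stderr, the same place the macro output goes, not to the gdb console):

(gdb) call MemoryContextStats(TopMemoryContext)
(gdb) set var _alloc_info=1
(gdb) disable
(gdb) cont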
 

And also, I doubt we can find anything specifically by allocation size. It's just going to be 512 or whatever.


For allocations larger than 4096 bytes, it records the size requested, not the rounded-to-a-power-of-two size.  So if the leak is for large allocations, you can get considerable specificity here.  Even if not, at least you have some more of a clue than you had previously.
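If it does turn out to be the large requests, you could also have gdb collect backtraces for just those automatically instead of stopping each time; something along these lines (again only a sketch, and $_streq needs a reasonably recent gdb):

(gdb) break AllocSetAlloc if size > 4096 && $_streq(context->name, "ExecutorState")
(gdb) commands
> bt 10
> cont
> end
(gdb) cont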
 

Isn't there some other way?

I wonder if valgrind or something like that could be of use.  I don't know enough about those tools to know.  One problem is that this is not really a leak.  If the query completed successfully, it would have freed the memory.  And when the query completed with an error, it also freed the memory.  So it is just an inefficiency, not a true leak, and leak-detection tools might not work.  But as I said, I have not studied them.

Are you sure that you had the memory problem under a plan with no hash joins?  If so, that would seem to rule out some of the ideas that Tom has been pondering.  Unless there are two bugs.

Cheers,

Jeff
