Re: Out of Memory errors are frustrating as heck! - Mailing list pgsql-performance

From: Tomas Vondra
Subject: Re: Out of Memory errors are frustrating as heck!
Date:
Msg-id: 20190421011401.3fnclxwwrb5p4utr@development
In response to: Re: Out of Memory errors are frustrating as heck!  (Gunther <raj@gusw.net>)
Responses: Re: Out of Memory errors are frustrating as heck!
List: pgsql-performance
On Sat, Apr 20, 2019 at 08:33:46PM -0400, Gunther wrote:
>On 4/20/2019 16:01, Tomas Vondra wrote:
>>For me, this did the trick:
>> update pg_class set (relpages, reltuples) = (1000000, 1) where relname = 'tmp_r';
>> update pg_class set (relpages, reltuples) = (1, 1000000) where relname = 'tmp_q';
>>
>YES! For me too. My EXPLAIN ANALYZE actually succeeded.
>
> Hash Right Join  (cost=11009552.27..11377073.28 rows=11 width=4271) (actual time=511580.110..1058354.140 rows=113478386 loops=1)
>   Hash Cond: (((q.documentinternalid)::text = (r.documentinternalid)::text) AND ((q.actinternalid)::text = (r.actinternalid)::text))
>   ->  Seq Scan on tmp_q q  (cost=0.00..210021.00 rows=21000000 width=3417) (actual time=1.148..1.387 rows=236 loops=1)
>   ->  Hash  (cost=11009552.11..11009552.11 rows=11 width=928) (actual time=511577.002..511577.002 rows=113478127 loops=1)
>         Buckets: 16384 (originally 1024)  Batches: 131072 (originally 1)  Memory Usage: 679961kB
>         ->  Seq Scan on tmp_r r  (cost=0.00..11009552.11 rows=11 width=928) (actual time=4.077..344558.954 rows=113478127 loops=1)
> Planning Time: 0.725 ms
> Execution Time: 1064128.721 ms
>
>But it used a lot of resident memory, and now it seems like I actually 
>have a leak! Because after the command returned as shown above, the 
>memory is still allocated:
>
>  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
> 7100 postgres  20   0 2164012   1.1g 251364 S   0.0 14.5  23:27.23 postgres: postgres integrator [local] idle
>
>and let's do the memory map dump:
>
>2019-04-20 22:09:52.522 UTC [7100] LOG:  duration: 1064132.171 ms  statement: explain analyze
>        SELECT *
>          FROM tmp_q q
>          RIGHT OUTER JOIN tmp_r r
>            USING(documentInternalId, actInternalId);
>TopMemoryContext: 153312 total in 8 blocks; 48168 free (70 chunks); 105144 used
>  HandleParallelMessages: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
>  Operator lookup cache: 24576 total in 2 blocks; 10760 free (3 chunks); 13816 used
>  TableSpace cache: 8192 total in 1 blocks; 2096 free (0 chunks); 6096 used
>  Type information cache: 24352 total in 2 blocks; 2624 free (0 chunks); 21728 used
>  RowDescriptionContext: 8192 total in 1 blocks; 6896 free (0 chunks); 1296 used
>  MessageContext: 8192 total in 1 blocks; 6896 free (1 chunks); 1296 used
>  Operator class cache: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used
>  smgr relation table: 32768 total in 3 blocks; 16832 free (8 chunks); 15936 used
>  TransactionAbortContext: 32768 total in 1 blocks; 32512 free (0 chunks); 256 used
>  Portal hash: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used
>  TopPortalContext: 8192 total in 1 blocks; 7936 free (1 chunks); 256 used
>  Relcache by OID: 16384 total in 2 blocks; 3512 free (2 chunks); 12872 used
>  CacheMemoryContext: 1154080 total in 20 blocks; 151784 free (1 chunks); 1002296 used
>    index info: 2048 total in 2 blocks; 648 free (2 chunks); 1400 used: pg_class_tblspc_relfilenode_index
>    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_toast_2619_index
>    index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: entity_id_fkidx
>    index info: 2048 total in 2 blocks; 696 free (1 chunks); 1352 used: entity_id_idx
>    ...
>    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_attribute_relid_attnum_index
>    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_class_oid_index
>  WAL record construction: 49768 total in 2 blocks; 6368 free (0 chunks); 43400 used
>  PrivateRefCount: 8192 total in 1 blocks; 2624 free (0 chunks); 5568 used
>  MdSmgr: 8192 total in 1 blocks; 4992 free (6 chunks); 3200 used
>  LOCALLOCK hash: 16384 total in 2 blocks; 4600 free (2 chunks); 11784 used
>  Timezones: 104120 total in 2 blocks; 2624 free (0 chunks); 101496 used
>  ErrorContext: 8192 total in 1 blocks; 7936 free (3 chunks); 256 used
>Grand total: 2082624 bytes in 240 blocks; 382760 free (175 chunks); 1699864 used
>
>Strange, it shows no leak here. Now I run this test again to see 
>whether the memory grows further in top. This time I also leave the 
>DISTINCT step in the query, trying to hit the out-of-memory situation. 
>Well, I clearly saw the memory growing now:
>

Unfortunately, interpreting RES is way more complicated than that. The
trouble is that PostgreSQL does not get memory from the kernel directly,
it gets it through glibc's malloc. So when we free() a chunk, there is no
guarantee the kernel actually gets the memory back.

Also, I think glibc has multiple ways of getting memory from the kernel
(it can use either mmap() or sbrk()), and AFAIK it's easy to end up with
"islands" of still-live allocations that make it impossible to shrink the
sbrk heap even after most of the memory has been freed.
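
If you want to see that effect in isolation, here's a little toy program
(nothing to do with PostgreSQL, Linux + glibc only) that frees almost all
of its memory and still keeps the RSS high, because one surviving
allocation near the top of the heap prevents the brk heap from shrinking.
Only an explicit malloc_trim() hands the pages back:

/* rss_demo.c -- toy illustration, Linux + glibc only */
#include <malloc.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define N 100000                        /* ~100 MB of small allocations */

static long rss_kb(void)
{
    char  line[256];
    long  rss = -1;
    FILE *f = fopen("/proc/self/status", "r");

    while (f && fgets(line, sizeof(line), f))
        if (sscanf(line, "VmRSS: %ld kB", &rss) == 1)
            break;
    if (f)
        fclose(f);
    return rss;
}

int main(void)
{
    static char *chunks[N];
    char *pin;

    printf("start:             RSS = %ld kB\n", rss_kb());

    for (int i = 0; i < N; i++)
    {
        chunks[i] = malloc(1024);
        memset(chunks[i], 'x', 1024);   /* fault the pages in */
    }
    pin = malloc(1024);                 /* one chunk near the top of the heap */
    memset(pin, 'x', 1024);
    printf("after malloc:      RSS = %ld kB\n", rss_kb());

    /* Free everything except the pin.  The heap can't be shrunk via
     * sbrk() because the pin still lives above all the freed space,
     * so the kernel doesn't get the memory back and RSS stays high. */
    for (int i = 0; i < N; i++)
        free(chunks[i]);
    printf("after free:        RSS = %ld kB\n", rss_kb());

    /* Explicitly ask glibc to release the unused free pages. */
    malloc_trim(0);
    printf("after malloc_trim: RSS = %ld kB\n", rss_kb());

    free(pin);
    return 0;
}

That's pretty much the "island" situation a long-running backend can get
into, and as far as I know a backend never calls malloc_trim() itself.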

Memory leaks in PostgreSQL are usually about allocating memory in the
wrong context, and so are visible in MemoryContextStats. Permanent leaks
that don't show there are quite rare.
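
BTW, a dump like the one you pasted can be produced from a live backend
at any time by attaching a debugger and calling the function directly,
e.g. for the backend from your top output:

    gdb -p 7100
    (gdb) call MemoryContextStats(TopMemoryContext)
    (gdb) detach

The output goes to that backend's stderr, so it ends up in the server log
(assuming stderr is captured there). That makes it fairly easy to watch
which contexts grow while the query runs.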

>
>Also, I think while we might have focused in on a peculiar planning 
>situation where a very unfortunate plan is chosen which stresses the 
>memory situation, the real reason for the final out of memory 
>situation has not yet been determined. Remember, I have seen 3 stages 
>in my original query:
>
>1. Steady state: sort-merge join active, high CPU%, memory at or below 100 kB
>2. Slow, massive growth from over 200 kB to 1.5 GB or 1.8 GB
>3. Explosive growth, within a second, to over 2.2 GB
>
>It might well be that my initial query would have succeeded just fine 
>despite the unfortunate plan with the big memory consumption on the 
>oddly planned hash join, were it not for that third phase of explosive 
>growth. And we haven't been able to catch this, because it happens too 
>quickly.
>
>It seems to me that some better memory tracing would be necessary. 
>Looking at src/backend/utils/memdebug.c, I see it mentions Valgrind. 
>But to me, Valgrind would be a huge toolbox just to look after one 
>thing. I wonder if we could not make a much simpler memory leak 
>debugger tool, one that is fast, yet doesn't produce so much output 
>that it overwhelms the log destination file system (and wastes too 
>much time). There are already debug macros there which, if enabled, 
>create an absolutely crazy amount of undecipherable log file content, 
>because ALL backend processes would spit out this blabber. So I 
>already stopped that by adding a variable that must be set to 1 
>(using the debugger on exactly one process for exactly the time 
>desired):
>
>int _alloc_info = 0;
>#ifdef HAVE_ALLOCINFO
>#define AllocFreeInfo(_cxt, _chunk) \
>            if(_alloc_info) \
>                        fprintf(stderr, "AllocFree: %s: %p, %zu\n", \
>                                (_cxt)->header.name, (_chunk), (_chunk)->size)
>#define AllocAllocInfo(_cxt, _chunk) \
>                if(_alloc_info) \
>                        fprintf(stderr, "AllocAlloc: %s: %p, %zu\n", \
>                                (_cxt)->header.name, (_chunk), (_chunk)->size)
>#else
>#define AllocFreeInfo(_cxt, _chunk)
>#define AllocAllocInfo(_cxt, _chunk)
>#endif
>
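
FWIW, with a flag like that you can flip the tracing on for a single
backend from gdb, something like

    gdb -p <pid of the backend>
    (gdb) set var _alloc_info = 1
    (gdb) continue

and set it back to 0 the same way once you have enough output.
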
>But now I am thinking that this should be the hook for a limited cache 
>in which we can cancel out AllocSetAlloc calls against the AllocSetFree 
>calls that follow relatively soon after, which apparently account for 
>the majority of the log chatter.
>
>The memory debugger would allocate a single fixed memory chunk, say 8 
>or 16 kB, as a cache per memory context that is actually traced. In it 
>we would record each allocation in the most compact form possible: 
>instead of a raw pointer, some small chunk index, plus a very brief 
>backtrace stored in compressed form, i.e. the instruction pointer 
>($rip) of the caller followed by variable-length deltas to the $rip of 
>each caller further up. These could even be stored in an index of the 
>100 most common caller chains to compress the data some more while 
>keeping the search cost low. Each allocated chunk would be kept in 
>this table, and when the table fills up, the oldest allocated chunk 
>is evicted and written to the log file. If a chunk is freed before 
>being evicted from the cache, it is silently removed. If a chunk is 
>freed that is no longer in the cache, the free event is recorded in 
>the log. That way only those chunks that outlive the capacity of the 
>cache ever reach the log file, and hopefully those are the chunks 
>most likely involved in the memory leak. Once the log has been 
>created, it can be loaded into a PostgreSQL table and analyzed to 
>find the chunks that never get freed, and the compressed backtrace 
>tells us where they were allocated.
>
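
Just to make sure I understand the design, a rough sketch of the cache
you describe might look something like this (standalone illustration,
made-up names, backtrace compression left out, untested):

#include <stddef.h>
#include <stdio.h>

#define TRACE_CACHE_SIZE 512            /* entries per traced context, ~12 kB */

typedef struct TraceEntry
{
    const void *chunk;                  /* address of the allocated chunk */
    size_t      size;
    unsigned long seq;                  /* allocation order, for eviction */
} TraceEntry;

typedef struct TraceCache
{
    const char *ctx_name;               /* name of the traced memory context */
    TraceEntry  entries[TRACE_CACHE_SIZE];
    int         nentries;
    unsigned long seq;
} TraceCache;

/* would be called from AllocAllocInfo(); evicts the oldest entry when full */
static void
trace_alloc(TraceCache *tc, const void *chunk, size_t size)
{
    if (tc->nentries == TRACE_CACHE_SIZE)
    {
        int oldest = 0;

        for (int i = 1; i < TRACE_CACHE_SIZE; i++)
            if (tc->entries[i].seq < tc->entries[oldest].seq)
                oldest = i;

        /* survived longer than the cache -- leak candidate, log it */
        fprintf(stderr, "LONG-LIVED: %s: %p, %zu\n", tc->ctx_name,
                tc->entries[oldest].chunk, tc->entries[oldest].size);

        tc->entries[oldest] = tc->entries[--tc->nentries];
    }

    tc->entries[tc->nentries].chunk = chunk;
    tc->entries[tc->nentries].size = size;
    tc->entries[tc->nentries].seq = tc->seq++;
    tc->nentries++;
}

/* would be called from AllocFreeInfo(); silent if still cached, else logged */
static void
trace_free(TraceCache *tc, const void *chunk)
{
    for (int i = 0; i < tc->nentries; i++)
    {
        if (tc->entries[i].chunk == chunk)
        {
            tc->entries[i] = tc->entries[--tc->nentries];
            return;                     /* short-lived, never reaches the log */
        }
    }
    fprintf(stderr, "LATE-FREE: %s: %p\n", tc->ctx_name, chunk);
}

int main(void)
{
    TraceCache tc = { .ctx_name = "ExecutorState" };

    trace_alloc(&tc, (const void *) 0x1000, 64);
    trace_free(&tc, (const void *) 0x1000);   /* cancels out silently */
    trace_free(&tc, (const void *) 0x2000);   /* not cached -> logged */
    return 0;
}

Only chunks that outlive the cache capacity ever reach the log, which is
exactly the filtering you're after, and the linear scans could of course
be replaced with a small hash table.
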
>BTW, my explain analyze is still running. That Sort - Unique step is 
>taking forever on this data.
>
>OK, now I will try the various patches that people sent.
>

Maybe. But before wasting any more time on the memory leak investigation,
I suggest you first try the patch moving the BufFile allocations to a
separate context. That'll either confirm or disprove the theory.
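
To make it concrete, the gist of that patch is roughly the following
(heavily simplified sketch, not the actual diff; fileCxt is just an
illustrative name for the new field):

    /* in ExecHashTableCreate(): a separate context just for batch files */
    hashtable->fileCxt = AllocSetContextCreate(hashtable->hashCxt,
                                               "hash batch files",
                                               ALLOCSET_DEFAULT_SIZES);

    /* wherever a batch file gets created (e.g. ExecHashJoinSaveTuple): */
    oldcxt = MemoryContextSwitchTo(hashtable->fileCxt);
    file = BufFileCreateTemp(false);
    MemoryContextSwitchTo(oldcxt);

With the BufFile buffers accounted for in their own context, the next
memory context dump will show immediately whether that's where the
explosive growth comes from.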

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



