Thread: Out of Memory errors are frustrating as heck!

Out of Memory errors are frustrating as heck!

From

Gunther

Date:

14 April 2019, 20:23:34

For weeks now, I am banging my head at an "out of memory" situation. There is only one query I am running on an 8 GB system, whatever I try, I get knocked out on this out of memory. It is extremely impenetrable to understand and fix this error. I guess I could add a swap file, and then I would have to take the penalty of swapping. But how can I actually address an out of memory condition if the system doesn't tell me where it is happening?

You might want to see the query, but it is a huge plan, and I can't really break this down. It shouldn't matter though. But just so you can get a glimpse here is the plan:

Insert on businessoperation  (cost=5358849.28..5361878.44 rows=34619 width=1197) ->  Unique  (cost=5358849.28..5361532.25 rows=34619 width=1197)       ->  Sort  (cost=5358849.28..5358935.83 rows=34619 width=1197)             Sort Key: documentinformationsubject.documentinternalid, documentinformationsubject.is_current, documentinformationsubject.documentid, documentinformationsubject.documenttypecode, documentinformationsubject.subjectroleinternalid, documentinformationsubject.subjectentityinternalid, documentinformationsubject.subjectentityid, documentinformationsubject.subjectentityidroot, documentinformationsubject.subjectentityname, documentinformationsubject.subjectentitytel, documentinformationsubject.subjectentityemail, documentinformationsubject.otherentityinternalid, documentinformationsubject.confidentialitycode, documentinformationsubject.actinternalid, documentinformationsubject.code_code, documentinformationsubject.code_displayname, q.code_code, q.code_displayname, an.extension, an.root, documentinformationsubject_2.subjectentitycode, documentinformationsubject_2.subjectentitycodesystem, documentinformationsubject_2.effectivetime_low, documentinformationsubject_2.effectivetime_high, documentinformationsubject_2.statuscode, documentinformationsubject_2.code_code, agencyid.extension, agencyname.trivialname, documentinformationsubject_1.subjectentitycode, documentinformationsubject_1.subjectentityinternalid             ->  Nested Loop Left Join  (cost=2998335.54..5338133.63 rows=34619 width=1197)                   Join Filter: (((documentinformationsubject.documentinternalid)::text = (q.documentinternalid)::text) AND ((documentinformationsubject.actinternalid)::text = (r.targetinternalid)::text))                   ->  Merge Left Join  (cost=2998334.98..3011313.54 rows=34619 width=930)                         Merge Cond: (((documentinformationsubject.documentinternalid)::text = (documentinformationsubject_1.documentinternalid)::text) AND ((documentinformationsubject.documentid)::text = (documentinformationsubject_1.documentid)::text) AND ((documentinformationsubject.actinternalid)::text = (documentinformationsubject_1.actinternalid)::text))                         ->  Sort  (cost=1408783.87..1408870.41 rows=34619 width=882)                               Sort Key: documentinformationsubject.documentinternalid, documentinformationsubject.documentid, documentinformationsubject.actinternalid                               ->  Seq Scan on documentinformationsubject  (cost=0.00..1392681.22 rows=34619 width=882)                                     Filter: (((participationtypecode)::text = ANY ('{PPRF,PRF}'::text[])) AND ((classcode)::text = 'ACT'::text) AND ((moodcode)::text = 'DEF'::text) AND ((code_codesystem)::text = '2.16.840.1.113883.3.26.1.1'::text))                         ->  Materialize  (cost=1589551.12..1594604.04 rows=1010585 width=159)                               ->  Sort  (cost=1589551.12..1592077.58 rows=1010585 width=159)                                     Sort Key: documentinformationsubject_1.documentinternalid, documentinformationsubject_1.documentid, documentinformationsubject_1.actinternalid                                     ->  Seq Scan on documentinformationsubject documentinformationsubject_1  (cost=0.00..1329868.64 rows=1010585 width=159)                                           Filter: ((participationtypecode)::text = 'PRD'::text)                   ->  Materialize  (cost=0.56..2318944.31 rows=13 width=341)                         ->  Nested Loop Left Join  (cost=0.56..2318944.24 rows=13 width=341)                               ->  Nested Loop Left Join  (cost=0.00..2318893.27 rows=1 width=281)                                     Join Filter: ((agencyname.entityinternalid)::text = (documentinformationsubject_2.otherentityinternalid)::text)                                     ->  Nested Loop Left Join  (cost=0.00..2286828.33 rows=1 width=291)                                           Join Filter: ((agencyid.entityinternalid)::text = (documentinformationsubject_2.otherentityinternalid)::text)                                           ->  Nested Loop Left Join  (cost=0.00..2284826.24 rows=1 width=239)                                                 Join Filter: (((q.documentinternalid)::text = (documentinformationsubject_2.documentinternalid)::text) AND ((q.actinternalid)::text = (documentinformationsubject_2.actinternalid)::text))                                                 ->  Nested Loop  (cost=0.00..954957.59 rows=1 width=136)                                                       Join Filter: ((q.actinternalid)::text = (r.sourceinternalid)::text)                                                       ->  Seq Scan on actrelationship r  (cost=0.00..456015.26 rows=1 width=74)                                                             Filter: ((typecode)::text = 'SUBJ'::text)                                                       ->  Seq Scan on documentinformation q  (cost=0.00..497440.84 rows=120119 width=99)                                                             Filter: (((classcode)::text = 'CNTRCT'::text) AND ((moodcode)::text = 'EVN'::text) AND ((code_codesystem)::text = '2.16.840.1.113883.3.26.1.1'::text))                                                 ->  Seq Scan on documentinformationsubject documentinformationsubject_2  (cost=0.00..1329868.64 rows=1 width=177)                                                       Filter: ((participationtypecode)::text = 'AUT'::text)                                           ->  Seq Scan on entity_id agencyid  (cost=0.00..1574.82 rows=34182 width=89)                                     ->  Seq Scan on bestname agencyname  (cost=0.00..27066.08 rows=399908 width=64)                               ->  Index Scan using act_id_fkidx on act_id an  (cost=0.56..50.85 rows=13 width=134)                                     Index Cond: ((q.actinternalid)::text = (actinternalid)::text)

I have monitored the activity with vmstat and iostat, and it looks like the memory grabbing happens rapidly after a Sort Merge step. I see in the iostat a heavy read and write activity, which I attribute to a sort-merge step, then that is followed by a sudden spike in write activity, and then the out of memory crash.

procs -----------------------memory---------------------- ---swap-- -----io---- -system-- --------cpu-------- -----timestamp-----r  b         swpd         free         buff        cache   si   so    bi    bo   in   cs  us  sy  id  wa  st                 UTC0  2            0       119344            0      7616672    0    0 11681  3107    9    0   6   1  72  21   0 2019-04-14 16:19:520  2            0       128884            0      7607288    0    0  2712 55386  500  509   3   2  15  80   0 2019-04-14 16:19:540  2            0       116984            0      7619916    0    0   880 59241  548  525   2   2   9  87   0 2019-04-14 16:19:560  2            0       131492            0      7604816    0    0   128 56512  518  401   1   1  12  86   0 2019-04-14 16:19:58...0  2            0       134508            0      7601480    0    0     0 58562  428  353   0   1   4  95   0 2019-04-14 16:21:460  2            0       125360            0      7611320    0    0     0 59392  481  369   0   1  11  89   0 2019-04-14 16:21:480  2            0       122896            0      7612872    0    0     0 58564  431  342   0   1  17  82   0 2019-04-14 16:21:501  1            0       121456            0      7614248    0    0    54 57347  487  399   0   1  13  85   0 2019-04-14 16:21:520  2            0       122820            0      7613324    0    0    12 59964  460  346   0   1  20  79   0 2019-04-14 16:21:540  2            0       120344            0      7616528    0    0  1844 55691  645  676   5   3   6  85   0 2019-04-14 16:21:560  2            0       124900            0      7611404    0    0   936 58261  795 1215   2   3  13  83   0 2019-04-14 16:21:580  2            0       124572            0      7612192    0    0  1096 55340  518  487   1   2   0  97   0 2019-04-14 16:22:000  2            0       123040            0      7612740    0    0   888 57574  573  620   1   2   5  92   0 2019-04-14 16:22:020  2            0       125112            0      7610592    0    0   124 59164  498  480   1   1  13  85   0 2019-04-14 16:22:041  1            0       129440            0      7607592    0    0   568 60196  563  612   2   2   8  88   0 2019-04-14 16:22:060  2            0       124020            0      7612364    0    0     0 58260  629  725   0   1   8  91   0 2019-04-14 16:22:082  1            0       124480            0      7611848    0    0     0 58852  447  331   0   1   1  98   0 2019-04-14 16:22:100  3            0       137636            0      7598484    0    0 11908 44995  619  714   1   1  11  87   0 2019-04-14 16:22:120  2            0       123128            0      7613392    0    0 29888 28901  532  972   1   1  29  68   0 2019-04-14 16:22:140  2            0       126260            0      7609984    0    0 39872 18836  706 1435   1   2  28  70   0 2019-04-14 16:22:160  2            0       130748            0      7605536    0    0 36096 22488  658 1272   2   1   8  89   0 2019-04-14 16:22:18
...
 0  2            0       127216            0      7609192    0    0 29192 29696  472  949   1   1  23  75   0 2019-04-14 16:22:400  2            0       147428            0      7588556    0    0 29120 29696  523  974   1   1  19  79   0 2019-04-14 16:22:420  1            0       120644            0      7615388    0    0 32320 25276  566  998   1   2  49  47   0 2019-04-14 16:22:440  1            0       128456            0      7607904    0    0 58624     0  621 1103   3   2  49  46   0 2019-04-14 16:22:460  1            0       127836            0      7608260    0    0 58624     0  631 1119   3   2  50  46   0 2019-04-14 16:22:480  1            0       126712            0      7609616    0    0 58624     0  616 1110   2   2  50  47   0 2019-04-14 16:22:50
...0  1            0       157408            0      7578060    0    0 58628     0  736 1206   3   3  50  44   0 2019-04-14 16:27:220  1            0       142420            0      7593400    0    0 58688     0  623 1099   1   4  50  45   0 2019-04-14 16:27:240  1            0       247016            0      7488184    0    0 58568     0  649 1113   1   4  50  45   0 2019-04-14 16:27:260  1            0       123232            0      7612088    0    0 58412   215  675 1141   2   3  50  46   0 2019-04-14 16:27:280  2            0       144920            0      7586576    0    0 48376 11046  788 1455   1   5  34  60   0 2019-04-14 16:27:301  1            0       125636            0      7595704    0    0 36736 21381  702 1386   1   4  21  74   0 2019-04-14 16:27:320  3            0       156700            0      7559328    0    0 35556 23364  709 1367   1   3  22  74   0 2019-04-14 16:27:340  2            0       315580            0      7382748    0    0 33608 24731  787 1407   1   5  18  76   0 2019-04-14 16:27:36
...0  2            0       684412            0      6152040    0    0 29832 28356  528  994   1   2  32  66   0 2019-04-14 16:38:040  2            0       563512            0      6272264    0    0 29696 29506  546  987   1   2  32  65   0 2019-04-14 16:38:060  2            0       595488            0      6241068    0    0 27292 30858  549  971   1   2  26  71   0 2019-04-14 16:38:080  2            0       550120            0      6285352    0    0 28844 29696  567  995   1   2  29  68   0 2019-04-14 16:38:101  1            0       432380            0      6402964    0    0 28992 29696  557  979   1   2  37  61   0 2019-04-14 16:38:120  2            0       445796            0      6384412    0    0 26768 32134  628 1029   1   4  27  69   0 2019-04-14 16:38:140  2            0       374972            0      6453592    0    0 28172 30839  529  962   1   2  43  54   0 2019-04-14 16:38:160  2            0       317824            0      6507992    0    0 29172 29386  560 1001   1   3  27  68   0 2019-04-14 16:38:180  3            0       215092            0      6609132    0    0 33116 25210  621 1148   1   3  19  77   0 2019-04-14 16:38:200  2            0       194836            0      6621524    0    0 27786 30959  704 1152   0   5  18  77   0 2019-04-14 16:38:220  3            0       315648            0      6500196    0    0 31434 27226  581 1073   0   3  31  65   0 2019-04-14 16:38:24
 0  2            0       256180            0      6554676    0    0 29828 29017  668 1174   0   4  20  76   0 2019-04-14 16:38:26 <<< CRASH0  1            0       378220            0      6552496    0    0  4348 53686 2210 3816   1   5  46  49   0 2019-04-14 16:38:280  1            0       389888            0      6536296    0    0  2704 56529 2454 4178   0   5  42  52   0 2019-04-14 16:38:300  2            0       923572            0      5998992    0    0  1612 56863 2384 3928   0   6  16  78   0 2019-04-14 16:38:320  0            0       908336            0      6006696    0    0  3584 49280 8961 17334   0  19  39  42   0 2019-04-14 16:38:340  1            0      1306480            0      5607088    0    0   264 63632 18605 37933   3  58  35   4   0 2019-04-14 16:38:362  1            0      1355448            0      5558576    0    0     8 59222 14817 30296   2  46  24  27   0 2019-04-14 16:38:382  2            0      1358224            0      5555884    0    0     0 58544 14226 28331   2  44   3  50   0 2019-04-14 16:38:402  1            0      1446348            0      5468748    0    0     0 58846 14376 29185   2  44  11  42   0 2019-04-14 16:38:420  0            0      2639648            0      4357608    0    0     0 28486 12909 26770   2  44  49   5   0 2019-04-14 16:38:440  0            0      2639524            0      4357800    0    0     0     0  158  154   0   0 100   0   0 2019-04-14 16:38:460  0            0      2687316            0      4309976    0    0     0     0  181  188   0   2  98   0   0 2019-04-14 16:38:480  0            0      2706920            0      4300116    0    0     0   105  137  263   0   0 100   0   0 2019-04-14 16:38:500  0            0      2706672            0      4300232    0    0     0     0  142  204   0   0 100   0   0 2019-04-14 16:38:520  0            0      2815116            0      4191928    0    0     0     0  116  242   0   0 100   0   0 2019-04-14 16:38:540  0            0      2815364            0      4192008    0    0     0     0  116  239   0   0 100   0   0 2019-04-14 16:38:560  0            0      2815116            0      4192164    0    0     0     0  159  236   0   0 100   0   0 2019-04-14 16:38:58

ending after the out of memory crash, that occurred exactly at the marked point 16:38:26.355 UTC.

We can't really see anything too worrisome. There is always lots of memory used by cache, which could have been mobilized. The only possible explanation I can think of is that in that moment of the crash the memory utilization suddenly skyrocketed in less than a second, so that the 2 second vmstat interval wouldn't show it??? Nah.

I have already much reduced work_mem, which has helped in some other cases before. Now I am going to reduce the shared_buffers now, but that seems counter-intuitive because we are sitting on all that cache memory unused!

Might this be a bug? It feels like a bug. It feels like those out of memory issues should be handled more gracefully (garbage collection attempt?) and that somehow there should be more information so the person can do anything about it.

Any ideas?

-Gunther

Re: Out of Memory errors are frustrating as heck!

From

Tom Lane

Date:

14 April 2019, 21:19:11

Gunther <raj@gusw.net> writes:
> For weeks now, I am banging my head at an "out of memory" situation. 
> There is only one query I am running on an 8 GB system, whatever I try, 
> I get knocked out on this out of memory. It is extremely impenetrable to 
> understand and fix this error. I guess I could add a swap file, and then 
> I would have to take the penalty of swapping. But how can I actually 
> address an out of memory condition if the system doesn't tell me where 
> it is happening?

> You might want to see the query, but it is a huge plan, and I can't 
> really break this down. It shouldn't matter though. But just so you can 
> get a glimpse here is the plan:

Is that the whole plan?  With just three sorts and two materializes,
it really shouldn't use more than more-or-less 5X work_mem.  What do
you have work_mem set to, anyway?  Is this a 64-bit build of PG?

Also, are the estimated rowcounts shown here anywhere close to what
you expect in reality?  If there are any AFTER INSERT triggers on the
insertion target table, you'd also be accumulating per-row trigger
queue entries ... but if there's only circa 35K rows to be inserted,
it's hard to credit that eating more than a couple hundred KB, either.

> Might this be a bug?

It's conceivable that you've hit some memory-leakage bug, but if so you
haven't provided nearly enough info for anyone else to reproduce it.
You haven't even shown us the actual error message :-(

https://wiki.postgresql.org/wiki/Guide_to_reporting_problems

            regards, tom lane

Re: Out of Memory errors are frustrating as heck!

From

Jeff Janes

Date:

14 April 2019, 21:19:50

On Sun, Apr 14, 2019 at 4:51 PM Gunther <raj@gusw.net> wrote:

For weeks now, I am banging my head at an "out of memory" situation. There is only one query I am running on an 8 GB system, whatever I try, I get knocked out on this out of memory.

Is PostgreSQL throwing an error with OOM, or is getting killed -9 by the OOM killer? Do you get a core file you can inspect with gdb?

You might want to see the query, but it is a huge plan, and I can't really break this down. It shouldn't matter though. But just so you can get a glimpse here is the plan:
Insert on businessoperation  (cost=5358849.28..5361878.44 rows=34619 width=1197) ->  Unique  (cost=5358849.28..5361532.25 rows=34619 width=1197)       

Maybe it is memory for trigger or constraint checking, although I don't know why that would appear instantly. What triggers or constraints do you have on businessoperation?

What if you just run the SELECT without the INSERT? Or insert into a temp table rather than into businessoperation? And if that doesn't crash, what if you then insert to businessoperation from the temp table?

Also, what version?

Cheers,

Jeff

Re: Out of Memory errors are frustrating as heck!

From

Gunther

Date:

15 April 2019, 01:05:48

Thanks for looking at my problem Tom Lane and Jeff Janes. Sorry for not having given enough detail.

The version is 10.2 latest. The database was originally built with 10.1 and then just started with 10.2. No dump and reload or pg_upgrade. Underlying system is 64bit Amazon Linux (CentOS like) running on an AMD64 VM (m5a) right now.

I said "crash" and that is wrong. Not a signal nor core dump. It is the ERROR: out of memory. Only the query crashes. Although I don't know if may be the backend server might have left a core dump? Where would that be? Would it help anyone if I started the server with the -c option to get a core dump? I guess I could re-compile with gcc -g debugging symbols all on and then run with that -c option, and then use gdb to find out which line it was failing at and then inspect the query plan data structure? Would that be helpful? Does anyone want the coredump to inspect?

The short version is:

Grand total: 1437014672 bytes in 168424 blocks; 11879744 free (3423 chunks); 1425134928 used
2019-04-14 16:38:26.355 UTC [11061] ERROR:  out of memory
2019-04-14 16:38:26.355 UTC [11061] DETAIL:  Failed on request of size 8272 in memory context "ExecutorState".

Here is the out of memory error dump in its full glory.

TopMemoryContext: 2197400 total in 7 blocks; 42952 free (15 chunks); 2154448 used TableSpace cache: 8192 total in 1 blocks; 2096 free (0 chunks); 6096 used Type information cache: 24352 total in 2 blocks; 2624 free (0 chunks); 21728 used pgstat TabStatusArray lookup hash table: 8192 total in 1 blocks; 416 free (0 chunks); 7776 used TopTransactionContext: 8192 total in 1 blocks; 7720 free (2 chunks); 472 used RowDescriptionContext: 8192 total in 1 blocks; 6896 free (0 chunks); 1296 used MessageContext: 2097152 total in 9 blocks; 396480 free (10 chunks); 1700672 used Operator class cache: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used smgr relation table: 32768 total in 3 blocks; 16832 free (8 chunks); 15936 used TransactionAbortContext: 32768 total in 1 blocks; 32512 free (0 chunks); 256 used Portal hash: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used TopPortalContext: 8192 total in 1 blocks; 7664 free (0 chunks); 528 used   PortalContext: 1024 total in 1 blocks; 624 free (0 chunks); 400 used:     ExecutorState: 1416621920 total in 168098 blocks; 8494152 free (3102 chunks); 1408127768 used       HashTableContext: 8192 total in 1 blocks; 7752 free (0 chunks); 440 used         HashBatchContext: 57432 total in 3 blocks; 16072 free (6 chunks); 41360 used       HashTableContext: 8192 total in 1 blocks; 7752 free (0 chunks); 440 used         HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used       HashTableContext: 8192 total in 1 blocks; 7624 free (0 chunks); 568 used         HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used       TupleSort main: 286912 total in 8 blocks; 246792 free (39 chunks); 40120 used       TupleSort main: 286912 total in 8 blocks; 246792 free (39 chunks); 40120 used       HashTableContext: 8454256 total in 6 blocks; 64848 free (32 chunks); 8389408 used         HashBatchContext: 106640 total in 3 blocks; 7936 free (0 chunks); 98704 used       TupleSort main: 452880 total in 8 blocks; 126248 free (27 chunks); 326632 used         Caller tuples: 4194304 total in 10 blocks; 1434888 free (20 chunks); 2759416 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used Relcache by OID: 16384 total in 2 blocks; 3512 free (2 chunks); 12872 used CacheMemoryContext: 1101328 total in 14 blocks; 386840 free (1 chunks); 714488 used   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: businessop_docid_ndx   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: businessop_sbjentityidroot_ndx   index info: 2048 total in 2 blocks; 704 free (1 chunks); 1344 used: businessop_sbjroleiid_ndx   index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_toast_2619_index   index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: entity_id_fkidx   index info: 2048 total in 2 blocks; 696 free (1 chunks); 1352 used: entity_id_idx   index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: act_id_fkidx   index info: 2048 total in 2 blocks; 696 free (1 chunks); 1352 used: act_id_idx   index info: 2048 total in 2 blocks; 592 free (1 chunks); 1456 used: pg_constraint_conrelid_contypid_conname_index   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: actrelationship_pkey   index info: 2048 total in 2 blocks; 624 free (1 chunks); 1424 used: actrelationship_target_idx   index info: 2048 total in 2 blocks; 624 free (1 chunks); 1424 used: actrelationship_source_idx   index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: documentinformation_pk   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_statistic_ext_relid_index   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: docinfsubj_ndx_seii   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: docinfsubj_ndx_sbjentcodeonly   index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_toast_2618_index   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_index_indrelid_index   relation rules: 229376 total in 31 blocks; 5136 free (0 chunks); 224240 used: v_businessoperation   index info: 2048 total in 2 blocks; 648 free (2 chunks); 1400 used: pg_db_role_setting_databaseid_rol_index   index info: 2048 total in 2 blocks; 624 free (2 chunks); 1424 used: pg_opclass_am_name_nsp_index   index info: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used: pg_foreign_data_wrapper_name_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_enum_oid_index   index info: 2048 total in 2 blocks; 680 free (2 chunks); 1368 used: pg_class_relname_nsp_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_foreign_server_oid_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_publication_pubname_index   index info: 2048 total in 2 blocks; 592 free (3 chunks); 1456 used: pg_statistic_relid_att_inh_index   index info: 2048 total in 2 blocks; 680 free (2 chunks); 1368 used: pg_cast_source_target_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_language_name_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_transform_oid_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_collation_oid_index   index info: 3072 total in 2 blocks; 1136 free (2 chunks); 1936 used: pg_amop_fam_strat_index   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_index_indexrelid_index   index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_ts_template_tmplname_index   index info: 2048 total in 2 blocks; 704 free (3 chunks); 1344 used: pg_ts_config_map_index   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_opclass_oid_index   index info: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used: pg_foreign_data_wrapper_oid_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_event_trigger_evtname_index   index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_statistic_ext_name_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_publication_oid_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_ts_dict_oid_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_event_trigger_oid_index   index info: 3072 total in 2 blocks; 1216 free (3 chunks); 1856 used: pg_conversion_default_index   index info: 3072 total in 2 blocks; 1216 free (3 chunks); 1856 used: pg_operator_oprname_l_r_n_index   index info: 2048 total in 2 blocks; 680 free (2 chunks); 1368 used: pg_trigger_tgrelid_tgname_index   index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_enum_typid_label_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_ts_config_oid_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_user_mapping_oid_index   index info: 2048 total in 2 blocks; 704 free (3 chunks); 1344 used: pg_opfamily_am_name_nsp_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_foreign_table_relid_index   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_type_oid_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_aggregate_fnoid_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_constraint_oid_index   index info: 2048 total in 2 blocks; 680 free (2 chunks); 1368 used: pg_rewrite_rel_rulename_index   index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_ts_parser_prsname_index   index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_ts_config_cfgname_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_ts_parser_oid_index   index info: 2048 total in 2 blocks; 728 free (1 chunks); 1320 used: pg_publication_rel_prrelid_prpubid_index   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_operator_oid_index   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_namespace_nspname_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_ts_template_oid_index   index info: 2048 total in 2 blocks; 624 free (2 chunks); 1424 used: pg_amop_opr_fam_index   index info: 2048 total in 2 blocks; 672 free (3 chunks); 1376 used: pg_default_acl_role_nsp_obj_index   index info: 2048 total in 2 blocks; 704 free (3 chunks); 1344 used: pg_collation_name_enc_nsp_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_publication_rel_oid_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_range_rngtypid_index   index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_ts_dict_dictname_index   index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_type_typname_nsp_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_opfamily_oid_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_statistic_ext_oid_index   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_class_oid_index   index info: 2048 total in 2 blocks; 704 free (3 chunks); 1344 used: pg_proc_proname_args_nsp_index   index info: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used: pg_partitioned_table_partrelid_index   index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_transform_type_lang_index   index info: 2048 total in 2 blocks; 680 free (2 chunks); 1368 used: pg_attribute_relid_attnum_index   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_proc_oid_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_language_oid_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_namespace_oid_index   index info: 3072 total in 2 blocks; 1136 free (2 chunks); 1936 used: pg_amproc_fam_proc_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_foreign_server_name_index   index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_attribute_relid_attnam_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_conversion_oid_index   index info: 2048 total in 2 blocks; 728 free (1 chunks); 1320 used: pg_user_mapping_user_server_index   index info: 2048 total in 2 blocks; 728 free (1 chunks); 1320 used: pg_subscription_rel_srrelid_srsubid_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_sequence_seqrelid_index   index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_conversion_name_nsp_index   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_authid_oid_index   index info: 2048 total in 2 blocks; 728 free (1 chunks); 1320 used: pg_auth_members_member_role_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_subscription_oid_index   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_tablespace_oid_index   index info: 2048 total in 2 blocks; 704 free (3 chunks); 1344 used: pg_shseclabel_object_index   index info: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used: pg_replication_origin_roname_index   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_database_datname_index   index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_subscription_subname_index   index info: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used: pg_replication_origin_roiident_index   index info: 2048 total in 2 blocks; 728 free (1 chunks); 1320 used: pg_auth_members_role_member_index   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_database_oid_index   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_authid_rolname_index WAL record construction: 49768 total in 2 blocks; 6368 free (0 chunks); 43400 used PrivateRefCount: 8192 total in 1 blocks; 2624 free (0 chunks); 5568 used MdSmgr: 8192 total in 1 blocks; 7256 free (1 chunks); 936 used LOCALLOCK hash: 16384 total in 2 blocks; 4600 free (2 chunks); 11784 used Timezones: 104120 total in 2 blocks; 2624 free (0 chunks); 101496 used ErrorContext: 8192 total in 1 blocks; 7936 free (4 chunks); 256 used
Grand total: 1437014672 bytes in 168424 blocks; 11879744 free (3423 chunks); 1425134928 used
2019-04-14 16:38:26.355 UTC [11061] ERROR:  out of memory
2019-04-14 16:38:26.355 UTC [11061] DETAIL:  Failed on request of size 8272 in memory context "ExecutorState".

I am delighted that Tom Lane is hinting that my query plan doesn't look so crazy, and it isn't. And delighted there may be a known bug involved.

I wonder if this is one data issue. May be a few rows have excessively long text fields? But even checking for that is rather difficult because there are many tables and columns involved. Would really be nice if the error would say exactly what plan step that ExecutorState referred to, so one could narrow it down.

regards,
-Gunther

On 4/14/2019 17:19, Tom Lane wrote:

Gunther <raj@gusw.net> writes:

For weeks now, I am banging my head at an "out of memory" situation. 
There is only one query I am running on an 8 GB system, whatever I try, 
I get knocked out on this out of memory. It is extremely impenetrable to 
understand and fix this error. I guess I could add a swap file, and then 
I would have to take the penalty of swapping. But how can I actually 
address an out of memory condition if the system doesn't tell me where 
it is happening?

You might want to see the query, but it is a huge plan, and I can't 
really break this down. It shouldn't matter though. But just so you can 
get a glimpse here is the plan:

Is that the whole plan?  With just three sorts and two materializes,
it really shouldn't use more than more-or-less 5X work_mem.  What do
you have work_mem set to, anyway?  Is this a 64-bit build of PG?

Also, are the estimated rowcounts shown here anywhere close to what
you expect in reality?  If there are any AFTER INSERT triggers on the
insertion target table, you'd also be accumulating per-row trigger
queue entries ... but if there's only circa 35K rows to be inserted,
it's hard to credit that eating more than a couple hundred KB, either.

Might this be a bug?

It's conceivable that you've hit some memory-leakage bug, but if so you
haven't provided nearly enough info for anyone else to reproduce it.
You haven't even shown us the actual error message :-(

https://wiki.postgresql.org/wiki/Guide_to_reporting_problems
		regards, tom lane

Re: Out of Memory errors are frustrating as heck!

From

Peter

Date:

15 April 2019, 01:29:55

On Sun, Apr 14, 2019 at 05:19:11PM -0400, Tom Lane wrote:
! Gunther <raj@gusw.net> writes:
! > For weeks now, I am banging my head at an "out of memory" situation. 
! > There is only one query I am running on an 8 GB system, whatever I try, 
! > I get knocked out on this out of memory. It is extremely impenetrable to 
! > understand and fix this error. I guess I could add a swap file, and then 
! > I would have to take the penalty of swapping. But how can I actually 
! > address an out of memory condition if the system doesn't tell me where 
! > it is happening?

Well, esactly with a swap space. No offense intended, but if You
don't have a swap space, You should not complain about unintellegibe
Out-of-memory situations.
Swapspace is not usually used to run applications from (that would
indeed give horrible performance), it is used to not get out-of-memory
errors. With a swapspace, the out-of-memory situation will persist,
and so one has time to take measurements and analyze system
behaviour and from that, one can better understand what is causing 
the problem, and decide what actions should be taken, on an informed
base (e.g. correct flaws in the system tuning, fix bad query, buy 
more memory, or what may be applicable)

If I remember correctly, I did even see a DTRACE flag in my build,
so what more is to wish? :))

P.

Re: Out of Memory errors are frustrating as heck!

From

Justin Pryzby

Date:

15 April 2019, 01:48:11

On Sun, Apr 14, 2019 at 09:05:48PM -0400, Gunther wrote:
> Thanks for looking at my problem Tom Lane and Jeff Janes. Sorry for not
> having given enough detail.
> 
> The version is 10.2 latest.

v10.7 is available; could you upgrade ?

What are these set to ?  shared_buffers? work_mem?

Was postgres locally compiled, packaged by distribution, or PGDG RPM/DEB ?

Can you show \d businessoperation ?

> The short version is:
> 
> Grand total: 1437014672 bytes in 168424 blocks; 11879744 free (3423 chunks); 1425134928 used
> 2019-04-14 16:38:26.355 UTC [11061] ERROR:  out of memory
> 2019-04-14 16:38:26.355 UTC [11061] DETAIL:  Failed on request of size 8272 in memory context "ExecutorState".

Could you rerun the query with \set VERBOSITY verbose to show the file/line
that's failing ?

If you wanted to show a stack trace, you could attach gdb to PID from SELECT
pg_backend_pid(), "b"reak on errdetail, run the query, and then "bt" when it
fails.

Justin

Re: Out of Memory errors are frustrating as heck!

From

Jeff Janes

Date:

15 April 2019, 02:48:49

On Sun, Apr 14, 2019 at 9:06 PM Gunther <raj@gusw.net> wrote:

Thanks for looking at my problem Tom Lane and Jeff Janes. Sorry for not having given enough detail.

The version is 10.2 latest. The database was originally built with 10.1 and then just started with 10.2.

Do you mean 11.2? The latest in the 10 series is 10.7. If you do mean 10.2, there a fix for a memory leak bug since then that might plausibly be relevant (bdc7f686d1b8f423cb)

I said "crash" and that is wrong. Not a signal nor core dump. It is the ERROR: out of memory. Only the query crashes. Although I don't know if may be the backend server might have left a core dump?

I don't think there would be a core dump on only an ERROR, and probably not worthwhile to trick it into generating one.

The short version is:

Grand total: 1437014672 bytes in 168424 blocks; 11879744 free (3423 chunks); 1425134928 used
2019-04-14 16:38:26.355 UTC [11061] ERROR:  out of memory
2019-04-14 16:38:26.355 UTC [11061] DETAIL:  Failed on request of size 8272 in memory context "ExecutorState".

I don't know why a 8GB system with a lot of cache that could be evicted would get an OOM when something using 1.5GB asks for 8272 bytes more. But that is a question of how the kernel works, rather than how PostgreSQL works. But I also think the log you quote above belongs to a different event than the vmstat trace in your first email.

      ExecutorState: 1416621920 total in 168098 blocks; 8494152 free (3102 chunks); 1408127768 used       HashTableContext: 8192 total in 1 blocks; 7752 free (0 chunks); 440 used         HashBatchContext: 57432 total in 3 blocks; 16072 free (6 chunks); 41360 used

This does not seem to match your query plan. Why would a plan with no Hash Joins have a HashBatchContext? I think this log must be from a different query than the one that generated the plan you posted. Perhaps one was before you lowered work_mem and one was after?

Cheers,

Jeff

Re: Out of Memory errors are frustrating as heck!

From

Gunther

Date:

15 April 2019, 03:04:18

Thanks Justin Pryzby too, and Jeff Janes, responding to both of you for efficiency. Answers and more logs and the gdb backtrace below.

The version is 10.2 latest.

v10.7 is available; could you upgrade ?

Sorry I meant 11.2 actually latest.

What are these set to ?  shared_buffers? work_mem?

shared_buffers=2G (of 8 total), then 1G, didn't help.

work_mem=4M by now (I had once been successful of avoiding out of memory by reducing work mem from 64M to 8M. But as Tom Lane says, it shouldn't be using more than 5 x work_mem in this query plan.

Jeff Janes said:

I don't know why a 8GB system with a lot of cache that could be evicted would get an OOM when something using 1.5GB asks for 8272 bytes more. But that is a question of how the kernel works, rather than how PostgreSQL works. But I also think the log you quote above belongs to a different event than the vmstat trace in your first email.

and I agree, except that the vmstat log and the error really belong together, same timestamp. Nothing else running on that machine this Sunday. Yes I ran this several times with different parameters, so some mixup is possible, but always ending in the same crash anyway. So here again, without the vmstat log, which really wouldn't be any different than I showed you. (See below for the ENABLE_NESTLOOP=off setting, not having those settings same between explain and actual execution might account for the discrepancy that you saw.)

integrator=# SET ENABLE_NESTLOOP TO OFF;
SET
integrator=# \set VERBOSITY verbose
integrator=# explain INSERT INTO reports.BusinessOperation SELECT * FROM reports.v_BusinessOperation;
integrator=# \pset pager off
Pager usage is off.
integrator=# \pset format unaligned
Output format is unaligned.
integrator=# explain INSERT INTO reports.BusinessOperation SELECT * FROM reports.v_BusinessOperation;
QUERY PLAN
Insert on businessoperation  (cost=5850091.58..5853120.74 rows=34619 width=1197) ->  Unique  (cost=5850091.58..5852774.55 rows=34619 width=1197)       ->  Sort  (cost=5850091.58..5850178.13 rows=34619 width=1197)             Sort Key: documentinformationsubject.documentinternalid, documentinformationsubject.is_current, documentinformationsubject.documentid, documentinformationsubject.documenttypecode, documentinformationsubject.subjectroleinternalid, documentinformationsubject.subjectentityinternalid, documentinformationsubject.subjectentityid, documentinformationsubject.subjectentityidroot, documentinformationsubject.subjectentityname, documentinformationsubject.subjectentitytel, documentinformationsubject.subjectentityemail, documentinformationsubject.otherentityinternalid, documentinformationsubject.confidentialitycode, documentinformationsubject.actinternalid, documentinformationsubject.code_code, documentinformationsubject.code_displayname, q.code_code, q.code_displayname, an.extension, an.root, documentinformationsubject_2.subjectentitycode, documentinformationsubject_2.subjectentitycodesystem, documentinformationsubject_2.effectivetime_low, documentinformationsubject_2.effectivetime_high, documentinformationsubject_2.statuscode, documentinformationsubject_2.code_code, agencyid.extension, agencyname.trivialname, documentinformationsubject_1.subjectentitycode, documentinformationsubject_1.subjectentityinternalid             ->  Hash Right Join  (cost=4489522.06..5829375.93 rows=34619 width=1197)                   Hash Cond: (((q.documentinternalid)::text = (documentinformationsubject.documentinternalid)::text) AND ((r.targetinternalid)::text = (documentinformationsubject.actinternalid)::text))                   ->  Hash Right Join  (cost=1473632.24..2808301.92 rows=13 width=341)                         Hash Cond: (((documentinformationsubject_2.documentinternalid)::text = (q.documentinternalid)::text) AND ((documentinformationsubject_2.actinternalid)::text = (q.actinternalid)::text))                         ->  Hash Left Join  (cost=38864.03..1373533.69 rows=1 width=219)                               Hash Cond: ((documentinformationsubject_2.otherentityinternalid)::text = (agencyname.entityinternalid)::text)                               ->  Hash Left Join  (cost=2503.10..1332874.75 rows=1 width=229)                                     Hash Cond: ((documentinformationsubject_2.otherentityinternalid)::text = (agencyid.entityinternalid)::text)                                     ->  Seq Scan on documentinformationsubject documentinformationsubject_2  (cost=0.00..1329868.64 rows=1 width=177)                                           Filter: ((participationtypecode)::text = 'AUT'::text)                                     ->  Hash  (cost=1574.82..1574.82 rows=34182 width=89)                                           ->  Seq Scan on entity_id agencyid  (cost=0.00..1574.82 rows=34182 width=89)                               ->  Hash  (cost=27066.08..27066.08 rows=399908 width=64)                                     ->  Seq Scan on bestname agencyname  (cost=0.00..27066.08 rows=399908 width=64)                         ->  Hash  (cost=1434768.02..1434768.02 rows=13 width=233)                               ->  Hash Right Join  (cost=953906.58..1434768.02 rows=13 width=233)                                     Hash Cond: ((an.actinternalid)::text = (q.actinternalid)::text)                                     ->  Seq Scan on act_id an  (cost=0.00..425941.04 rows=14645404 width=134)                                     ->  Hash  (cost=953906.57..953906.57 rows=1 width=136)                                           ->  Hash Join  (cost=456015.28..953906.57 rows=1 width=136)                                                 Hash Cond: ((q.actinternalid)::text = (r.sourceinternalid)::text)                                                 ->  Seq Scan on documentinformation q  (cost=0.00..497440.84 rows=120119 width=99)                                                       Filter: (((classcode)::text = 'CNTRCT'::text) AND ((moodcode)::text = 'EVN'::text) AND ((code_codesystem)::text = '2.16.840.1.113883.3.26.1.1'::text))                                                 ->  Hash  (cost=456015.26..456015.26 rows=1 width=74)                                                       ->  Seq Scan on actrelationship r  (cost=0.00..456015.26 rows=1 width=74)                                                             Filter: ((typecode)::text = 'SUBJ'::text)                   ->  Hash  (cost=3011313.54..3011313.54 rows=34619 width=930)                         ->  Merge Left Join  (cost=2998334.98..3011313.54 rows=34619 width=930)                               Merge Cond: (((documentinformationsubject.documentinternalid)::text = (documentinformationsubject_1.documentinternalid)::text) AND ((documentinformationsubject.documentid)::text = (documentinformationsubject_1.documentid)::text) AND ((documentinformationsubject.actinternalid)::text = (documentinformationsubject_1.actinternalid)::text))                               ->  Sort  (cost=1408783.87..1408870.41 rows=34619 width=882)                                     Sort Key: documentinformationsubject.documentinternalid, documentinformationsubject.documentid, documentinformationsubject.actinternalid                                     ->  Seq Scan on documentinformationsubject  (cost=0.00..1392681.22 rows=34619 width=882)                                           Filter: (((participationtypecode)::text = ANY ('{PPRF,PRF}'::text[])) AND ((classcode)::text = 'ACT'::text) AND ((moodcode)::text = 'DEF'::text) AND ((code_codesystem)::text = '2.16.840.1.113883.3.26.1.1'::text))                               ->  Materialize  (cost=1589551.12..1594604.04 rows=1010585 width=159)                                     ->  Sort  (cost=1589551.12..1592077.58 rows=1010585 width=159)                                           Sort Key: documentinformationsubject_1.documentinternalid, documentinformationsubject_1.documentid, documentinformationsubject_1.actinternalid                                           ->  Seq Scan on documentinformationsubject documentinformationsubject_1  (cost=0.00..1329868.64 rows=1010585 width=159)                                                 Filter: ((participationtypecode)::text = 'PRD'::text)

and the error memory status dump (I hope my grey boxes help a bit to lighten this massive amount of data...

TopMemoryContext: 4294552 total in 7 blocks; 42952 free (15 chunks); 4251600 used TableSpace cache: 8192 total in 1 blocks; 2096 free (0 chunks); 6096 used Type information cache: 24352 total in 2 blocks; 2624 free (0 chunks); 21728 used pgstat TabStatusArray lookup hash table: 8192 total in 1 blocks; 416 free (0 chunks); 7776 used TopTransactionContext: 8192 total in 1 blocks; 7720 free (2 chunks); 472 used RowDescriptionContext: 8192 total in 1 blocks; 6896 free (0 chunks); 1296 used MessageContext: 2097152 total in 9 blocks; 396480 free (10 chunks); 1700672 used Operator class cache: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used smgr relation table: 32768 total in 3 blocks; 16832 free (8 chunks); 15936 used TransactionAbortContext: 32768 total in 1 blocks; 32512 free (0 chunks); 256 used Portal hash: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used TopPortalContext: 8192 total in 1 blocks; 7664 free (0 chunks); 528 used   PortalContext: 1024 total in 1 blocks; 624 free (0 chunks); 400 used:     ExecutorState: 2234123384 total in 266261 blocks; 3782328 free (17244 chunks); 2230341056 used       HashTableContext: 8192 total in 1 blocks; 7752 free (0 chunks); 440 used         HashBatchContext: 57432 total in 3 blocks; 16072 free (6 chunks); 41360 used       HashTableContext: 8192 total in 1 blocks; 7752 free (0 chunks); 440 used         HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used       HashTableContext: 8192 total in 1 blocks; 7624 free (0 chunks); 568 used         HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used       TupleSort main: 286912 total in 8 blocks; 246792 free (39 chunks); 40120 used       TupleSort main: 286912 total in 8 blocks; 246792 free (39 chunks); 40120 used       HashTableContext: 8454256 total in 6 blocks; 64848 free (32 chunks); 8389408 used         HashBatchContext: 100711712 total in 3065 blocks; 7936 free (0 chunks); 100703776 used       TupleSort main: 452880 total in 8 blocks; 126248 free (27 chunks); 326632 used         Caller tuples: 1048576 total in 8 blocks; 21608 free (14 chunks); 1026968 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used Relcache by OID: 16384 total in 2 blocks; 3512 free (2 chunks); 12872 used CacheMemoryContext: 1101328 total in 14 blocks; 386840 free (1 chunks); 714488 used   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: businessop_docid_ndx   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: businessop_sbjentityidroot_ndx   index info: 2048 total in 2 blocks; 704 free (1 chunks); 1344 used: businessop_sbjroleiid_ndx   index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_toast_2619_index   index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: entity_id_fkidx   index info: 2048 total in 2 blocks; 696 free (1 chunks); 1352 used: entity_id_idx   index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: act_id_fkidx   index info: 2048 total in 2 blocks; 696 free (1 chunks); 1352 used: act_id_idx   index info: 2048 total in 2 blocks; 592 free (1 chunks); 1456 used: pg_constraint_conrelid_contypid_conname_index   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: actrelationship_pkey   index info: 2048 total in 2 blocks; 624 free (1 chunks); 1424 used: actrelationship_target_idx   index info: 2048 total in 2 blocks; 624 free (1 chunks); 1424 used: actrelationship_source_idx   index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: documentinformation_pk   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_statistic_ext_relid_index   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: docinfsubj_ndx_seii   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: docinfsubj_ndx_sbjentcodeonly   index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_toast_2618_index   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_index_indrelid_index   relation rules: 229376 total in 31 blocks; 5136 free (0 chunks); 224240 used: v_businessoperation   index info: 2048 total in 2 blocks; 648 free (2 chunks); 1400 used: pg_db_role_setting_databaseid_rol_index   index info: 2048 total in 2 blocks; 624 free (2 chunks); 1424 used: pg_opclass_am_name_nsp_index   index info: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used: pg_foreign_data_wrapper_name_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_enum_oid_index   index info: 2048 total in 2 blocks; 680 free (2 chunks); 1368 used: pg_class_relname_nsp_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_foreign_server_oid_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_publication_pubname_index   index info: 2048 total in 2 blocks; 592 free (3 chunks); 1456 used: pg_statistic_relid_att_inh_index   index info: 2048 total in 2 blocks; 680 free (2 chunks); 1368 used: pg_cast_source_target_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_language_name_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_transform_oid_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_collation_oid_index   index info: 3072 total in 2 blocks; 1136 free (2 chunks); 1936 used: pg_amop_fam_strat_index   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_index_indexrelid_index   index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_ts_template_tmplname_index   index info: 2048 total in 2 blocks; 704 free (3 chunks); 1344 used: pg_ts_config_map_index   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_opclass_oid_index   index info: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used: pg_foreign_data_wrapper_oid_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_event_trigger_evtname_index   index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_statistic_ext_name_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_publication_oid_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_ts_dict_oid_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_event_trigger_oid_index   index info: 3072 total in 2 blocks; 1216 free (3 chunks); 1856 used: pg_conversion_default_index   index info: 3072 total in 2 blocks; 1216 free (3 chunks); 1856 used: pg_operator_oprname_l_r_n_index   index info: 2048 total in 2 blocks; 680 free (2 chunks); 1368 used: pg_trigger_tgrelid_tgname_index   index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_enum_typid_label_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_ts_config_oid_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_user_mapping_oid_index   index info: 2048 total in 2 blocks; 704 free (3 chunks); 1344 used: pg_opfamily_am_name_nsp_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_foreign_table_relid_index   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_type_oid_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_aggregate_fnoid_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_constraint_oid_index   index info: 2048 total in 2 blocks; 680 free (2 chunks); 1368 used: pg_rewrite_rel_rulename_index   index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_ts_parser_prsname_index   index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_ts_config_cfgname_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_ts_parser_oid_index   index info: 2048 total in 2 blocks; 728 free (1 chunks); 1320 used: pg_publication_rel_prrelid_prpubid_index   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_operator_oid_index   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_namespace_nspname_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_ts_template_oid_index   index info: 2048 total in 2 blocks; 624 free (2 chunks); 1424 used: pg_amop_opr_fam_index   index info: 2048 total in 2 blocks; 672 free (3 chunks); 1376 used: pg_default_acl_role_nsp_obj_index   index info: 2048 total in 2 blocks; 704 free (3 chunks); 1344 used: pg_collation_name_enc_nsp_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_publication_rel_oid_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_range_rngtypid_index   index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_ts_dict_dictname_index   index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_type_typname_nsp_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_opfamily_oid_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_statistic_ext_oid_index   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_class_oid_index   index info: 2048 total in 2 blocks; 704 free (3 chunks); 1344 used: pg_proc_proname_args_nsp_index   index info: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used: pg_partitioned_table_partrelid_index   index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_transform_type_lang_index   index info: 2048 total in 2 blocks; 680 free (2 chunks); 1368 used: pg_attribute_relid_attnum_index   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_proc_oid_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_language_oid_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_namespace_oid_index   index info: 3072 total in 2 blocks; 1136 free (2 chunks); 1936 used: pg_amproc_fam_proc_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_foreign_server_name_index   index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_attribute_relid_attnam_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_conversion_oid_index   index info: 2048 total in 2 blocks; 728 free (1 chunks); 1320 used: pg_user_mapping_user_server_index   index info: 2048 total in 2 blocks; 728 free (1 chunks); 1320 used: pg_subscription_rel_srrelid_srsubid_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_sequence_seqrelid_index   index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_conversion_name_nsp_index   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_authid_oid_index   index info: 2048 total in 2 blocks; 728 free (1 chunks); 1320 used: pg_auth_members_member_role_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_subscription_oid_index   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_tablespace_oid_index   index info: 2048 total in 2 blocks; 704 free (3 chunks); 1344 used: pg_shseclabel_object_index   index info: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used: pg_replication_origin_roname_index   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_database_datname_index   index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_subscription_subname_index   index info: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used: pg_replication_origin_roiident_index   index info: 2048 total in 2 blocks; 728 free (1 chunks); 1320 used: pg_auth_members_role_member_index   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_database_oid_index   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_authid_rolname_index WAL record construction: 49768 total in 2 blocks; 6368 free (0 chunks); 43400 used PrivateRefCount: 8192 total in 1 blocks; 2624 free (0 chunks); 5568 used MdSmgr: 8192 total in 1 blocks; 7256 free (1 chunks); 936 used LOCALLOCK hash: 16384 total in 2 blocks; 4600 free (2 chunks); 11784 used Timezones: 104120 total in 2 blocks; 2624 free (0 chunks); 101496 used ErrorContext: 8192 total in 1 blocks; 7936 free (4 chunks); 256 used
Grand total: 2354072632 bytes in 269647 blocks; 5754640 free (17559 chunks); 2348317992 used

Was postgres locally compiled, packaged by distribution, or PGDG RPM/DEB ?

Locally compiled. I just recompiled with --enable-debug, ready to deploy that to create a core dump and check it out.

Can you show \d businessoperation ?

               Table "reports.businessoperation"         Column           |          Type          | Modifiers
---------------------------+------------------------+-----------documentinternalid        | character varying(255) |is_current                | character(1)           |documentid                | character varying(555) |documenttypecode          | character varying(512) |subjectroleinternalid     | character varying(255) |subjectentityinternalid   | character varying(255) |subjectentityid           | character varying(555) |subjectentityidroot       | character varying(555) |subjectentityname         | character varying      |subjectentitytel          | text                   |subjectentityemail        | text                   |otherentityinternalid     | character varying(255) |confidentialitycode       | character varying(512) |actinternalid             | character varying(255) |operationcode             | character varying(512) |operationname             | text                   |operationqualifiercode    | character varying(512) |operationqualifiername    | character varying(512) |approvalnumber            | character varying(555) |approvalnumbersystem      | character varying(555) |approvalstatecode         | character varying(512) |approvalstatecodesystem   | character varying(512) |approvaleffectivetimelow  | character varying(512) |approvaleffectivetimehigh | character varying(512) |approvalstatuscode        | character varying(32)  |licensecode               | character varying(512) |agencyid                  | character varying(555) |agencyname                | text                   |productitemcode           | character varying(512) |productinternalid         | character varying(255) |

Could you rerun the query with \set VERBOSITY verbose to show the file/line
that's failing ?

Here goes:

integrator=# \set VERBOSITY verbose
integrator=# SET ENABLE_NESTLOOP TO OFF;
SET
integrator=# INSERT INTO reports.BusinessOperation SELECT * FROM reports.v_BusinessOperation;
ERROR:  53200: out of memory
DETAIL:  Failed on request of size 32800 in memory context "HashBatchContext".
LOCATION:  MemoryContextAlloc, mcxt.c:798

you notice that I set ENABLE_NESTLOOP to off, that is because the planner goes off thinking the NL plan is marginally more efficient, but in fact it will take 5 hours to get to the same out of memory crash, while the no NL plan gets there in half an hour. That verbose setting didn't help much I guess.

If you wanted to show a stack trace, you could attach gdb to PID from SELECT
pg_backend_pid(), "b"reak on errdetail, run the query, and then "bt" when it
fails.

gdb -p 27930
GNU gdb (GDB) Red Hat Enterprise Linux 8.0.1-30.amzn2.0.3
...
Attaching to process 27930
Reading symbols from /usr/local/pgsql/bin/postgres...done.
...
(gdb) b errdetail
Breakpoint 1 at 0x82b210: file elog.c, line 872.
(gdb) cont
Continuing.
Breakpoint 1, errdetail (fmt=fmt@entry=0x9d9958 "Failed on request of size %zu in memory context \"%s\".") at elog.c:872
872     {
(gdb) bt
#0  errdetail (fmt=fmt@entry=0x9d9958 "Failed on request of size %zu in memory context \"%s\".") at elog.c:872
#1  0x000000000084e320 in MemoryContextAlloc (context=0x1111600, size=size@entry=32800) at mcxt.c:794
#2  0x000000000060ce7a in dense_alloc (size=384, size@entry=381, hashtable=<optimized out>, hashtable=<optimized out>)   at nodeHash.c:2696
#3  0x000000000060d788 in ExecHashTableInsert (hashtable=hashtable@entry=0x10ead08, slot=<optimized out>, hashvalue=194758122)   at nodeHash.c:1614
#4  0x0000000000610c6f in ExecHashJoinNewBatch (hjstate=0x10806b0) at nodeHashjoin.c:1051
#5  ExecHashJoinImpl (parallel=false, pstate=0x10806b0) at nodeHashjoin.c:539
#6  ExecHashJoin (pstate=0x10806b0) at nodeHashjoin.c:565
#7  0x000000000061ce4e in ExecProcNode (node=0x10806b0) at ../../../src/include/executor/executor.h:247
#8  ExecSort (pstate=0x1080490) at nodeSort.c:107
#9  0x000000000061d2c4 in ExecProcNode (node=0x1080490) at ../../../src/include/executor/executor.h:247
#10 ExecUnique (pstate=0x107ff60) at nodeUnique.c:73
#11 0x0000000000619732 in ExecProcNode (node=0x107ff60) at ../../../src/include/executor/executor.h:247
#12 ExecModifyTable (pstate=0x107fd20) at nodeModifyTable.c:2025
#13 0x00000000005f75ba in ExecProcNode (node=0x107fd20) at ../../../src/include/executor/executor.h:247
#14 ExecutePlan (execute_once=<optimized out>, dest=0x7f0442721998, direction=<optimized out>, numberTuples=0,   sendTuples=<optimized out>, operation=CMD_INSERT, use_parallel_mode=<optimized out>, planstate=0x107fd20, estate=0x107f830)   at execMain.c:1723
#15 standard_ExecutorRun (queryDesc=0x1086880, direction=<optimized out>, count=0, execute_once=<optimized out>) at execMain.c:364
#16 0x000000000072a972 in ProcessQuery (plan=<optimized out>,   sourceText=0xf4a710 "INSERT INTO reports.BusinessOperation SELECT * FROM reports.v_BusinessOperation;", params=0x0,   queryEnv=0x0, dest=0x7f0442721998, completionTag=0x7fff2e4cad30 "") at pquery.c:161
#17 0x000000000072abb0 in PortalRunMulti (portal=portal@entry=0xfb06b0, isTopLevel=isTopLevel@entry=true,   setHoldSnapshot=setHoldSnapshot@entry=false, dest=dest@entry=0x7f0442721998, altdest=altdest@entry=0x7f0442721998,   completionTag=completionTag@entry=0x7fff2e4cad30 "") at pquery.c:1286
#18 0x000000000072b661 in PortalRun (portal=portal@entry=0xfb06b0, count=count@entry=9223372036854775807,   isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0x7f0442721998,   altdest=altdest@entry=0x7f0442721998, completionTag=0x7fff2e4cad30 "") at pquery.c:799
#19 0x00000000007276e8 in exec_simple_query (   query_string=0xf4a710 "INSERT INTO reports.BusinessOperation SELECT * FROM reports.v_BusinessOperation;") at postgres.c:1145
#20 0x0000000000729534 in PostgresMain (argc=<optimized out>, argv=argv@entry=0xf76ce8, dbname=<optimized out>,   username=<optimized out>) at postgres.c:4182
#21 0x00000000006be215 in BackendRun (port=0xf6dfe0) at postmaster.c:4361
#22 BackendStartup (port=0xf6dfe0) at postmaster.c:4033
#23 ServerLoop () at postmaster.c:1706
#24 0x00000000006bf122 in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0xf45320) at postmaster.c:1379
#25 0x00000000004822dc in main (argc=3, argv=0xf45320) at main.c:228

That's it.

Thank you all very much for your interest in this case.

-Gunther

Re: Out of Memory errors are frustrating as heck!

From

Tom Lane

Date:

15 April 2019, 03:24:59

Gunther <raj@gusw.net> writes:
>        ExecutorState: 2234123384 total in 266261 blocks; 3782328 free (17244 chunks); 2230341056 used

Oooh, that looks like a memory leak right enough.  The ExecutorState
should not get that big for any reasonable query.

Your error and stack trace show a failure in HashBatchContext,
which is probably the last of these four:

>            HashBatchContext: 57432 total in 3 blocks; 16072 free (6 chunks); 41360 used
>            HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used
>            HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used
>            HashBatchContext: 100711712 total in 3065 blocks; 7936 free (0 chunks); 100703776 used

Perhaps that's more than it should be, but it's silly to obsess over 100M
when there's a 2.2G problem elsewhere.  I think it's likely that it was
just coincidence that the failure happened right there.  Unfortunately,
that leaves us with no info about where the actual leak is coming from.

The memory map shows that there were three sorts and four hashes going
on, so I'm not sure I believe that this corresponds to the query plan
you showed us before.

Any chance of extracting a self-contained test case that reproduces this?

            regards, tom lane

Re: Out of Memory errors are frustrating as heck!

From

Gunther

Date:

15 April 2019, 03:59:45

On 4/14/2019 23:24, Tom Lane wrote:
>>         ExecutorState: 2234123384 total in 266261 blocks; 3782328 free (17244 chunks); 2230341056 used
> Oooh, that looks like a memory leak right enough.  The ExecutorState
> should not get that big for any reasonable query.
2.2 GB is massive yes.
> Your error and stack trace show a failure in HashBatchContext,
> which is probably the last of these four:
>
>>             HashBatchContext: 57432 total in 3 blocks; 16072 free (6 chunks); 41360 used
>>             HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used
>>             HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used
>>             HashBatchContext: 100711712 total in 3065 blocks; 7936 free (0 chunks); 100703776 used
> Perhaps that's more than it should be, but it's silly to obsess over 100M
> when there's a 2.2G problem elsewhere.
Yes.
>    I think it's likely that it was
> just coincidence that the failure happened right there.  Unfortunately,
> that leaves us with no info about where the actual leak is coming from.

Strange though, that the vmstat tracking never showed that the cache 
allocated memory goes much below 6 GB. Even if this 2.2 GB memory leak 
is there, and even if I had 2 GB of shared_buffers, I would still have 
enough for the OS to give me.

Is there any doubt that this might be a problem with Linux? Because if 
you want, I can whip out a FreeBSD machine, compile pgsql, and attach 
the same disk, and try it there. I am longing to have a reason to move 
back to FreeBSD anyway. But I have tons of stuff to do, so if you do not 
have reason to suspect Linux to do wrong here, I prefer skipping that 
futile attempt

> The memory map shows that there were three sorts and four hashes going
> on, so I'm not sure I believe that this corresponds to the query plan
> you showed us before.
Like I said, the first explain was not using the same constraints (no 
NL). Now what I sent last should all be consistent. Memory dump and 
explain plan and gdb backtrace.
> Any chance of extracting a self-contained test case that reproduces this?

With 18 million rows involved in the base tables, hardly.

But I am ready to try some other things with the debugger that you want 
me to try. If we have a memory leak issue, we might just as well try to  
plug it!

I could even to give someone of you access to the system that runs this.

thanks,
-Gunther

Re: Out of Memory errors are frustrating as heck!

From

Justin Pryzby

Date:

15 April 2019, 04:15:00

On Sun, Apr 14, 2019 at 11:59:45PM -0400, Gunther wrote:
> On 4/14/2019 23:24, Tom Lane wrote:
> >Any chance of extracting a self-contained test case that reproduces this?
> With 18 million rows involved in the base tables, hardly.

Were you able to reproduce the problem with SELECT (without INSERT) ?
How many rows does it output ?  Show explain analyze if possible.  If that
still errors, can you make it work with a small enough LIMIT ?

We haven't seen the view - maybe it's very complicated, but can you reproduce
with a simpler one ?  Fewer joins ?  Or fewer join conditions ?

Justin

Re: Out of Memory errors are frustrating as heck!

From

Jeff Janes

Date:

15 April 2019, 13:32:09

On Sun, Apr 14, 2019 at 11:04 PM Gunther <raj@gusw.net> wrote:

Could you rerun the query with \set VERBOSITY verbose to show the file/line
that's failing ?
Here goes:
integrator=# \set VERBOSITY verbose
integrator=# SET ENABLE_NESTLOOP TO OFF;
SET
integrator=# INSERT INTO reports.BusinessOperation SELECT * FROM reports.v_BusinessOperation;
ERROR:  53200: out of memory
DETAIL:  Failed on request of size 32800 in memory context "HashBatchContext".
LOCATION:  MemoryContextAlloc, mcxt.c:798
you notice that I set ENABLE_NESTLOOP to off, that is because the planner goes off thinking the NL plan is marginally more efficient, but in fact it will take 5 hours to get to the same out of memory crash, while the no NL plan gets there in half an hour. That verbose setting didn't help much I guess.

I think the backtrace of the enable_nestloop=on plan would be more useful. Here someone has filled up memory, and then we see HashBatchContext trip over it that. But it isn't the one the one that caused the problem, so the backtrace doesn't help. With the previous plan, it was an allocation into ExecutorState which tripped over the problem, and it is likely that it is coming from the same series of allocations that caused the problem.

To get it to happen faster, maybe you could run the server with a small setting of "ulimit -v"? Or, you could try to capture it live in gdb. Unfortunately I don't know how to set a breakpoint for allocations into a specific context, and setting a breakpoint for any memory allocation is probably going to fire too often to be useful.

Yes, the verbose option didn't help (but the gdb backtrace made up for it--kind of--we really need the backtrace of the allocations into ExecutorState). It isn't helpful to know that a memory allocation failed in the mcxt.c code. To bad it doesn't report the location of the caller of that code. I know in Perl you can use Carp::croak to do that, but I don't know to do it in C.

But really the first thing I want to know now is what if you just do the select, without the insert?

explain analyze SELECT * FROM reports.v_BusinessOperation

If that works, what about "create temporary table foo as SELECT * FROM reports.v_BusinessOperation" ?

And if that works, what about "INSERT INTO reports.BusinessOperation SELECT * FROM foo"?

If the ERROR happens in the first or last of these, it might be much easier to analyze in that simplified context. If it happens in the middle one, then we probably haven't achieved much. (And if there is no ERROR at all, then you have workaround, but we still haven't found the fundamental bug).

Are you not showing the view definition for proprietary reasons, or just because you don't think it will be useful? If the latter, please send it as an attachment, I can't guarantee it will be useful, but there is only one way find out.

Cheers,

Jeff

Re: Out of Memory errors are frustrating as heck!

From

Jeff Janes

Date:

15 April 2019, 13:44:40

On Sun, Apr 14, 2019 at 11:59 PM Gunther <raj@gusw.net> wrote:

Is there any doubt that this might be a problem with Linux? Because if
you want, I can whip out a FreeBSD machine, compile pgsql, and attach
the same disk, and try it there. I am longing to have a reason to move
back to FreeBSD anyway. But I have tons of stuff to do, so if you do not
have reason to suspect Linux to do wrong here, I prefer skipping that
futile attempt

I think the PostgreSQL leaking in the first place would be independent of Linux being ungraceful about it. So repeating it on BSD probably wouldn't help us here. If you want to take up the 2nd issue with the kernel folks, having some evidence from BSD might (but not very likely) be helpful for that, but that would be for a different mailing list.

Cheers,

Jeff

Re: Out of Memory errors are frustrating as heck!

From

Tomas Vondra

Date:

15 April 2019, 15:26:30

On Sun, Apr 14, 2019 at 11:59:45PM -0400, Gunther wrote:
>On 4/14/2019 23:24, Tom Lane wrote:
>>>        ExecutorState: 2234123384 total in 266261 blocks; 3782328 free (17244 chunks); 2230341056 used
>>Oooh, that looks like a memory leak right enough.  The ExecutorState
>>should not get that big for any reasonable query.
>2.2 GB is massive yes.
>>Your error and stack trace show a failure in HashBatchContext,
>>which is probably the last of these four:
>>
>>>            HashBatchContext: 57432 total in 3 blocks; 16072 free (6 chunks); 41360 used
>>>            HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used
>>>            HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used
>>>            HashBatchContext: 100711712 total in 3065 blocks; 7936 free (0 chunks); 100703776 used
>>Perhaps that's more than it should be, but it's silly to obsess over 100M
>>when there's a 2.2G problem elsewhere.
>Yes.
>>   I think it's likely that it was
>>just coincidence that the failure happened right there.  Unfortunately,
>>that leaves us with no info about where the actual leak is coming from.
>
>Strange though, that the vmstat tracking never showed that the cache 
>allocated memory goes much below 6 GB. Even if this 2.2 GB memory leak 
>is there, and even if I had 2 GB of shared_buffers, I would still have 
>enough for the OS to give me.
>

Depends on how the kernel is configured. What are vm.overcommit_memory
and vm.overcommit_ratio set to, for example?

It may easily be the case that the kernel is only allowing 50% of RAM to
be committed to user space, and then refusing to allocate more despite
having free memory. That's fairly common issue on swapless systems.

Try running the query again, watch

    cat /proc/meminfo | grep Commit

and if it crashes when Committed_AS hits the CommitLimit.

That doesn't explain where the memory leak is, though :-(


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Out of Memory errors are frustrating as heck!

From

Tom Lane

Date:

15 April 2019, 15:28:42

Jeff Janes <jeff.janes@gmail.com> writes:
> To get it to happen faster, maybe you could run the server with a small
> setting of "ulimit -v"?  Or, you could try to capture it live in gdb.
> Unfortunately I don't know how to set a breakpoint for allocations into a
> specific context, and setting a breakpoint for any memory allocation is
> probably going to fire too often to be useful.

If you can use gdb at all, it's not that hard to break on allocations
into a specific context; I've done it many times.  The strategy is
basically

1. Let query run long enough for memory usage to start increasing,
then attach to backend with gdb.

2. Set breakpoint at, probably, AllocSetAlloc.  (In some cases,
reallocs could be the problem, but I doubt it here.)  Then "c".

3. When it stops, "p *context" and see if this is the context
you're looking for.  In this case, since we want to know about
allocations into ExecutorState and we know there's only one
active one, you just have to look at the context name.  In general
you might have to look at the backtrace.  Anyway, if it isn't the
one you want, just "c" until you get to an allocation into the
one you do want.

4. Once you have found out the address of the context you care
about, make the breakpoint conditional on the context argument
being that one.  It might look like this:

Breakpoint 1, AllocSetAlloc (context=0x1483be0, size=480) at aset.c:715
715     {
(gdb) p *context
$1 = {type = T_AllocSetContext, isReset = false, allowInCritSection = false, 
  methods = 0xa33f40, parent = 0x0, firstchild = 0x1537f30, prevchild = 0x0, 
  nextchild = 0x0, name = 0xa3483f "TopMemoryContext", ident = 0x0, 
  reset_cbs = 0x0}
(gdb) cond 1  context == 0x1483be0

5. Now repeatedly "c", and check the stack trace each time, for a
dozen or two times to get a feeling for where the allocations are
being requested.

In some cases you might be able to find the context address in a
more efficient way than what I suggest in #3 --- for instance,
you could instead set a breakpoint where the context is created
and snag its address immediately, or you could dig around in
backend data structures to find it.  But these ways generally
require more familiarity with the code than just watching the
requests go by.

> Are you not showing the view definition for proprietary reasons, or just
> because you don't think it will be useful?

As far as that goes, I think the most likely theory right now is that
some particular function being used in the view is leaking memory.
So yes, we need to see the view ... or you can try removing bits of
it to see if the leak goes away.

            regards, tom lane

Re: Out of Memory errors are frustrating as heck!

From

Tomas Vondra

Date:

15 April 2019, 15:38:49

On Sun, Apr 14, 2019 at 05:19:50PM -0400, Jeff Janes wrote:
>   On Sun, Apr 14, 2019 at 4:51 PM Gunther <raj@gusw.net> wrote:
>
>     For weeks now, I am banging my head at an "out of memory" situation.
>     There is only one query I am running on an 8 GB system, whatever I try,
>     I get knocked out on this out of memory.
>
>   Is PostgreSQL throwing an error with OOM, or is getting killed -9 by the
>   OOM killer?  Do you get a core file you can inspect with gdb?
>
>     You might want to see the query, but it is a huge plan, and I can't
>     really break this down. It shouldn't matter though. But just so you can
>     get a glimpse here is the plan:
>
> Insert on businessoperation  (cost=5358849.28..5361878.44 rows=34619 width=1197)
>   ->  Unique  (cost=5358849.28..5361532.25 rows=34619 width=1197)
>        
>
>   Maybe it is memory for trigger or constraint checking, although I don't
>   know why that would appear instantly.  What triggers or constraints do you
>   have on businessoperation? 

Yeah, that would be my guess too. If I had to guess, something likely gets
confused and allocates memory in es_query_ctx instead of the per-tuple
context (es_per_tuple_exprcontext).

Triggers, constraints and expr evaluation all seem like a plausible
candidates. It's going to be hard to nail the exact place, though :-(

>   What if you just run the SELECT without the INSERT?  Or insert into a temp
>   table rather than into businessoperation?  And if that doesn't crash, what
>   if you then insert to businessoperation from the temp table?
>    

Yeah. What's the schema of "businessoperation"? Anything special about
it? Triggers, expression indexes, check constraints, ...

Gunther, you mentioned you build postgres from sources. Would it be
possible to add some sort of extra debugging to see where the memory is
allocated from? It's a bit heavy-handed, though.

Or maybe splitting es_query_ctx into smaller contexts. That might be
easier to evaluate than sifting throuht god-knows-how-many-gbs of log.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Out of Memory errors are frustrating as heck!

From

Tom Lane

Date:

15 April 2019, 15:53:38

I wrote:
> If you can use gdb at all, it's not that hard to break on allocations
> into a specific context; I've done it many times.  The strategy is
> basically
> 1. Let query run long enough for memory usage to start increasing,
> then attach to backend with gdb.

BTW, just to clarify that strategy a bit: the usage pattern we expect
for ExecutorState is that there are a bunch of individual allocations
during executor startup, but then none while the query is running
(or at least, if there are any in that phase, they get freed before
moving on to the next row).  The form of the leak, almost certainly,
is that some allocation is happening per-row and *not* getting freed.
So there's no point in groveling through the startup behavior.  What
we want to know about is what happens after we reach the ought-to-be
steady state behavior.

            regards, tom lane

Re: Out of Memory errors are frustrating as heck!

From

Gunther

Date:

15 April 2019, 16:34:38

OK Guys, you are very kind to continue taking an interest in this matter.

I will try what I can to help squish the bug.

Tomas Vondra just added a good idea that explains why I get the out of memory with still having so much cache available:

# sysctl vm.overcommit_memory
vm.overcommit_memory = 2
# sysctl vm.overcommit_ratio
vm.overcommit_ratio = 50

as he predicted.

# cat /proc/meminfo |grep Commit
CommitLimit:     3955192 kB
Committed_AS:    2937352 kB

So I thing that explains why it turns into an out of memory error. We don't worry or wonder about that any more. I will change that parameter in the future to allow for some spikes. But it's not going to resolve the underlying memory leak issue.

Now I run explain analyze SELECT ... without the INSERT.

integrator=#  \set VERBOSITY verbose
integrator=#
integrator=# \pset pager off
Pager usage is off.
integrator=# \pset format unaligned
Output format is unaligned.
integrator=# \set VERBOSITY verbose
integrator=#
integrator=# SET ENABLE_NESTLOOP TO OFF;
SET
integrator=# explain analyze SELECT * FROM reports.v_BusinessOperation;
ERROR:  53200: out of memory
DETAIL:  Failed on request of size 32800 in memory context "HashBatchContext".
LOCATION:  MemoryContextAlloc, mcxt.c:798

And since that failed already, I guess we don't need to worry about the temporary table insert.

About adding LIMIT, I don't think it makes sense in the outer query, since the error is probably happening earlier. I did put a LIMIT 100 on one of the tables we join to, and it helped. But that doesn't really tell us anything I think.

Then yes, I can try the backtrace with the NLs enabled. It will just take a long long time and unfortunately it is extremely likely that I lose the console and then will be unable to get back to it. OK, screen(1) resolves that problem too. Will do, after I reported the above.

But now you have already produced more ideas ...

Maybe it is memory for trigger or constraint checking, although I don't
know why that would appear instantly. What triggers or constraints do you
have on businessoperation?

Yeah, that would be my guess too. If I had to guess, something likely gets
confused and allocates memory in es_query_ctx instead of the per-tuple
context (es_per_tuple_exprcontext).

Triggers, constraints and expr evaluation all seem like a plausible
candidates. It's going to be hard to nail the exact place, though

I think triggers and constraints is ruled out, because the problem happens without the INSERT.

That leaves us with expression evaluation. And OK, now you really wanna see the query, although it should be in the plan too. But for what it is worth:

SELECT DISTINCT       documentInternalId, is_current,documentId,documentTypeCode,subjectRoleInternalId,subjectEntityInternalId,subjectEntityId,subjectEntityIdRoot,subjectEntityName,subjectEntityTel,subjectEntityEmail,otherEntityInternalId,confidentialityCode,actInternalId, code_code as operationCode, code_displayName AS operationName,operationQualifierCode,operationQualifierName,approvalNumber,approvalNumberSystem,approvalStateCode,approvalStateCodeSystem,approvalEffectiveTimeLow,approvalEffectiveTimeHigh,approvalStatusCode,licenseCode,agencyId,agencyName,productItemCode,productInternalId FROM reports.DocumentInformationSubject  LEFT OUTER JOIN (SELECT documentInternalId, documentId, actInternalId,                  subjectEntityCode as productItemCode,		  subjectEntityInternalId as productInternalId	     FROM reports.DocumentInformationSubject	    WHERE participationTypeCode = 'PRD') prd   USING(documentInternalId, documentId, actInternalId) LEFT OUTER JOIN (   SELECT documentInternalId,          q.code_code AS operationQualifierCode, 	   q.code_displayName AS operationQualifierName, 	   r.targetInternalId AS actInternalId,   actInternalId AS approvalInternalId, 	   an.extension AS approvalNumber, 	   an.root AS approvalNumberSystem,   qs.subjectEntityCode AS approvalStateCode,   qs.subjectEntityCodeSystem AS approvalStateCodeSystem, 	   qs.effectivetime_low AS approvalEffectiveTimeLow, 	   qs.effectivetime_high AS approvalEffectiveTimeHigh, 	   qs.statusCode AS approvalStatusCode,   qs.code_code AS licenseCode,   agencyId.extension AS agencyId,   agencyName.trivialName AS agencyName     FROM reports.DocumentInformation q     LEFT OUTER JOIN (SELECT * FROM reports.DocumentInformationSubject WHERE participationTypeCode = 'AUT') qs       USING(documentInternalId, actInternalId)     INNER JOIN integrator.ActRelationship r        ON(    r.sourceInternalId = actInternalId  	   AND r.typeCode = 'SUBJ')     LEFT OUTER JOIN integrator.Act_id an USING(actInternalId)     LEFT OUTER JOIN integrator.Entity_id agencyId ON(agencyId.entityInternalId = otherEntityInternalId)     LEFT OUTER JOIN reports.BestName agencyName ON(agencyName.entityInternalId = otherEntityInternalId)    WHERE q.classCode = 'CNTRCT'      AND q.moodCode = 'EVN'      AND q.code_codeSystem = '2.16.840.1.113883.3.26.1.1'         ) q   USING(documentInternalId, actInternalId)WHERE classCode = 'ACT'   AND moodCode = 'DEF'   AND code_codeSystem = '2.16.840.1.113883.3.26.1.1'   AND participationTypeCode IN ('PPRF','PRF');

You see that the expressions are all just equal operations, some IN, nothing outlandish.

Now I will try what Tom Lane suggested. Here you go. And I have it stopped at this state, so if you want me to inspect anything else, I can do it.

With screen(1) I can be sure I won't lose my stuff when my internet goes down. Nice.

I have one screen session with 3 windows:

psql
gdb
misc (vmstat, etc.)

Now I have let this run for a good long time while setting up my screen stuff. And then:

ps -x

look for the postgres job with the EXPLAIN ... that's $PID, then:

gdb -p $PID

Then first I do

cont

but then it stops at SIGUSR1, because of the parallel workers signalling each other.

handle SIGUSR1 nostop

suppresses that stopping. Then I break CTRL-C, and set the breakpoint where Tom Lane said:

b AllocSetAlloc

once it stops there I do

Breakpoint 1, AllocSetAlloc (context=0x1168230, size=8) at aset.c:715
715     {
(gdb) p context->name
$4 = 0x96ce5b "ExecutorState"

So I should even be able to set a conditional breakpoint.

(gdb) delete
Delete all breakpoints? (y or n) y
(gdb) b AllocSetAlloc if  strcmp(context->name, "ExecutorState") == 0
Breakpoint 2 at 0x848ed0: file aset.c, line 715.
(gdb) cont
(gdb) cont
Continuing.

Breakpoint 2, AllocSetAlloc (context=0x1168230, size=10) at aset.c:715
715     {
(gdb) cont
Continuing.

Program received signal SIGUSR1, User defined signal 1.

Breakpoint 2, AllocSetAlloc (context=0x1168230, size=152) at aset.c:715
715     {
(gdb) cont
Continuing.

Program received signal SIGUSR1, User defined signal 1.

Breakpoint 2, AllocSetAlloc (context=0x1168230, size=201) at aset.c:715
715     {
(gdb) cont
Continuing.

Breakpoint 2, AllocSetAlloc (context=0x1168230, size=8272) at aset.c:715
715     {
(gdb) p context->name
$8 = 0x96ce5b "ExecutorState"

Nice. Now the question is, am I at the place where memory gets squeezed? And I think yes. With top

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
31752 postgres  20   0 2772964   1.2g 329640 t   0.0 16.5   8:46.59 postgres: postgres integrator [local] EXPLAIN

I guess I should run this for a little longer. So I disable my breakpoints

(gdb) info breakpoints
Num     Type           Disp Enb Address            What
2       breakpoint     keep y   0x0000000000848ed0 in AllocSetAlloc at aset.c:715       stop only if  strcmp(context->name, "ExecutorState") == 0       breakpoint already hit 6 times
(gdb) disable 2
(gdb) cont
Continuing.

while watching top:

31752 postgres  20   0 2777060   1.3g 329920 D  33.2 17.9   8:52.07 postgres: postgres integrator [local] EXPLAIN

31752 postgres  20   0 2777060   1.4g 329920 D  33.2 17.9   8:52.07 postgres: postgres integrator [local] EXPLAIN

31752 postgres  20   0 2777060   1.5g 329920 D  33.2 17.9   8:52.07 postgres: postgres integrator [local] EXPLAIN

it went up pretty quick from 1.2 GB to 1.5 GB, but then it stopped growing fast, so now back to gdb and break:

^C
Program received signal SIGINT, Interrupt.
0x00007f048f336d71 in read () from /lib64/libpthread.so.0
(gdb) enable 2
(gdb) cont
Continuing.

Breakpoint 2, AllocSetAlloc (context=0x1168230, size=385) at aset.c:715
715     {

Now I give you a bt so we have something to look at:

#0  AllocSetAlloc (context=0x1168230, size=385) at aset.c:715
#1  0x000000000084e6cd in palloc (size=385) at mcxt.c:938
#2  0x000000000061019c in ExecHashJoinGetSavedTuple (file=file@entry=0x8bbc528, hashvalue=hashvalue@entry=0x7fff2e4ca76c,   tupleSlot=0x10856b8, hjstate=0x11688e0) at nodeHashjoin.c:1277
#3  0x0000000000610c83 in ExecHashJoinNewBatch (hjstate=0x11688e0) at nodeHashjoin.c:1042
#4  ExecHashJoinImpl (parallel=false, pstate=0x11688e0) at nodeHashjoin.c:539
#5  ExecHashJoin (pstate=0x11688e0) at nodeHashjoin.c:565
#6  0x00000000005fde68 in ExecProcNodeInstr (node=0x11688e0) at execProcnode.c:461
#7  0x000000000061ce4e in ExecProcNode (node=0x11688e0) at ../../../src/include/executor/executor.h:247
#8  ExecSort (pstate=0x11687d0) at nodeSort.c:107
#9  0x00000000005fde68 in ExecProcNodeInstr (node=0x11687d0) at execProcnode.c:461
#10 0x000000000061d2c4 in ExecProcNode (node=0x11687d0) at ../../../src/include/executor/executor.h:247
#11 ExecUnique (pstate=0x11685e0) at nodeUnique.c:73
#12 0x00000000005fde68 in ExecProcNodeInstr (node=0x11685e0) at execProcnode.c:461
#13 0x00000000005f75ba in ExecProcNode (node=0x11685e0) at ../../../src/include/executor/executor.h:247
#14 ExecutePlan (execute_once=<optimized out>, dest=0xcc60e0 <donothingDR>, direction=<optimized out>, numberTuples=0,   sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x11685e0, estate=0x1168340)   at execMain.c:1723
#15 standard_ExecutorRun (queryDesc=0x119b6d8, direction=<optimized out>, count=0, execute_once=<optimized out>) at execMain.c:364
#16 0x000000000059c6f8 in ExplainOnePlan (plannedstmt=plannedstmt@entry=0x1199a68, into=into@entry=0x0, es=es@entry=0x1141d48,   queryString=<optimized out>, params=0x0, queryEnv=queryEnv@entry=0x0, planduration=0x7fff2e4ca990) at explain.c:535
#17 0x000000000059c9ef in ExplainOneQuery (query=<optimized out>, cursorOptions=<optimized out>, into=0x0, es=0x1141d48,   queryString=0xf4af30 "explain analyze\nSELECT DISTINCT\n        documentInternalId, is_current,\ndocumentId,\ndocumentTypeCode,\nsubjectRoleInternalId,\nsubjectEntityInternalId,\nsubjectEntityId,\nsubjectEntityIdRoot,\nsubjectEntit"..., params=0x0, queryEnv=0x0)   at explain.c:371
#18 0x000000000059ce37 in ExplainQuery (pstate=pstate@entry=0xf74608, stmt=stmt@entry=0x11ef240,   queryString=queryString@entry=0xf4af30 "explain analyze\nSELECT DISTINCT\n        documentInternalId, is_current,\ndocumentId,\ndocumentTypeCode,\nsubjectRoleInternalId,\nsubjectEntityInternalId,\nsubjectEntityId,\nsubjectEntityIdRoot,\nsubjectEntit"...,   params=params@entry=0x0, queryEnv=queryEnv@entry=0x0, dest=dest@entry=0xf74578) at explain.c:254
#19 0x000000000072ca5d in standard_ProcessUtility (pstmt=0x11ef390,   queryString=0xf4af30 "explain analyze\nSELECT DISTINCT\n        documentInternalId, is_current,\ndocumentId,\ndocumentTypeCode,\nsubjectRoleInternalId,\nsubjectEntityInternalId,\nsubjectEntityId,\nsubjectEntityIdRoot,\nsubjectEntit"...,   context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0xf74578, completionTag=0x7fff2e4cab20 "") at utility.c:675
#20 0x000000000072a052 in PortalRunUtility (portal=0xfb06b0, pstmt=0x11ef390, isTopLevel=<optimized out>,   setHoldSnapshot=<optimized out>, dest=<optimized out>, completionTag=0x7fff2e4cab20 "") at pquery.c:1178
#21 0x000000000072add2 in FillPortalStore (portal=portal@entry=0xfb06b0, isTopLevel=isTopLevel@entry=true) at pquery.c:1038
#22 0x000000000072b855 in PortalRun (portal=portal@entry=0xfb06b0, count=count@entry=9223372036854775807,   isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0xf4c570, altdest=altdest@entry=0xf4c570,   completionTag=0x7fff2e4cad30 "") at pquery.c:768
#23 0x00000000007276e8 in exec_simple_query (   query_string=0xf4af30 "explain analyze\nSELECT DISTINCT\n        documentInternalId, is_current,\ndocumentId,\ndocumentTypeCode,\nsubjectRoleInternalId,\nsubjectEntityInternalId,\nsubjectEntityId,\nsubjectEntityIdRoot,\nsubjectEntit"...) at postgres.c:1145
#24 0x0000000000729534 in PostgresMain (argc=<optimized out>, argv=argv@entry=0xf76ce8, dbname=<optimized out>,   username=<optimized out>) at postgres.c:4182
#25 0x00000000006be215 in BackendRun (port=0xf6efe0) at postmaster.c:4361
#26 BackendStartup (port=0xf6efe0) at postmaster.c:4033
#27 ServerLoop () at postmaster.c:1706
#28 0x00000000006bf122 in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0xf45320) at postmaster.c:1379
#29 0x00000000004822dc in main (argc=3, argv=0xf45320) at main.c:228

But who knows if that's it. I continue and watch top again...

31752 postgres  20   0 3112352   1.8g 329920 D  32.2 23.7   9:43.75 postgres: postgres integrator [local] EXPLAIN

it went quickly to 1.6, then after some time to 1.7, then 1.8, and I stop again:

^C
Program received signal SIGINT, Interrupt.
0x00007f048f336d71 in read () from /lib64/libpthread.so.0
(gdb) enable 2
(gdb) cont
Continuing.

Breakpoint 2, AllocSetAlloc (context=0x1168230, size=375) at aset.c:715
715     {
bt
#0  AllocSetAlloc (context=0x1168230, size=375) at aset.c:715
#1  0x000000000084e6cd in palloc (size=375) at mcxt.c:938
#2  0x000000000061019c in ExecHashJoinGetSavedTuple (file=file@entry=0x21df688, hashvalue=hashvalue@entry=0x7fff2e4ca76c,   tupleSlot=0x10856b8, hjstate=0x11688e0) at nodeHashjoin.c:1277
#3  0x0000000000610c83 in ExecHashJoinNewBatch (hjstate=0x11688e0) at nodeHashjoin.c:1042
#4  ExecHashJoinImpl (parallel=false, pstate=0x11688e0) at nodeHashjoin.c:539
#5  ExecHashJoin (pstate=0x11688e0) at nodeHashjoin.c:565
#6  0x00000000005fde68 in ExecProcNodeInstr (node=0x11688e0) at execProcnode.c:461
#7  0x000000000061ce4e in ExecProcNode (node=0x11688e0) at ../../../src/include/executor/executor.h:247
#8  ExecSort (pstate=0x11687d0) at nodeSort.c:107
#9  0x00000000005fde68 in ExecProcNodeInstr (node=0x11687d0) at execProcnode.c:461
#10 0x000000000061d2c4 in ExecProcNode (node=0x11687d0) at ../../../src/include/executor/executor.h:247
#11 ExecUnique (pstate=0x11685e0) at nodeUnique.c:73
#12 0x00000000005fde68 in ExecProcNodeInstr (node=0x11685e0) at execProcnode.c:461
#13 0x00000000005f75ba in ExecProcNode (node=0x11685e0) at ../../../src/include/executor/executor.h:247
#14 ExecutePlan (execute_once=<optimized out>, dest=0xcc60e0 <donothingDR>, direction=<optimized out>, numberTuples=0,   sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x11685e0, estate=0x1168340)   at execMain.c:1723
#15 standard_ExecutorRun (queryDesc=0x119b6d8, direction=<optimized out>, count=0, execute_once=<optimized out>) at execMain.c:364
#16 0x000000000059c6f8 in ExplainOnePlan (plannedstmt=plannedstmt@entry=0x1199a68, into=into@entry=0x0, es=es@entry=0x1141d48,   queryString=<optimized out>, params=0x0, queryEnv=queryEnv@entry=0x0, planduration=0x7fff2e4ca990) at explain.c:535
#17 0x000000000059c9ef in ExplainOneQuery (query=<optimized out>, cursorOptions=<optimized out>, into=0x0, es=0x1141d48,   queryString=0xf4af30 "explain analyze\nSELECT DISTINCT\n        documentInternalId, is_current,\ndocumentId,\ndocumentTypeCode,\nsubjectRoleInternalId,\nsubjectEntityInternalId,\nsubjectEntityId,\nsubjectEntityIdRoot,\nsubjectEntit"..., params=0x0, queryEnv=0x0)   at explain.c:371
#18 0x000000000059ce37 in ExplainQuery (pstate=pstate@entry=0xf74608, stmt=stmt@entry=0x11ef240,   queryString=queryString@entry=0xf4af30 "explain analyze\nSELECT DISTINCT\n        documentInternalId, is_current,\ndocumentId,\ndocumentTypeCode,\nsubjectRoleInternalId,\nsubjectEntityInternalId,\nsubjectEntityId,\nsubjectEntityIdRoot,\nsubjectEntit"...,   params=params@entry=0x0, queryEnv=queryEnv@entry=0x0, dest=dest@entry=0xf74578) at explain.c:254
#19 0x000000000072ca5d in standard_ProcessUtility (pstmt=0x11ef390,   queryString=0xf4af30 "explain analyze\nSELECT DISTINCT\n        documentInternalId, is_current,\ndocumentId,\ndocumentTypeCode,\nsubjectRoleInternalId,\nsubjectEntityInternalId,\nsubjectEntityId,\nsubjectEntityIdRoot,\nsubjectEntit"...,   context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0xf74578, completionTag=0x7fff2e4cab20 "") at utility.c:675
#20 0x000000000072a052 in PortalRunUtility (portal=0xfb06b0, pstmt=0x11ef390, isTopLevel=<optimized out>,   setHoldSnapshot=<optimized out>, dest=<optimized out>, completionTag=0x7fff2e4cab20 "") at pquery.c:1178
#21 0x000000000072add2 in FillPortalStore (portal=portal@entry=0xfb06b0, isTopLevel=isTopLevel@entry=true) at pquery.c:1038
#22 0x000000000072b855 in PortalRun (portal=portal@entry=0xfb06b0, count=count@entry=9223372036854775807,   isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0xf4c570, altdest=altdest@entry=0xf4c570,   completionTag=0x7fff2e4cad30 "") at pquery.c:768
#23 0x00000000007276e8 in exec_simple_query (   query_string=0xf4af30 "explain analyze\nSELECT DISTINCT\n        documentInternalId, is_current,\ndocumentId,\ndocumentTypeCode,\nsubjectRoleInternalId,\nsubjectEntityInternalId,\nsubjectEntityId,\nsubjectEntityIdRoot,\nsubjectEntit"...) at postgres.c:1145
#24 0x0000000000729534 in PostgresMain (argc=<optimized out>, argv=argv@entry=0xf76ce8, dbname=<optimized out>,   username=<optimized out>) at postgres.c:4182
#25 0x00000000006be215 in BackendRun (port=0xf6efe0) at postmaster.c:4361
#26 BackendStartup (port=0xf6efe0) at postmaster.c:4033
#27 ServerLoop () at postmaster.c:1706
#28 0x00000000006bf122 in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0xf45320) at postmaster.c:1379
#29 0x00000000004822dc in main (argc=3, argv=0xf45320) at main.c:228

Good, now I leave this all sitting like that for you to ask me what else you might want to see.

We are now close to the edge of the cliff.

-Gunther

Re: Out of Memory errors are frustrating as heck!

From

Jeff Janes

Date:

15 April 2019, 16:51:56

On Mon, Apr 15, 2019 at 11:28 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Jeff Janes <jeff.janes@gmail.com> writes:
> To get it to happen faster, maybe you could run the server with a small
> setting of "ulimit -v"? Or, you could try to capture it live in gdb.
> Unfortunately I don't know how to set a breakpoint for allocations into a
> specific context, and setting a breakpoint for any memory allocation is
> probably going to fire too often to be useful.

If you can use gdb at all, it's not that hard to break on allocations
into a specific context; I've done it many times. The strategy is
basically

1. Let query run long enough for memory usage to start increasing,
then attach to backend with gdb.

2. Set breakpoint at, probably, AllocSetAlloc. (In some cases,
reallocs could be the problem, but I doubt it here.) Then "c".

3. When it stops, "p *context" and see if this is the context
you're looking for. In this case, since we want to know about
allocations into ExecutorState and we know there's only one
active one, you just have to look at the context name. In general
you might have to look at the backtrace. Anyway, if it isn't the
one you want, just "c" until you get to an allocation into the
one you do want.

4. Once you have found out the address of the context you care
about, make the breakpoint conditional on the context argument
being that one. It might look like this:

Breakpoint 1, AllocSetAlloc (context=0x1483be0, size=480) at aset.c:715
715 {
(gdb) p *context
$1 = {type = T_AllocSetContext, isReset = false, allowInCritSection = false,
methods = 0xa33f40, parent = 0x0, firstchild = 0x1537f30, prevchild = 0x0,
nextchild = 0x0, name = 0xa3483f "TopMemoryContext", ident = 0x0,
reset_cbs = 0x0}
(gdb) cond 1 context == 0x1483be0

5. Now repeatedly "c", and check the stack trace each time, for a
dozen or two times to get a feeling for where the allocations are
being requested.

In some cases you might be able to find the context address in a
more efficient way than what I suggest in #3 --- for instance,
you could instead set a breakpoint where the context is created
and snag its address immediately, or you could dig around in
backend data structures to find it. But these ways generally
require more familiarity with the code than just watching the
requests go by.

Thanks for the recipe. I can use gdb at all, just not very skillfully :)

With that as a starting point, experimentally, this seems to work to short circuit the loop described in your step 3 (which I fear could be thousands of iterations in some situations):

cond 1 strcmp(context.name,"ExecutorState")==0

Also, I've found that in the last few versions of PostgreSQL, processes might get unreasonable numbers of SIGUSR1 (maybe due to parallelization?) and so to avoid having to stand on the 'c' button, you might need this:

handle SIGUSR1 noprint nostop

Cheers,

Jeff

Re: Out of Memory errors are frustrating as heck!

From

Tom Lane

Date:

15 April 2019, 17:32:51

Gunther <raj@gusw.net> writes:
> Now I give you a bt so we have something to look at:

> #0  AllocSetAlloc (context=0x1168230, size=385) at aset.c:715
> #1  0x000000000084e6cd in palloc (size=385) at mcxt.c:938
> #2  0x000000000061019c in ExecHashJoinGetSavedTuple (file=file@entry=0x8bbc528,
hashvalue=hashvalue@entry=0x7fff2e4ca76c,
>      tupleSlot=0x10856b8, hjstate=0x11688e0) at nodeHashjoin.c:1277

I'm pretty sure that's not the droid we're looking for.
ExecHashJoinGetSavedTuple does palloc a new tuple, but it immediately
sticks it into a TupleTableSlot that will be responsible for freeing
it (when the next tuple is stuck into the same slot).  I'd suggest
continuing a few times and looking for other code paths leading
to AllocSetAlloc in this context.

My first thought on noticing the SELECT DISTINCT was that you might be
hitting the grouping-function-related leak that Andres fixed in 9cf37a527;
but that fix did make it into 11.2 (by just a couple of days...).  Still,
maybe there's another issue in the same area.

            regards, tom lane

Re: Out of Memory errors are frustrating as heck!

From

Alvaro Herrera

Date:

15 April 2019, 17:38:48

On 2019-Apr-15, Gunther wrote:

> #0  AllocSetAlloc (context=0x1168230, size=385) at aset.c:715
> #1  0x000000000084e6cd in palloc (size=385) at mcxt.c:938
> #2  0x000000000061019c in ExecHashJoinGetSavedTuple (file=file@entry=0x8bbc528,
hashvalue=hashvalue@entry=0x7fff2e4ca76c,
>     tupleSlot=0x10856b8, hjstate=0x11688e0) at nodeHashjoin.c:1277
> #3  0x0000000000610c83 in ExecHashJoinNewBatch (hjstate=0x11688e0) at nodeHashjoin.c:1042

Seems that ExecHashJoinGetSavedTuple stores a minimalTuple and sets the
shouldFree flag to "true", and then in ExecHashJoinNewBatch, callee
ExecFetchSlotMinimalTuple sets shouldFree to false inconditionally when
the slot uses minimal tuple ops.  Maybe that's correct, but it does
sound like a memory leak is not entirely impossible.  I wonder if this
fixes it, without causing crashes elsewhere.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

mintup-free.patch

Re: Out of Memory errors are frustrating as heck!

From

Tom Lane

Date:

15 April 2019, 17:45:00

Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Seems that ExecHashJoinGetSavedTuple stores a minimalTuple and sets the
> shouldFree flag to "true", and then in ExecHashJoinNewBatch, callee
> ExecFetchSlotMinimalTuple sets shouldFree to false inconditionally when
> the slot uses minimal tuple ops.  Maybe that's correct, but it does
> sound like a memory leak is not entirely impossible.  I wonder if this
> fixes it, without causing crashes elsewhere.

This discussion is about v11, not HEAD.  Still, I agree that that
coding in HEAD seems a bit fishy.

            regards, tom lane

Re: Out of Memory errors are frustrating as heck!

From

Jeff Janes

Date:

15 April 2019, 18:14:07

On Mon, Apr 15, 2019 at 12:34 PM Gunther <raj@gusw.net> wrote:

Breakpoint 2, AllocSetAlloc (context=0x1168230, size=8272) at aset.c:715
715     {
(gdb) p context->name
$8 = 0x96ce5b "ExecutorState"

I think that the above one might have been the one you wanted.

I guess I should run this for a little longer. So I disable my breakpoints

it went up pretty quick from 1.2 GB to 1.5 GB, but then it stopped growing fast, so now back to gdb and break:

Unfortunately, I think this means you missed your opportunity and are now getting backtraces of the innocent bystanders.

Particularly since you report that the version using nested loops rather than hash joins also leaked, so it is probably not the hash-join specific code that is doing it.

What I've done before is compile with the comments removed from

src/backend/utils/mmgr/aset.c:/* #define HAVE_ALLOCINFO */

and then look for allocations sizes which are getting allocated but not freed, and then you can go back to gdb to look for allocations of those specific sizes. This generates a massive amount of output, and it bypasses the logging configuration and goes directly to stderr--so it might not end up where you expect.

Thanks for the view definition. Nothing in it stood out to me as risky.

Cheers,

Jeff

Re: Out of Memory errors are frustrating as heck!

From

Gunther

Date:

16 April 2019, 01:49:50

Wow, we are getting somewhere.

Tom (BTW, your mail server rejects my direct mail, but I'm glad you got it through the list), you say:

I'm pretty sure that's not the droid we're looking for.
ExecHashJoinGetSavedTuple does palloc a new tuple, but it immediately
sticks it into a TupleTableSlot that will be responsible for freeing
it (when the next tuple is stuck into the same slot).  I'd suggest
continuing a few times and looking for other code paths leading
to AllocSetAlloc in this context.

I did continue a "few times", but few as in a dozen, it's always the same

(gdb) bt 6
#0  AllocSetAlloc (context=0x1168230, size=375) at aset.c:715
#1  0x000000000084e6cd in palloc (size=375) at mcxt.c:938
#2  0x000000000061019c in ExecHashJoinGetSavedTuple (file=file@entry=0x21df688, hashvalue=hashvalue@entry=0x7fff2e4ca76c,   tupleSlot=0x10856b8, hjstate=0x11688e0) at nodeHashjoin.c:1277
#3  0x0000000000610c83 in ExecHashJoinNewBatch (hjstate=0x11688e0) at nodeHashjoin.c:1042
#4  ExecHashJoinImpl (parallel=false, pstate=0x11688e0) at nodeHashjoin.c:539
#5  ExecHashJoin (pstate=0x11688e0) at nodeHashjoin.c:565
(More stack frames follow...)

So I decided to just let it go until it exits the ExecHashJoin function:

#6  0x00000000005fde68 in ExecProcNodeInstr (node=0x11688e0) at execProcnode.c:461
461             result = node->ExecProcNodeReal(node);
(gdb) list
456     {
457             TupleTableSlot *result;
458
459             InstrStartNode(node->instrument);
460
461             result = node->ExecProcNodeReal(node);
462
463             InstrStopNode(node->instrument, TupIsNull(result) ? 0.0 : 1.0);
464
465             return result;
(gdb) break 463
Breakpoint 3 at 0x5fde68: file execProcnode.c, line 463.
(gdb) disable 2
(gdb) cont
Continuing.

Breakpoint 3, ExecProcNodeInstr (node=0x11688e0) at execProcnode.c:463
463             InstrStopNode(node->instrument, TupIsNull(result) ? 0.0 : 1.0);

oops, that was fast, so up further ...

(gdb) cont
Continuing.

Breakpoint 4, ExecSort (pstate=0x11687d0) at nodeSort.c:109
109                             if (TupIsNull(slot))
(gdb) cont
Continuing.

Breakpoint 3, ExecProcNodeInstr (node=0x11688e0) at execProcnode.c:463
463             InstrStopNode(node->instrument, TupIsNull(result) ? 0.0 : 1.0);
(gdb) cont
Continuing.

Breakpoint 4, ExecSort (pstate=0x11687d0) at nodeSort.c:109
109                             if (TupIsNull(slot))
(gdb) up
#1  0x00000000005fde68 in ExecProcNodeInstr (node=0x11687d0) at execProcnode.c:461
461             result = node->ExecProcNodeReal(node);
(gdb) up
#2  0x000000000061d2c4 in ExecProcNode (node=0x11687d0) at ../../../src/include/executor/executor.h:247
247             return node->ExecProcNode(node);
(gdb) up
#3  ExecUnique (pstate=0x11685e0) at nodeUnique.c:73
73                      slot = ExecProcNode(outerPlan);
(gdb) list
68              for (;;)
69              {
70                      /*
71                       * fetch a tuple from the outer subplan
72                       */
73                      slot = ExecProcNode(outerPlan);
74                      if (TupIsNull(slot))
75                      {
76                              /* end of subplan, so we're done */
77                              ExecClearTuple(resultTupleSlot);

... but whatever I do, ultimately I get to that allocation routine through the same path.

Since that is the bulk of the activity, and memory was still growing while we come through this path, I assume that this is it.

My first thought on noticing the SELECT DISTINCT was that you might be
hitting the grouping-function-related leak that Andres fixed in 9cf37a527;
but that fix did make it into 11.2 (by just a couple of days...).  Still,
maybe there's another issue in the same area.

I don't know about that one, I only know that I am running 11.2 freshly compiled.

The change suggested by Alvaro Herrera wasn't applicable.

Jeff Janes had more

Breakpoint 2, AllocSetAlloc (context=0x1168230, size=8272) at aset.c:715
715     {
(gdb) p context->name
$8 = 0x96ce5b "ExecutorState"
I think that the above one might have been the one you wanted.

Not sure how you could tell that? It's the same place as everything else. If we can find out what you're looking for, may be we can set a break point earlier up the call chain?

I guess I should run this for a little longer. So I disable my breakpoints
it went up pretty quick from 1.2 GB to 1.5 GB, but then it stopped growing fast, so now back to gdb and break:
Unfortunately, I think this means you missed your opportunity and are now getting backtraces of the innocent bystanders.

But why? If I see the memory still go up insanely fast, isn't that a sign for the leak?

Particularly since you report that the version using nested loops rather than hash joins also leaked, so it is probably not the hash-join specific code that is doing it.

How about it's in the DISTINCT? I noticed while peeking up the call chain, that it was already in the UNIQUE sort thing also. I guess it's streaming the results from the hash join right into the unique sort step.

What I've done before is compile with the comments removed from
src/backend/utils/mmgr/aset.c:/* #define HAVE_ALLOCINFO */

I have just done that and it creates an insane amount of output from all the processes, I'm afraid there will be no way to keep that stuff separated. If there was a way of turning that one and off for one process only, then we could probably get more info...

Everything is also extremely slow that way. Like in a half hour the memory didn't even reach 100 MB.

and then look for allocations sizes which are getting allocated but not freed, and then you can go back to gdb to look for allocations of those specific sizes.

I guess I should look for both, address and size to match it better.

This generates a massive amount of output, and it bypasses the logging configuration and goes directly to stderr--so it might not end up where you expect.

Yes, massive, like I said. Impossible to use. File system fills up rapidly. I made it so that it can be turned on and off, with the debugger.

int _alloc_info = 0;
#ifdef HAVE_ALLOCINFO
#define AllocFreeInfo(_cxt, _chunk) \       if(_alloc_info) \           fprintf(stderr, "AllocFree: %s: %p, %zu\n", \               (_cxt)->header.name, (_chunk), (_chunk)->size)
#define AllocAllocInfo(_cxt, _chunk) \       if(_alloc_info) \           fprintf(stderr, "AllocAlloc: %s: %p, %zu\n", \               (_cxt)->header.name, (_chunk), (_chunk)->size)
#else
#define AllocFreeInfo(_cxt, _chunk)
#define AllocAllocInfo(_cxt, _chunk)
#endif

so with this I do

(gdb) b AllocSetAlloc
(gdb) cont
(gdb) set _alloc_info=1
(gdb) disable
(gdb) cont

then I wait, ... until it crashes again ... no, it's too much. It fills up my filesystem in no time with the logs. It produced 3 GB in just a minute of run time.

And also, I doubt we can find anything specifically by allocation size. It's just going to be 512 or whatever.

Isn't there some other way?

I'm going to try without that DISTINCT step, or perhaps by dismantling this query until it works without this excessive memory growth.

-Gunther

Re: Out of Memory errors are frustrating as heck!

From

Tom Lane

Date:

16 April 2019, 02:28:03

Gunther <raj@gusw.net> writes:
> Tom (BTW, your mail server rejects my direct mail,

[ raised eyebrow ]  It's coming through fine AFAICS.

>> I'm pretty sure that's not the droid we're looking for.
>> ExecHashJoinGetSavedTuple does palloc a new tuple, but it immediately
>> sticks it into a TupleTableSlot that will be responsible for freeing
>> it (when the next tuple is stuck into the same slot).

> I did continue a "few times", but few as in a dozen, it's always the same

Well, I still don't believe that ExecHashJoinGetSavedTuple is the issue.
It has a mechanism for freeing the allocation at the right time, and
if that were broken then all hash joins would be leaking.  It's easy
to prove that that's not so, both by experiment and by the lack of
other reports.

It's barely conceivable that in your particular query, there's something
acting to break that which doesn't manifest typically; but I think it's
much more likely that you simply haven't found the culprit allocation.
It's quite feasible that many many ExecHashJoinGetSavedTuple calls would
go by in between problem allocations.

Another line of thought is that maybe the problem is with realloc'ing
something larger and larger?  You could try trapping AllocSetRealloc
to see.

(BTW, it looks like we *do* have a leak with simple hash joins in
HEAD.  But not v11.)

            regards, tom lane

Re: Out of Memory errors are frustrating as heck!

From

Alvaro Herrera

Date:

16 April 2019, 02:39:12

On 2019-Apr-15, Tom Lane wrote:

> It's barely conceivable that in your particular query, there's something
> acting to break that which doesn't manifest typically; but I think it's
> much more likely that you simply haven't found the culprit allocation.
> It's quite feasible that many many ExecHashJoinGetSavedTuple calls would
> go by in between problem allocations.

A possibly useful thing to do is use "commands" in gdb to print out a
stack trace for each allocation that touches the problem memory context
and collect them into some output, then classify the allocations based
on the stack trace on each.  No need to do it manually.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Out of Memory errors are frustrating as heck!

From

Gunther

Date:

16 April 2019, 02:39:16

On 4/15/2019 21:49, Gunther wrote:

I'm going to try without that DISTINCT step, or perhaps by dismantling this query until it works without this excessive memory growth.

It also failed. Out of memory. The resident memory size of the backend was 1.5 GB before it crashed.

TopMemoryContext: 4335600 total in 8 blocks; 41208 free (16 chunks); 4294392 used HandleParallelMessages: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used TableSpace cache: 8192 total in 1 blocks; 2096 free (0 chunks); 6096 used Type information cache: 24352 total in 2 blocks; 2624 free (0 chunks); 21728 used Operator lookup cache: 24576 total in 2 blocks; 10760 free (3 chunks); 13816 used pgstat TabStatusArray lookup hash table: 8192 total in 1 blocks; 416 free (0 chunks); 7776 used TopTransactionContext: 8192 total in 1 blocks; 5416 free (2 chunks); 2776 used RowDescriptionContext: 8192 total in 1 blocks; 6896 free (0 chunks); 1296 used MessageContext: 524288 total in 7 blocks; 186848 free (7 chunks); 337440 used Operator class cache: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used smgr relation table: 32768 total in 3 blocks; 16832 free (8 chunks); 15936 used TransactionAbortContext: 32768 total in 1 blocks; 32512 free (0 chunks); 256 used Portal hash: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used TopPortalContext: 8192 total in 1 blocks; 7664 free (0 chunks); 528 used   PortalHoldContext: 24632 total in 2 blocks; 7392 free (0 chunks); 17240 used   PortalContext: 1105920 total in 138 blocks; 10368 free (8 chunks); 1095552 used:     ExecutorState: 2238648944 total in 266772 blocks; 3726944 free (16276 chunks); 2234922000 used       HashTableContext: 16384 total in 2 blocks; 4032 free (5 chunks); 12352 used         HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashTableContext: 8192 total in 1 blocks; 7320 free (0 chunks); 872 used         HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashTableContext: 8192 total in 1 blocks; 7320 free (0 chunks); 872 used         HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashTableContext: 8192 total in 1 blocks; 7752 free (0 chunks); 440 used         HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used       HashTableContext: 8192 total in 1 blocks; 7624 free (0 chunks); 568 used         HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used       TupleSort main: 286912 total in 8 blocks; 246792 free (39 chunks); 40120 used       TupleSort main: 286912 total in 8 blocks; 246792 free (39 chunks); 40120 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashTableContext: 8454256 total in 6 blocks; 64848 free (32 chunks); 8389408 used         HashBatchContext: 66935744 total in 2037 blocks; 7936 free (0 chunks); 66927808 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used Relcache by OID: 16384 total in 2 blocks; 3512 free (2 chunks); 12872 used CacheMemoryContext: 1101328 total in 14 blocks; 383480 free (0 chunks); 717848 used   index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_toast_2619_index   index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: entity_id_fkidx   index info: 2048 total in 2 blocks; 696 free (1 chunks); 1352 used: entity_id_idx   index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: act_id_fkidx   index info: 2048 total in 2 blocks; 696 free (1 chunks); 1352 used: act_id_idx   index info: 2048 total in 2 blocks; 592 free (1 chunks); 1456 used: pg_constraint_conrelid_contypid_conname_index   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: actrelationship_pkey   index info: 2048 total in 2 blocks; 624 free (1 chunks); 1424 used: actrelationship_target_idx   index info: 2048 total in 2 blocks; 624 free (1 chunks); 1424 used: actrelationship_source_idx   index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: documentinformation_pk   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_statistic_ext_relid_index   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: docinfsubj_ndx_seii   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: docinfsubj_ndx_sbjentcodeonly   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_index_indrelid_index   index info: 2048 total in 2 blocks; 648 free (2 chunks); 1400 used: pg_db_role_setting_databaseid_rol_index   index info: 2048 total in 2 blocks; 624 free (2 chunks); 1424 used: pg_opclass_am_name_nsp_index   index info: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used: pg_foreign_data_wrapper_name_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_enum_oid_index   index info: 2048 total in 2 blocks; 680 free (2 chunks); 1368 used: pg_class_relname_nsp_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_foreign_server_oid_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_publication_pubname_index   index info: 2048 total in 2 blocks; 592 free (3 chunks); 1456 used: pg_statistic_relid_att_inh_index   index info: 2048 total in 2 blocks; 680 free (2 chunks); 1368 used: pg_cast_source_target_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_language_name_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_transform_oid_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_collation_oid_index   index info: 3072 total in 2 blocks; 1136 free (2 chunks); 1936 used: pg_amop_fam_strat_index   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_index_indexrelid_index   index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_ts_template_tmplname_index   index info: 2048 total in 2 blocks; 704 free (3 chunks); 1344 used: pg_ts_config_map_index   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_opclass_oid_index   index info: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used: pg_foreign_data_wrapper_oid_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_event_trigger_evtname_index   index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_statistic_ext_name_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_publication_oid_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_ts_dict_oid_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_event_trigger_oid_index   index info: 3072 total in 2 blocks; 1216 free (3 chunks); 1856 used: pg_conversion_default_index   index info: 3072 total in 2 blocks; 1136 free (2 chunks); 1936 used: pg_operator_oprname_l_r_n_index   index info: 2048 total in 2 blocks; 680 free (2 chunks); 1368 used: pg_trigger_tgrelid_tgname_index   index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_enum_typid_label_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_ts_config_oid_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_user_mapping_oid_index   index info: 2048 total in 2 blocks; 704 free (3 chunks); 1344 used: pg_opfamily_am_name_nsp_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_foreign_table_relid_index   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_type_oid_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_aggregate_fnoid_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_constraint_oid_index   index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_rewrite_rel_rulename_index   index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_ts_parser_prsname_index   index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_ts_config_cfgname_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_ts_parser_oid_index   index info: 2048 total in 2 blocks; 728 free (1 chunks); 1320 used: pg_publication_rel_prrelid_prpubid_index   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_operator_oid_index   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_namespace_nspname_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_ts_template_oid_index   index info: 2048 total in 2 blocks; 624 free (2 chunks); 1424 used: pg_amop_opr_fam_index   index info: 2048 total in 2 blocks; 672 free (3 chunks); 1376 used: pg_default_acl_role_nsp_obj_index   index info: 2048 total in 2 blocks; 704 free (3 chunks); 1344 used: pg_collation_name_enc_nsp_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_publication_rel_oid_index   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_range_rngtypid_index   index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_ts_dict_dictname_index   index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_type_typname_nsp_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_opfamily_oid_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_statistic_ext_oid_index   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_class_oid_index   index info: 2048 total in 2 blocks; 704 free (3 chunks); 1344 used: pg_proc_proname_args_nsp_index   index info: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used: pg_partitioned_table_partrelid_index   index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_transform_type_lang_index   index info: 2048 total in 2 blocks; 680 free (2 chunks); 1368 used: pg_attribute_relid_attnum_index   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_proc_oid_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_language_oid_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_namespace_oid_index   index info: 3072 total in 2 blocks; 1136 free (2 chunks); 1936 used: pg_amproc_fam_proc_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_foreign_server_name_index   index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_attribute_relid_attnam_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_conversion_oid_index   index info: 2048 total in 2 blocks; 728 free (1 chunks); 1320 used: pg_user_mapping_user_server_index   index info: 2048 total in 2 blocks; 728 free (1 chunks); 1320 used: pg_subscription_rel_srrelid_srsubid_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_sequence_seqrelid_index   index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_conversion_name_nsp_index   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_authid_oid_index   index info: 2048 total in 2 blocks; 728 free (1 chunks); 1320 used: pg_auth_members_member_role_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_subscription_oid_index   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_tablespace_oid_index   index info: 2048 total in 2 blocks; 704 free (3 chunks); 1344 used: pg_shseclabel_object_index   index info: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used: pg_replication_origin_roname_index   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_database_datname_index   index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_subscription_subname_index   index info: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used: pg_replication_origin_roiident_index   index info: 2048 total in 2 blocks; 728 free (1 chunks); 1320 used: pg_auth_members_role_member_index   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_database_oid_index   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_authid_rolname_index WAL record construction: 49768 total in 2 blocks; 6368 free (0 chunks); 43400 used PrivateRefCount: 8192 total in 1 blocks; 2624 free (0 chunks); 5568 used MdSmgr: 8192 total in 1 blocks; 7208 free (1 chunks); 984 used LOCALLOCK hash: 16384 total in 2 blocks; 4600 free (2 chunks); 11784 used Timezones: 104120 total in 2 blocks; 2624 free (0 chunks); 101496 used ErrorContext: 8192 total in 1 blocks; 7936 free (5 chunks); 256 used
Grand total: 2322733048 bytes in 269225 blocks; 5406896 free (16556 chunks); 2317326152 used

So what I am wondering now, is there seems to be an EXPLOSION of memory consumption near the time of the crash. That ExecutorState has 2,238,648,944 but just until the very last second(s) the RES memory as per top was 1.5 GB I swear. I looked at it. It went like this:

1.5 GB for a very long time
1.1 GB -- and I thought, yeah! it worked! it's shrinking now
and then it was gone, and there was the memory error.

So how can top tell me 1.5 GB while here the ExecutorState allocations alone have 2 GB???

And isn't even 1.5 GB way too big?

Is there a way of dumping that memory map info during normal runtime, by calling a function with the debugger? So one can see how it grows? It's like checking out memory leaks with Java where I keep looking at the heap_info summary. Tom Lane said that this ExecutorState should not grow to anything like this size, right?

-Gunther

Re: Out of Memory errors are frustrating as heck!

From

Tom Lane

Date:

16 April 2019, 02:50:51

Gunther <raj@gusw.net> writes:
> Is there a way of dumping that memory map info during normal runtime, by 
> calling a function with the debugger?

Sure, "call MemoryContextStats(TopMemoryContext)"

(or actually, since you know which context is the problematic one,
just print that one context)

            regards, tom lane

Re: Out of Memory errors are frustrating as heck!

From

Tom Lane

Date:

16 April 2019, 03:03:55

Gunther <raj@gusw.net> writes:
> So what I am wondering now, is there seems to be an EXPLOSION of memory 
> consumption near the time of the crash. That ExecutorState has 
> 2,238,648,944 but just until the very last second(s) the RES memory as 
> per top was 1.5 GB I swear.

That's not hugely surprising really, especially in a complex query.
It could be performing some preliminary join that doesn't leak, and
then when it starts to perform the join that does have the leak,
kaboom.  Also, given that you seem to be invoking multi-batch joins,
maybe the preliminary phase is fine and there's only a leak when
reading back a batch.

Anyway, the upshot is that you need to investigate what's happening
while the memory consumption is increasing.  The behavior before
that starts to happen isn't going to be very interesting.  It might
be a bit tricky to catch that if it only takes a few seconds to blow
up, but you could try "c 10000" or so to step through a lot of
AllocSetAlloc calls, repeating till the bad stuff starts to happen,
and then going back to looking at just where the calls are coming
from.

            regards, tom lane

Re: Out of Memory errors are frustrating as heck!

From

Gunther

Date:

16 April 2019, 05:23:45

I saw your replies, if there was a way of using gdb commands to have a conditional breakpoint which will only fire if the n-th caller in the chain is not a certain source location, then one could exclude the bulk of these allocations and focus better.

But I decided I try to re-factor this query. And I made an interesting observation.

There is a left outer join in parenthesis

... LEFT OUTER JOIN (SELECT ....) q ...

the biggest parenthesis. I turned this into a temporary table, tmp_bulk. Then I change the main query to

... LEFT OUTER JOIN tmp_bulk q ...

now I am running it again. But what I noticed is that the tmp_bulk table is tiny! It only has like 250 rows. So this means the vast majority of the right left rows in that join are unmatched. The plan is all different now. Heavy CPU% load. Must be merge sorting? No memory growth, not yet.

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND5394 postgres  20   0 1284448 287880 271528 R  99.3  3.6   9:21.83 postgres: postgres integrator [local] EXPLAIN5425 postgres  20   0 1278556  93184  82296 S  27.6  1.2   0:38.72 postgres: parallel worker for PID 5394

No, I never trust when a database job has high CPU% and low IO for a long time. So I do

SET ENABLE_MERGEJOIN TO OFF;

and then do it again. Now I have high IO and low CPU%.

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 5394 postgres  20   0 1280904 282036 273616 D   2.3  3.6  13:01.51 postgres: postgres integrator [local] EXPLAIN
 5510 postgres  20   0 1278892  87524  80804 D   2.3  1.1   0:05.20 postgres: parallel worker for PID 5394
 5509 postgres  20   0 1278860  87484  80796 D   2.3  1.1   0:05.30 postgres: parallel worker for PID 5394

Still I slip into the high CPU% situation, I guess I'll have to wait it out ...

... and still waiting. No growth whatsoever. The plan is now so totally different that it probably won't trigger the problem.

The original plan that causes the leak involved right joins. This one only left joins. Even after ANALYZE tmp_bulk it still comes up with the same plan. And that plan isn't quick to succeed but also doesn't trigger the memory leak.

So what I can tell is this: that growth to 1.5 GB is consistently happening. It isn't just happening in the beginning and then the rest is just a follow-up problem. Also there seems to be a final spike in growth from 1.5 GB to 2.2 GB that happens inside a second. That seems very strange.

Back to the debugger and do a better job of conditional breakpoints ... I already have an idea how I'll do that. I set a flag when I enter the

Anyway, the upshot is that you need to investigate what's happening
while the memory consumption is increasing.  The behavior before
that starts to happen isn't going to be very interesting.  It might
be a bit tricky to catch that if it only takes a few seconds to blow
up, but you could try "c 10000" or so to step through a lot of
AllocSetAlloc calls, repeating till the bad stuff starts to happen,
and then going back to looking at just where the calls are coming
from.

Isn't 1.5 GB already way too big? There are 3 phases really.

steady state at less than 500 M
slow massive growth to 1 G to 1.5 - 1.8 G
explosion within 1 second from whatever the final size of slow massive growth to the final 2.2 G

I thought that slow massive growth is already a sign of a leak?

I will now filter the calls that come through ExecHashJoinGetSavedTuple

I figured I can do this:

(gdb) info frame
Stack level 0, frame at 0x7ffcbf92fdd0:
 rip = 0x849030 in AllocSetAlloc (aset.c:718); saved rip = 0x84e7dd
 called by frame at 0x7ffcbf92fdf0
 source language c.
 Arglist at 0x7ffcbf92fdc0, args: context=0x29a6450, size=371
 Locals at 0x7ffcbf92fdc0, Previous frame's sp is 0x7ffcbf92fdd0
 Saved registers:
  rip at 0x7ffcbf92fdc8

so is the saved $rip is 0x84e7dd then we are coming this way. Therefore I set my new breakpoint like this:

(gdb) b AllocSetAlloc if  (int)strcmp(context->name, "ExecutorState") == 0 && *(int *)$rsp != 0x84e7dd
Breakpoint 6 at 0x849030: file aset.c, line 718.
(gdb) info b
Num     Type           Disp Enb Address            What
6       breakpoint     keep y   0x0000000000849030 in AllocSetAlloc at aset.c:718
        stop only if  (int)strcmp(context->name, "ExecutorState") == 0 && *(int *)$rsp != 0x84e7dd

And there we go:

Breakpoint 6, AllocSetAlloc (context=0x29a6450, size=8) at aset.c:718
718     {
(gdb) bt 8
#0  AllocSetAlloc (context=0x29a6450, size=8) at aset.c:718
#1  0x000000000084e8ad in palloc0 (size=size@entry=8) at mcxt.c:969
#2  0x0000000000702b63 in makeBufFileCommon (nfiles=nfiles@entry=1) at buffile.c:119
#3  0x0000000000702e4c in makeBufFile (firstfile=68225) at buffile.c:138
#4  BufFileCreateTemp (interXact=interXact@entry=false) at buffile.c:201
#5  0x000000000061060b in ExecHashJoinSaveTuple (tuple=0x2ba1018, hashvalue=<optimized out>, fileptr=0x6305b00) at nodeHashjoin.c:1220
#6  0x000000000060d766 in ExecHashTableInsert (hashtable=hashtable@entry=0x2b50ad8, slot=<optimized out>, hashvalue=<optimized out>)   at nodeHash.c:1663
#7  0x0000000000610c8f in ExecHashJoinNewBatch (hjstate=0x29a6be0) at nodeHashjoin.c:1051
(More stack frames follow...)

and on

(gdb) info frame
Stack level 0, frame at 0x7ffcbf92fd90:rip = 0x849030 in AllocSetAlloc (aset.c:718); saved rip = 0x84e8adcalled by frame at 0x7ffcbf92fdb0source language c.Arglist at 0x7ffcbf92fd80, args: context=0x29a6450, size=8Locals at 0x7ffcbf92fd80, Previous frame's sp is 0x7ffcbf92fd90Saved registers: rip at 0x7ffcbf92fd88
(gdb) b AllocSetAlloc if  (int)strcmp(context->name, "ExecutorState") == 0 && *(int *)$rsp != 0x84e7dd && 0x84e8ad != *(int *)$rsp
Note: breakpoint 6 also set at pc 0x849030.
Breakpoint 7 at 0x849030: file aset.c, line 718.
(gdb) delete 6

Now if I continue I don't seem to be stopping any more.

Does this help now?

-Gunther

Re: Out of Memory errors are frustrating as heck!

From

Gunther

Date:

16 April 2019, 06:33:19

It is confirmed, these two call paths are the only ones. At least probably the only ones to occur with enough of a frequency.

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
28576 postgres  20   0 2695304   1.0g 200764 R  11.3 13.8   4:20.13 postgres: postgres integrator [local] EXPLAIN
28580 postgres  20   0  646616 432784  36968 S  98.7  5.5   8:53.28 gdb -p 28576

there is a problem with gdb, it also has a memoy leak and is very expensive with the checking of my conditional breakpoint. So I can't run it all the way through.

Also here captured with

(gdb) call MemoryContextStats(TopPortalContext)

TopPortalContext: 8192 total in 1 blocks; 7664 free (0 chunks); 528 used PortalHoldContext: 24632 total in 2 blocks; 7392 free (0 chunks); 17240 used PortalContext: 1482752 total in 184 blocks; 11216 free (8 chunks); 1471536 used:   ExecutorState: 1369337168 total in 163397 blocks; 248840 free (36 chunks); 1369088328 used     HashTableContext: 32768 total in 3 blocks; 17304 free (10 chunks); 15464 used       HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used     HashTableContext: 8192 total in 1 blocks; 7320 free (0 chunks); 872 used       HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used     HashTableContext: 8192 total in 1 blocks; 7320 free (0 chunks); 872 used       HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used     HashTableContext: 8192 total in 1 blocks; 7752 free (0 chunks); 440 used       HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used     HashTableContext: 8192 total in 1 blocks; 7624 free (0 chunks); 568 used       HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used     TupleSort main: 32824 total in 2 blocks; 144 free (0 chunks); 32680 used       Caller tuples: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used     ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used     HashTableContext: 8454256 total in 6 blocks; 64848 free (32 chunks); 8389408 used       HashBatchContext: 106640 total in 3 blocks; 7936 free (0 chunks); 98704 used     TupleSort main: 452880 total in 8 blocks; 126248 free (27 chunks); 326632 used       Caller tuples: 4194304 total in 10 blocks; 1496136 free (20 chunks); 2698168 used      ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used     ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used     ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used     ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used     ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used     ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used     ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used     ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used     ...     ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used     ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used     ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used     ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used     ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
Grand total: 1384601904 bytes in 163660 blocks; 2303840 free (145 chunks); 1382298064 used

there is the developing memory leak. Now let's see if we can trace individual increments ...

(gdb) info break
Num     Type           Disp Enb Address            What
1       breakpoint     keep y   0x0000000000849030 in AllocSetAlloc at aset.c:718       stop only if  (int)strcmp(context->name, "ExecutorState") == 0 && *(int *)$rsp != 0x84e7dd && 0x84e8ad != *(int *)$rsp       breakpoint already hit 4 times
(gdb) delete 1
(gdb) break AllocSetAlloc if (int)strcmp(context->name, "ExecutorState") == 0 && *(int *)$rsp != 0x84e7dd
Breakpoint 2 at 0x849030: file aset.c, line 718.
(gdb) cont
Continuing.
^CError in testing breakpoint condition:
Quit

Breakpoint 2, AllocSetAlloc (context=0x2a1d190, size=381) at aset.c:718
718     {
(gdb) bt 4
#0  AllocSetAlloc (context=0x2a1d190, size=381) at aset.c:718
#1  0x000000000084e7dd in palloc (size=381) at mcxt.c:938
#2  0x00000000006101bc in ExecHashJoinGetSavedTuple (file=file@entry=0x4b4a198, hashvalue=hashvalue@entry=0x7ffcbf92fe5c,   tupleSlot=0x2ae0ab8, hjstate=0x2a1d920) at nodeHashjoin.c:1277
#3  0x0000000000610ca3 in ExecHashJoinNewBatch (hjstate=0x2a1d920) at nodeHashjoin.c:1042
(More stack frames follow...)
(gdb)  call MemoryContextStats(TopPortalContext)

doesn't show an increase of ExecutorState total:

TopPortalContext: 8192 total in 1 blocks; 7664 free (0 chunks); 528 used PortalHoldContext: 24632 total in 2 blocks; 7392 free (0 chunks); 17240 used PortalContext: 1482752 total in 184 blocks; 11216 free (8 chunks); 1471536 used:   ExecutorState: 1369337168 total in 163397 blocks; 248840 free (36 chunks); 1369088328 used

exact same as before:

    ExecutorState: 1369337168 total in 163397 blocks; 248840 free (36 chunks); 1369088328 used

but now we get an increase to:

    ExecutorState: 1369345496 total in 163398 blocks; 248840 free (36 chunks); 1369096656 used

after I did this:

(gdb) cont
Continuing.

Breakpoint 2, AllocSetAlloc (context=0x2a1d190, size=8) at aset.c:718
718     {
(gdb) bt 4
#0  AllocSetAlloc (context=0x2a1d190, size=8) at aset.c:718
#1  0x000000000084e8ad in palloc0 (size=size@entry=8) at mcxt.c:969
#2  0x0000000000702b63 in makeBufFileCommon (nfiles=nfiles@entry=1) at buffile.c:119
#3  0x0000000000702e4c in makeBufFile (firstfile=163423) at buffile.c:138
(More stack frames follow...)
(gdb)  call MemoryContextStats(TopPortalContext)

So now we have it confirmed don't we? No! No we have not! We stop at the entrance of the allocate method. So when I interrupted, there was no call yet. Then at the next stop the increase was from the previous.

Continuing ... this now is from a stop at the makeBufFileCommon

    ExecutorState: 1369345496 total in 163398 blocks; 248816 free (36 chunks); 1369096680 used

And again, now stopped before

    ExecutorState: 1369345496 total in 163398 blocks; 248792 free (36 chunks); 1369096704 used
   ExecutorState: 1369345496 total in 163398 blocks; 248792 free (36 chunks); 1369096704 used

I don't see a growth between individual invocations. Anyway, these are the two ways to get there:

(gdb) bt 4
#0  AllocSetAlloc (context=0x2a1d190, size=4) at aset.c:718
#1  0x000000000084e7dd in palloc (size=size@entry=4) at mcxt.c:938
#2  0x0000000000702e59 in makeBufFile (firstfile=163423) at buffile.c:140
#3  BufFileCreateTemp (interXact=interXact@entry=false) at buffile.c:201
(More stack frames follow...)
(gdb) cont
Continuing.

Breakpoint 3, AllocSetAlloc (context=0x2a1d190, size=394) at aset.c:718
718     {
(gdb) bt 3
#0  AllocSetAlloc (context=0x2a1d190, size=394) at aset.c:718
#1  0x000000000084e7dd in palloc (size=394) at mcxt.c:938
#2  0x00000000006101bc in ExecHashJoinGetSavedTuple (file=file@entry=0x4b4a198, hashvalue=hashvalue@entry=0x7ffcbf92fe5c,   tupleSlot=0x2ae0ab8, hjstate=0x2a1d920) at nodeHashjoin.c:1277
(More stack frames follow...)

But now it increased

    ExecutorState: 1369353824 total in 163399 blocks; 248792 free (36 chunks); 1369105032 used

It increases every 3 times I stop at the breakpoint.

-Gunther

Re: Out of Memory errors are frustrating as heck!

From

Tom Lane

Date:

16 April 2019, 15:30:19

Gunther <raj@gusw.net> writes:
> And there we go:

> Breakpoint 6, AllocSetAlloc (context=0x29a6450, size=8) at aset.c:718
> 718     {
> (gdb) bt 8
> #0  AllocSetAlloc (context=0x29a6450, size=8) at aset.c:718
> #1  0x000000000084e8ad in palloc0 (size=size@entry=8) at mcxt.c:969
> #2  0x0000000000702b63 in makeBufFileCommon (nfiles=nfiles@entry=1) at buffile.c:119
> #3  0x0000000000702e4c in makeBufFile (firstfile=68225) at buffile.c:138
> #4  BufFileCreateTemp (interXact=interXact@entry=false) at buffile.c:201
> #5  0x000000000061060b in ExecHashJoinSaveTuple (tuple=0x2ba1018, hashvalue=<optimized out>, fileptr=0x6305b00) at
nodeHashjoin.c:1220
> #6  0x000000000060d766 in ExecHashTableInsert (hashtable=hashtable@entry=0x2b50ad8, slot=<optimized out>,
hashvalue=<optimizedout>) 
>      at nodeHash.c:1663
> #7  0x0000000000610c8f in ExecHashJoinNewBatch (hjstate=0x29a6be0) at nodeHashjoin.c:1051

Hmm ... this matches up with a vague thought I had that for some reason
the hash join might be spawning a huge number of separate batches.
Each batch would have a couple of files with associated in-memory
state including an 8K I/O buffer, so you could account for the
"slow growth" behavior you're seeing by periodic decisions to
increase the number of batches.

You might try watching calls to ExecHashIncreaseNumBatches
and see if that theory holds water.

This could only happen with a very unfriendly distribution of the
hash keys, I think.  There's a heuristic in there to shut off
growth of nbatch if we observe that we're making no progress at
all, but perhaps this is a skewed distribution that's not quite
skewed enough to trigger that.

            regards, tom lane

Re: Out of Memory errors are frustrating as heck!

From

Jeff Janes

Date:

16 April 2019, 19:28:40

On Mon, Apr 15, 2019 at 9:49 PM Gunther <raj@gusw.net> wrote:

Jeff Janes had more
Breakpoint 2, AllocSetAlloc (context=0x1168230, size=8272) at aset.c:715
715     {
(gdb) p context->name
$8 = 0x96ce5b "ExecutorState"
I think that the above one might have been the one you wanted.
Not sure how you could tell that? It's the same place as everything else. If we can find out what you're looking for, may be we can set a break point earlier up the call chain?

It is just a hunch. That size is similar to one you reported for the last-straw allocation on the case with the nested loops rather than hash joins. So it is reasonable (but surely not guaranteed) that they are coming from the same place in the code. And since that other one occurred as the last straw, then that one must have not be part of the start up allocations, but rather the part where we should be at steady state, but are not. So maybe this one is too.

I guess I should run this for a little longer. So I disable my breakpoints
it went up pretty quick from 1.2 GB to 1.5 GB, but then it stopped growing fast, so now back to gdb and break:
Unfortunately, I think this means you missed your opportunity and are now getting backtraces of the innocent bystanders.
But why? If I see the memory still go up insanely fast, isn't that a sign for the leak?

You said it was no longer going up insanely fast by the time you got your breakpoints re-activated.

Particularly since you report that the version using nested loops rather than hash joins also leaked, so it is probably not the hash-join specific code that is doing it.
How about it's in the DISTINCT? I noticed while peeking up the call chain, that it was already in the UNIQUE sort thing also. I guess it's streaming the results from the hash join right into the unique sort step.

Isn't crashing before any rows are emitted from the hash join? I thought it was, but I am not very confident on that.

What I've done before is compile with the comments removed from
src/backend/utils/mmgr/aset.c:/* #define HAVE_ALLOCINFO */
I have just done that and it creates an insane amount of output from all the processes, I'm afraid there will be no way to keep that stuff separated. If there was a way of turning that one and off for one process only, then we could probably get more info...

Are you doing all this stuff on a production server in use by other people?

Everything is also extremely slow that way. Like in a half hour the memory didn't even reach 100 MB.
and then look for allocations sizes which are getting allocated but not freed, and then you can go back to gdb to look for allocations of those specific sizes.
I guess I should look for both, address and size to match it better.

You can analyze the log for specific addresses which are allocated but not freed, but once you find them you can't do much with them. Those specific addresses probably won't repeat on the next run, so, you so can't do anything with the knowledge. If you find a request size that is systematically analyzed but not freed, you can condition logging (or gdb) on that size.

This generates a massive amount of output, and it bypasses the logging configuration and goes directly to stderr--so it might not end up where you expect.
Yes, massive, like I said. Impossible to use. File system fills up rapidly. I made it so that it can be turned on and off, with the debugger.
int _alloc_info = 0;
#ifdef HAVE_ALLOCINFO
#define AllocFreeInfo(_cxt, _chunk) \       if(_alloc_info) \           fprintf(stderr, "AllocFree: %s: %p, %zu\n", \               (_cxt)->header.name, (_chunk), (_chunk)->size)
#define AllocAllocInfo(_cxt, _chunk) \       if(_alloc_info) \           fprintf(stderr, "AllocAlloc: %s: %p, %zu\n", \               (_cxt)->header.name, (_chunk), (_chunk)->size)
#else
#define AllocFreeInfo(_cxt, _chunk)
#define AllocAllocInfo(_cxt, _chunk)
#endif

so with this I do

(gdb) b AllocSetAlloc
(gdb) cont
(gdb) set _alloc_info=1
(gdb) disable
(gdb) cont

Thanks for this trick, I'll save this so I can refer back to it if I need to do this again some time.

then I wait, ... until it crashes again ... no, it's too much. It fills up my filesystem in no time with the logs. It produced 3 GB in just a minute of run time.

You don't need to run it until it crashes, only until the preliminaries are done and the leak has started to happen. I would think a minute would be more than enough, unless the leak doesn't start until some other join runs to completion or something. You can turn on the logging only after ExecutorStat has obtained a size which is larger than we think it has any reason to be. If you still have log around, you can still analyze it.

And also, I doubt we can find anything specifically by allocation size. It's just going to be 512 or whatever.

For allocations larger than 4096 bytes, it records the size requested, not the rounded-to-a-power-of-two size. So if the leak is for large allocations, you can get considerable specificity here. Even if not, at least you have some more of a clue than you had previously.

Isn't there some other way?

I wonder of valgrind or something like that could be of use. I don't know enough about those tools to know. One problem is that this is not really a leak. If the query completely successfully, it would have freed the memory. And when the query completed with an error, it also freed the memory. So it just an inefficiency, not a true leak, and leak-detection tools might not work. But as I said, I have not studied them.

Are you sure that you had the memory problem under a plan with no hash joins? If so, that would seem to rule out some of the ideas that Tom has been pondering. Unless there are two bugs.

Cheers,

Jeff

Re: Out of Memory errors are frustrating as heck!

From

Tom Lane

Date:

16 April 2019, 19:43:33

Jeff Janes <jeff.janes@gmail.com> writes:
> On Mon, Apr 15, 2019 at 9:49 PM Gunther <raj@gusw.net> wrote:
>> Isn't there some other way?

> I wonder of valgrind or something like that could be of use.  I don't know
> enough about those tools to know.  One problem is that this is not really a
> leak.  If the query completely successfully, it would have freed the
> memory.  And when the query completed with an error, it also freed the
> memory.  So it just an inefficiency, not a true leak, and leak-detection
> tools might not work.  But as I said, I have not studied them.

valgrind is a useful idea, given that Gunther is building his own
postgres (so he could compile it with -DUSE_VALGRIND + --enable-cassert,
which are needed to get valgrind to understand palloc allocations).
I don't recall details right now, but it is possible to trigger
a valgrind report intra-session similar to what you get by default
at process exit.  You could wait till the memory has bloated a
good deal and then ask for one of those reports that classify
allocations by call chain (I think you want the memcheck tool for
this, not the default valgrind tool).

However --- at least for the case involving hash joins, I think we
have a decent fix on the problem location already: it seems to be a
matter of continually deciding to increase nbatch, and now what we
need to investigate is why that's happening.

If there's a leak that shows up without any hash joins in the plan,
then that's a separate matter for investigation.

            regards, tom lane

Re: Out of Memory errors are frustrating as heck!

From

Gavin Flower

Date:

16 April 2019, 22:39:26

On 15/04/2019 08:23, Gunther wrote:
>
> For weeks now, I am banging my head at an "out of memory" situation. 
> There is only one query I am running on an 8 GB system, whatever I 
> try, I get knocked out on this out of memory. It is extremely 
> impenetrable to understand and fix this error. I guess I could add a 
> swap file, and then I would have to take the penalty of swapping. But 
> how can I actually address an out of memory condition if the system 
> doesn't tell me where it is happening?
>
[...]

I strongly suigest having a swap file, I've got 32GB, and I've used 
2.9GB of my swap space after 4 days, but I'm not really pushing my 
system.  For me, mostly stuff that is only used once, or not at all, is 
swapped out.  If you do have a memory leak, then it might be easier to 
diagnose, if you don't run out on Memory.

I suspect that most things will run a little better with some swap space.

Cherers,
Gavin

Re: Out of Memory errors are frustrating as heck!

From

Gunther

Date:

17 April 2019, 02:24:53

On 4/16/2019 11:30, Tom Lane wrote:

Breakpoint 6, AllocSetAlloc (context=0x29a6450, size=8) at aset.c:718
718     {
(gdb) bt 8
#0  AllocSetAlloc (context=0x29a6450, size=8) at aset.c:718
#1  0x000000000084e8ad in palloc0 (size=size@entry=8) at mcxt.c:969
#2  0x0000000000702b63 in makeBufFileCommon (nfiles=nfiles@entry=1) at buffile.c:119
#3  0x0000000000702e4c in makeBufFile (firstfile=68225) at buffile.c:138
#4  BufFileCreateTemp (interXact=interXact@entry=false) at buffile.c:201
#5  0x000000000061060b in ExecHashJoinSaveTuple (tuple=0x2ba1018, hashvalue=<optimized out>, fileptr=0x6305b00) at nodeHashjoin.c:1220
#6  0x000000000060d766 in ExecHashTableInsert (hashtable=hashtable@entry=0x2b50ad8, slot=<optimized out>, hashvalue=<optimized out>)    at nodeHash.c:1663
#7  0x0000000000610c8f in ExecHashJoinNewBatch (hjstate=0x29a6be0) at nodeHashjoin.c:1051

Hmm ... this matches up with a vague thought I had that for some reason
the hash join might be spawning a huge number of separate batches.
Each batch would have a couple of files with associated in-memory
state including an 8K I/O buffer, so you could account for the
"slow growth" behavior you're seeing by periodic decisions to
increase the number of batches.

You might try watching calls to ExecHashIncreaseNumBatches
and see if that theory holds water.

OK, checking that ... well yes, this breaks quickly into that, here is one backtrace

This could only happen with a very unfriendly distribution of the
hash keys, I think.  There's a heuristic in there to shut off
growth of nbatch if we observe that we're making no progress at
all, but perhaps this is a skewed distribution that's not quite
skewed enough to trigger that.

Your hunch is pretty right on. There is something very weirdly distributed in this particular join situation.

#0  ExecHashIncreaseNumBatches (hashtable=hashtable@entry=0x2ae8ca8) at nodeHash.c:893
#1  0x000000000060d84a in ExecHashTableInsert (hashtable=hashtable@entry=0x2ae8ca8, slot=slot@entry=0x2ae0238,   hashvalue=<optimized out>) at nodeHash.c:1655
#2  0x000000000060fd9c in MultiExecPrivateHash (node=<optimized out>) at nodeHash.c:186
#3  MultiExecHash (node=node@entry=0x2ac6dc8) at nodeHash.c:114
#4  0x00000000005fe42f in MultiExecProcNode (node=node@entry=0x2ac6dc8) at execProcnode.c:501
#5  0x000000000061073d in ExecHashJoinImpl (parallel=false, pstate=0x2a1dd40) at nodeHashjoin.c:290
#6  ExecHashJoin (pstate=0x2a1dd40) at nodeHashjoin.c:565
#7  0x00000000005fde88 in ExecProcNodeInstr (node=0x2a1dd40) at execProcnode.c:461
#8  0x000000000061ce6e in ExecProcNode (node=0x2a1dd40) at ../../../src/include/executor/executor.h:247
#9  ExecSort (pstate=0x2a1dc30) at nodeSort.c:107
#10 0x00000000005fde88 in ExecProcNodeInstr (node=0x2a1dc30) at execProcnode.c:461
#11 0x000000000061d2e4 in ExecProcNode (node=0x2a1dc30) at ../../../src/include/executor/executor.h:247
#12 ExecUnique (pstate=0x2a1d9b0) at nodeUnique.c:73
#13 0x00000000005fde88 in ExecProcNodeInstr (node=0x2a1d9b0) at execProcnode.c:461
#14 0x00000000005f75da in ExecProcNode (node=0x2a1d9b0) at ../../../src/include/executor/executor.h:247
#15 ExecutePlan (execute_once=<optimized out>, dest=0xcc60e0 <donothingDR>, direction=<optimized out>, numberTuples=0,   sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x2a1d9b0, estate=0x2a1d6c0)   at execMain.c:1723
#16 standard_ExecutorRun (queryDesc=0x2a7a478, direction=<optimized out>, count=0, execute_once=<optimized out>) at execMain.c:364
#17 0x000000000059c718 in ExplainOnePlan (plannedstmt=plannedstmt@entry=0x2a787f8, into=into@entry=0x0, es=es@entry=0x28f1048,   queryString=<optimized out>, params=0x0, queryEnv=queryEnv@entry=0x0, planduration=0x7ffcbf930080) at explain.c:535

But this is still in the warm-up phase, we don't know if it is at the place where memory grows too much.

Let's see if I can count the occurrences ... I do cont 100. Now resident memory slowly grows, but not too much just 122 kB and CPU is at 88%. I think we haven't hit the problematic part of the plan. There is a sort merge at some leaf, which I believe is innocent. My gut feeling from looking at CPU% high that we are in one of those since NL is disabled.

Next stage is that memory shot up to 264 kB and CPU% down to 8.6. Heavy IO (write and read).

Yes! And now entering the 3rd stage, where memory shots up to 600 kB. This is where it starts "breaking out". And only now the 100 breakpoint conts are used up. And within a second another 100. And even 1000 go by in a second. cont 10000 goes by in 4 seconds. And during that time resident memory increased to over 700 kB. Let's measure:

736096 + cont 10000 --> 740056, that is 3960 bytes for 10000 conts, or 0.396 bytes per cont. Prediction: cont 10000 will now arrive at 744016? Aaaand ... BINGO! 744016 exactly! cont 50000 will take about 20 seconds and will boost memory to 763816. Bets are on ... drumroll ... 35, 36 , ... nope. This time didn't pan out. Breakpoint already hit 75727 times ignore next 5585 hits ... memory now 984052. So it took longer this time and memory increment was larger. We are now getting toward the edge of the cliff. Before we do here is the backtrace now:

#0  ExecHashIncreaseNumBatches (hashtable=hashtable@entry=0x2ae8ca8) at nodeHash.c:893
#1  0x000000000060d84a in ExecHashTableInsert (hashtable=hashtable@entry=0x2ae8ca8, slot=<optimized out>, hashvalue=<optimized out>)   at nodeHash.c:1655
#2  0x0000000000610c8f in ExecHashJoinNewBatch (hjstate=0x2a1dd40) at nodeHashjoin.c:1051
#3  ExecHashJoinImpl (parallel=false, pstate=0x2a1dd40) at nodeHashjoin.c:539
#4  ExecHashJoin (pstate=0x2a1dd40) at nodeHashjoin.c:565
#5  0x00000000005fde88 in ExecProcNodeInstr (node=0x2a1dd40) at execProcnode.c:461
#6  0x000000000061ce6e in ExecProcNode (node=0x2a1dd40) at ../../../src/include/executor/executor.h:247
#7  ExecSort (pstate=0x2a1dc30) at nodeSort.c:107
(More stack frames follow...)
(gdb) bt 18
#0  ExecHashIncreaseNumBatches (hashtable=hashtable@entry=0x2ae8ca8) at nodeHash.c:893
#1  0x000000000060d84a in ExecHashTableInsert (hashtable=hashtable@entry=0x2ae8ca8, slot=<optimized out>, hashvalue=<optimized out>)   at nodeHash.c:1655
#2  0x0000000000610c8f in ExecHashJoinNewBatch (hjstate=0x2a1dd40) at nodeHashjoin.c:1051
#3  ExecHashJoinImpl (parallel=false, pstate=0x2a1dd40) at nodeHashjoin.c:539
#4  ExecHashJoin (pstate=0x2a1dd40) at nodeHashjoin.c:565
#5  0x00000000005fde88 in ExecProcNodeInstr (node=0x2a1dd40) at execProcnode.c:461
#6  0x000000000061ce6e in ExecProcNode (node=0x2a1dd40) at ../../../src/include/executor/executor.h:247
#7  ExecSort (pstate=0x2a1dc30) at nodeSort.c:107
#8  0x00000000005fde88 in ExecProcNodeInstr (node=0x2a1dc30) at execProcnode.c:461
#9  0x000000000061d2e4 in ExecProcNode (node=0x2a1dc30) at ../../../src/include/executor/executor.h:247
#10 ExecUnique (pstate=0x2a1d9b0) at nodeUnique.c:73
#11 0x00000000005fde88 in ExecProcNodeInstr (node=0x2a1d9b0) at execProcnode.c:461
#12 0x00000000005f75da in ExecProcNode (node=0x2a1d9b0) at ../../../src/include/executor/executor.h:247
#13 ExecutePlan (execute_once=<optimized out>, dest=0xcc60e0 <donothingDR>, direction=<optimized out>, numberTuples=0,   sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x2a1d9b0, estate=0x2a1d6c0)   at execMain.c:1723
#14 standard_ExecutorRun (queryDesc=0x2a7a478, direction=<optimized out>, count=0, execute_once=<optimized out>) at execMain.c:364
#15 0x000000000059c718 in ExplainOnePlan (plannedstmt=plannedstmt@entry=0x2a787f8, into=into@entry=0x0, es=es@entry=0x28f1048,   queryString=<optimized out>, params=0x0, queryEnv=queryEnv@entry=0x0, planduration=0x7ffcbf930080) at explain.c:535

By the way, I ran the explain analyze of the plan while removing all the final result columns from the outer-most select, replacing them with simply SELECT 1 FROM .... And here is that plan. I am presenting it to you because you might glean something about the whatever skewed distribution.

 Hash Right Join  (cost=4203858.53..5475530.71 rows=34619 width=4) (actual time=309603.384..459480.863 rows=113478386 loops=1)  Hash Cond: (((q.documentinternalid)::text = (documentinformationsubject.documentinternalid)::text) AND ((r.targetinternalid)::text = (documentinformationsubject.actinternalid)::text))  ->  Hash Right Join  (cost=1341053.37..2611158.36 rows=13 width=74) (actual time=109807.980..109808.040 rows=236 loops=1)        Hash Cond: (((documentinformationsubject_2.documentinternalid)::text = (q.documentinternalid)::text) AND ((documentinformationsubject_2.actinternalid)::text = (q.actinternalid)::text))        ->  Gather  (cost=30803.54..1300908.52 rows=1 width=74) (actual time=58730.915..58737.757 rows=0 loops=1)              Workers Planned: 2              Workers Launched: 2              ->  Parallel Hash Left Join  (cost=29803.54..1299908.42 rows=1 width=74) (actual time=58723.378..58723.379 rows=0 loops=3)                    Hash Cond: ((documentinformationsubject_2.otherentityinternalid)::text = (agencyid.entityinternalid)::text)                    ->  Parallel Hash Left Join  (cost=28118.13..1298223.00 rows=1 width=111) (actual time=58713.650..58713.652 rows=0 loops=3)                          Hash Cond: ((documentinformationsubject_2.otherentityinternalid)::text = (agencyname.entityinternalid)::text)                          ->  Parallel Seq Scan on documentinformationsubject documentinformationsubject_2  (cost=0.00..1268800.85 rows=1 width=111) (actual time=58544.391..58544.391 rows=0 loops=3)                                Filter: ((participationtypecode)::text = 'AUT'::text)                                Rows Removed by Filter: 2815562                          ->  Parallel Hash  (cost=24733.28..24733.28 rows=166628 width=37) (actual time=125.611..125.611 rows=133303 loops=3)                                Buckets: 65536  Batches: 16  Memory Usage: 2336kB                                ->  Parallel Seq Scan on bestname agencyname  (cost=0.00..24733.28 rows=166628 width=37) (actual time=0.009..60.685 rows=133303 loops=3)                    ->  Parallel Hash  (cost=1434.07..1434.07 rows=20107 width=37) (actual time=9.329..9.329 rows=11393 loops=3)                          Buckets: 65536  Batches: 1  Memory Usage: 2976kB                          ->  Parallel Seq Scan on entity_id agencyid  (cost=0.00..1434.07 rows=20107 width=37) (actual time=0.008..5.224 rows=11393 loops=3)        ->  Hash  (cost=1310249.63..1310249.63 rows=13 width=111) (actual time=51077.049..51077.049 rows=236 loops=1)              Buckets: 1024  Batches: 1  Memory Usage: 41kB              ->  Hash Right Join  (cost=829388.20..1310249.63 rows=13 width=111) (actual time=45607.852..51076.967 rows=236 loops=1)                    Hash Cond: ((an.actinternalid)::text = (q.actinternalid)::text)                    ->  Seq Scan on act_id an  (cost=0.00..425941.04 rows=14645404 width=37) (actual time=1.212..10883.350 rows=14676871 loops=1)                    ->  Hash  (cost=829388.19..829388.19 rows=1 width=111) (actual time=38246.715..38246.715 rows=236 loops=1)                          Buckets: 1024  Batches: 1  Memory Usage: 41kB                          ->  Gather  (cost=381928.46..829388.19 rows=1 width=111) (actual time=31274.733..38246.640 rows=236 loops=1)                                Workers Planned: 2                                Workers Launched: 2                                ->  Parallel Hash Join  (cost=380928.46..828388.09 rows=1 width=111) (actual time=31347.260..38241.812 rows=79 loops=3)                                      Hash Cond: ((q.actinternalid)::text = (r.sourceinternalid)::text)                                      ->  Parallel Seq Scan on documentinformation q  (cost=0.00..447271.93 rows=50050 width=74) (actual time=13304.439..20265.733 rows=87921 loops=3)                                            Filter: (((classcode)::text = 'CNTRCT'::text) AND ((moodcode)::text = 'EVN'::text) AND ((code_codesystem)::text = '2.16.840.1.113883.3.26.1.1'::text))                                            Rows Removed by Filter: 1540625                                      ->  Parallel Hash  (cost=380928.44..380928.44 rows=1 width=74) (actual time=17954.106..17954.106 rows=79 loops=3)                                            Buckets: 1024  Batches: 1  Memory Usage: 104kB                                            ->  Parallel Seq Scan on actrelationship r  (cost=0.00..380928.44 rows=1 width=74) (actual time=7489.704..17953.959 rows=79 loops=3)                                                  Filter: ((typecode)::text = 'SUBJ'::text)                                                  Rows Removed by Filter: 3433326  ->  Hash  (cost=2861845.87..2861845.87 rows=34619 width=74) (actual time=199792.446..199792.446 rows=113478127 loops=1)        Buckets: 65536 (originally 65536)  Batches: 131072 (originally 2)  Memory Usage: 189207kB        ->  Gather Merge  (cost=2845073.40..2861845.87 rows=34619 width=74) (actual time=107620.262..156256.432 rows=113478127 loops=1)              Workers Planned: 2              Workers Launched: 2              ->  Merge Left Join  (cost=2844073.37..2856849.96 rows=14425 width=74) (actual time=107570.719..126113.792 rows=37826042 loops=3)                    Merge Cond: (((documentinformationsubject.documentinternalid)::text = (documentinformationsubject_1.documentinternalid)::text) AND ((documentinformationsubject.documentid)::text = (documentinformationsubject_1.documentid)::text) AND ((documentinformationsubject.actinternalid)::text = (documentinformationsubject_1.actinternalid)::text))                    ->  Sort  (cost=1295969.26..1296005.32 rows=14425 width=111) (actual time=57700.723..58134.751 rows=231207 loops=3)                          Sort Key: documentinformationsubject.documentinternalid, documentinformationsubject.documentid, documentinformationsubject.actinternalid                          Sort Method: external merge  Disk: 26936kB                          Worker 0:  Sort Method: external merge  Disk: 27152kB                          Worker 1:  Sort Method: external merge  Disk: 28248kB                          ->  Parallel Seq Scan on documentinformationsubject  (cost=0.00..1294972.76 rows=14425 width=111) (actual time=24866.656..57424.420 rows=231207 loops=3)                                Filter: (((participationtypecode)::text = ANY ('{PPRF,PRF}'::text[])) AND ((classcode)::text = 'ACT'::text) AND ((moodcode)::text = 'DEF'::text) AND ((code_codesystem)::text = '2.16.840.1.113883.3.26.1.1'::text))                                Rows Removed by Filter: 2584355                    ->  Materialize  (cost=1548104.12..1553157.04 rows=1010585 width=111) (actual time=49869.984..54191.701 rows=38060250 loops=3)                          ->  Sort  (cost=1548104.12..1550630.58 rows=1010585 width=111) (actual time=49869.980..50832.205 rows=1031106 loops=3)                                Sort Key: documentinformationsubject_1.documentinternalid, documentinformationsubject_1.documentid, documentinformationsubject_1.actinternalid                                Sort Method: external merge  Disk: 122192kB                                Worker 0:  Sort Method: external merge  Disk: 122192kB                                Worker 1:  Sort Method: external merge  Disk: 122192kB                                ->  Seq Scan on documentinformationsubject documentinformationsubject_1  (cost=0.00..1329868.64 rows=1010585 width=111) (actual time=20366.166..47751.267 rows=1031106 loops=3)                                      Filter: ((participationtypecode)::text = 'PRD'::text)                                      Rows Removed by Filter: 7415579Planning Time: 2.523 msExecution Time: 464825.391 ms
(66 rows)

By the way, let me ask, do you have pretty-print functions I can call with, e.g., node in ExecProcNode, or pstate in ExecHashJoin? Because if there was, then we could determine where exactly in the current plan we are? And can I call the plan printer for the entire plan we are currently executing? Might it even give us preliminary counts of where in the process it is? (I ask the latter not only because it would be really useful for our present debugging, but also because it would be an awesome tool for monitoring of long running queries! Something I am sure tons of people would just love to have!

BTW, I also read the other responses. I agree that having a swap space available just in case is better than these annoying out of memory errors. And yes, I can add that memory profiler thing, if you think it would actually work. I've done it with java heap dumps, even upgrading the VM to a 32 GB VM just to crunch the heap dump. But can you tell me just a little more as to how I need to configure this thing to get the data you want without blowing up the memory and disk during this huge query?

regards,
-Gunther

Re: Out of Memory errors are frustrating as heck!

From

Justin Pryzby

Date:

17 April 2019, 04:46:51

I wonder if it'd be useful to compile with 
./configure CFLAGS=-DHJDEBUG=1

Re: Out of Memory errors are frustrating as heck!

From

Jean-David Beyer

Date:

17 April 2019, 06:01:58

On 4/16/19 6:39 PM, Gavin Flower wrote:
> I suspect that most things will run a little better with some swap space.

Not always.

$ free
              total       used      free  shared   buffers     cached
Mem:       16254616   13120960   3133656   20820    646676   10765380
-/+ buffers/cache:     1708904  14545712
Swap:       4095996      17436   4078560

-- 
  .~.  Jean-David Beyer
  /V\  PGP-Key:166D840A 0C610C8B
 /( )\ Shrewsbury, New Jersey
 ^^-^^ 01:50:01 up 5 days, 56 min, 2 users, load average: 4.51, 4.59, 4.90

Re: Out of Memory errors are frustrating as heck!

From

Gavin Flower

Date:

17 April 2019, 07:53:16

On 17/04/2019 18:01, Jean-David Beyer wrote:
> On 4/16/19 6:39 PM, Gavin Flower wrote:
>> I suspect that most things will run a little better with some swap space.
> Not always.
>
> $ free
>                total       used      free  shared   buffers     cached
> Mem:       16254616   13120960   3133656   20820    646676   10765380
> -/+ buffers/cache:     1708904  14545712
> Swap:       4095996      17436   4078560
>
Unclear what is the point you're trying to make, and the stats you quote 
don't enlighten me.

Re: Out of Memory errors are frustrating as heck!

From

Gunther

Date:

18 April 2019, 03:52:44

Hi guys. I don't want to be pushy, but I found it strange that after so much lively back and forth getting to the bottom of this, suddenly my last nights follow-up remained completely without reply. I wonder if it even got received. For those who read their emails with modern readers (I know I too am from a time where I wrote everything in plain text) I marked some important questions in bold.

On 4/16/2019 11:30, Tom Lane wrote:

Breakpoint 6, AllocSetAlloc (context=0x29a6450, size=8) at aset.c:718
718     {
(gdb) bt 8
#0  AllocSetAlloc (context=0x29a6450, size=8) at aset.c:718
#1  0x000000000084e8ad in palloc0 (size=size@entry=8) at mcxt.c:969
#2  0x0000000000702b63 in makeBufFileCommon (nfiles=nfiles@entry=1) at buffile.c:119
#3  0x0000000000702e4c in makeBufFile (firstfile=68225) at buffile.c:138
#4  BufFileCreateTemp (interXact=interXact@entry=false) at buffile.c:201
#5  0x000000000061060b in ExecHashJoinSaveTuple (tuple=0x2ba1018, hashvalue=<optimized out>, fileptr=0x6305b00) at nodeHashjoin.c:1220
#6  0x000000000060d766 in ExecHashTableInsert (hashtable=hashtable@entry=0x2b50ad8, slot=<optimized out>, hashvalue=<optimized out>)    at nodeHash.c:1663
#7  0x0000000000610c8f in ExecHashJoinNewBatch (hjstate=0x29a6be0) at nodeHashjoin.c:1051

Hmm ... this matches up with a vague thought I had that for some reason
the hash join might be spawning a huge number of separate batches.
Each batch would have a couple of files with associated in-memory
state including an 8K I/O buffer, so you could account for the
"slow growth" behavior you're seeing by periodic decisions to
increase the number of batches.

You might try watching calls to ExecHashIncreaseNumBatches
and see if that theory holds water.

OK, checking that ... well yes, this breaks quickly into that, here is one backtrace

#0  ExecHashIncreaseNumBatches (hashtable=hashtable@entry=0x2ae8ca8) at nodeHash.c:893
#1  0x000000000060d84a in ExecHashTableInsert (hashtable=hashtable@entry=0x2ae8ca8, slot=slot@entry=0x2ae0238,   hashvalue=<optimized out>) at nodeHash.c:1655
#2  0x000000000060fd9c in MultiExecPrivateHash (node=<optimized out>) at nodeHash.c:186
#3  MultiExecHash (node=node@entry=0x2ac6dc8) at nodeHash.c:114
#4  0x00000000005fe42f in MultiExecProcNode (node=node@entry=0x2ac6dc8) at execProcnode.c:501
#5  0x000000000061073d in ExecHashJoinImpl (parallel=false, pstate=0x2a1dd40) at nodeHashjoin.c:290
#6  ExecHashJoin (pstate=0x2a1dd40) at nodeHashjoin.c:565
#7  0x00000000005fde88 in ExecProcNodeInstr (node=0x2a1dd40) at execProcnode.c:461
#8  0x000000000061ce6e in ExecProcNode (node=0x2a1dd40) at ../../../src/include/executor/executor.h:247
#9  ExecSort (pstate=0x2a1dc30) at nodeSort.c:107
#10 0x00000000005fde88 in ExecProcNodeInstr (node=0x2a1dc30) at execProcnode.c:461
#11 0x000000000061d2e4 in ExecProcNode (node=0x2a1dc30) at ../../../src/include/executor/executor.h:247
#12 ExecUnique (pstate=0x2a1d9b0) at nodeUnique.c:73
#13 0x00000000005fde88 in ExecProcNodeInstr (node=0x2a1d9b0) at execProcnode.c:461
#14 0x00000000005f75da in ExecProcNode (node=0x2a1d9b0) at ../../../src/include/executor/executor.h:247
#15 ExecutePlan (execute_once=<optimized out>, dest=0xcc60e0 <donothingDR>, direction=<optimized out>, numberTuples=0,   sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x2a1d9b0, estate=0x2a1d6c0)   at execMain.c:1723
#16 standard_ExecutorRun (queryDesc=0x2a7a478, direction=<optimized out>, count=0, execute_once=<optimized out>) at execMain.c:364
#17 0x000000000059c718 in ExplainOnePlan (plannedstmt=plannedstmt@entry=0x2a787f8, into=into@entry=0x0, es=es@entry=0x28f1048,   queryString=<optimized out>, params=0x0, queryEnv=queryEnv@entry=0x0, planduration=0x7ffcbf930080) at explain.c:535

But this is still in the warm-up phase, we don't know if it is at the place where memory grows too much.

This could only happen with a very unfriendly distribution of the
hash keys, I think.  There's a heuristic in there to shut off
growth of nbatch if we observe that we're making no progress at
all, but perhaps this is a skewed distribution that's not quite
skewed enough to trigger that.

Your hunch is pretty right on. There is something very weirdly distributed in this particular join situation. Let's see if I can count the occurrences ... I do cont 100. Now resident memory slowly grows, but not too much just 122 kB and CPU is at 88%. I think we haven't hit the problematic part of the plan. There is a sort merge at some leaf, which I believe is innocent. My gut feeling from looking at CPU% high that we are in one of those since NL is disabled.

Next stage is that memory shot up to 264 kB and CPU% down to 8.6. Heavy IO (write and read).

#0  ExecHashIncreaseNumBatches (hashtable=hashtable@entry=0x2ae8ca8) at nodeHash.c:893
#1  0x000000000060d84a in ExecHashTableInsert (hashtable=hashtable@entry=0x2ae8ca8, slot=<optimized out>, hashvalue=<optimized out>)   at nodeHash.c:1655
#2  0x0000000000610c8f in ExecHashJoinNewBatch (hjstate=0x2a1dd40) at nodeHashjoin.c:1051
#3  ExecHashJoinImpl (parallel=false, pstate=0x2a1dd40) at nodeHashjoin.c:539
#4  ExecHashJoin (pstate=0x2a1dd40) at nodeHashjoin.c:565
#5  0x00000000005fde88 in ExecProcNodeInstr (node=0x2a1dd40) at execProcnode.c:461
#6  0x000000000061ce6e in ExecProcNode (node=0x2a1dd40) at ../../../src/include/executor/executor.h:247
#7  ExecSort (pstate=0x2a1dc30) at nodeSort.c:107
(More stack frames follow...)
(gdb) bt 18
#0  ExecHashIncreaseNumBatches (hashtable=hashtable@entry=0x2ae8ca8) at nodeHash.c:893
#1  0x000000000060d84a in ExecHashTableInsert (hashtable=hashtable@entry=0x2ae8ca8, slot=<optimized out>, hashvalue=<optimized out>)   at nodeHash.c:1655
#2  0x0000000000610c8f in ExecHashJoinNewBatch (hjstate=0x2a1dd40) at nodeHashjoin.c:1051
#3  ExecHashJoinImpl (parallel=false, pstate=0x2a1dd40) at nodeHashjoin.c:539
#4  ExecHashJoin (pstate=0x2a1dd40) at nodeHashjoin.c:565
#5  0x00000000005fde88 in ExecProcNodeInstr (node=0x2a1dd40) at execProcnode.c:461
#6  0x000000000061ce6e in ExecProcNode (node=0x2a1dd40) at ../../../src/include/executor/executor.h:247
#7  ExecSort (pstate=0x2a1dc30) at nodeSort.c:107
#8  0x00000000005fde88 in ExecProcNodeInstr (node=0x2a1dc30) at execProcnode.c:461
#9  0x000000000061d2e4 in ExecProcNode (node=0x2a1dc30) at ../../../src/include/executor/executor.h:247
#10 ExecUnique (pstate=0x2a1d9b0) at nodeUnique.c:73
#11 0x00000000005fde88 in ExecProcNodeInstr (node=0x2a1d9b0) at execProcnode.c:461
#12 0x00000000005f75da in ExecProcNode (node=0x2a1d9b0) at ../../../src/include/executor/executor.h:247
#13 ExecutePlan (execute_once=<optimized out>, dest=0xcc60e0 <donothingDR>, direction=<optimized out>, numberTuples=0,   sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x2a1d9b0, estate=0x2a1d6c0)   at execMain.c:1723
#14 standard_ExecutorRun (queryDesc=0x2a7a478, direction=<optimized out>, count=0, execute_once=<optimized out>) at execMain.c:364
#15 0x000000000059c718 in ExplainOnePlan (plannedstmt=plannedstmt@entry=0x2a787f8, into=into@entry=0x0, es=es@entry=0x28f1048,   queryString=<optimized out>, params=0x0, queryEnv=queryEnv@entry=0x0, planduration=0x7ffcbf930080) at explain.c:535

I ran the explain analyze of the plan while removing all the final result columns from the outer-most select, replacing them with simply SELECT 1 FROM .... And here is that plan. I am presenting it to you because you might glean something about the whatever skewed distribution.

 Hash Right Join  (cost=4203858.53..5475530.71 rows=34619 width=4) (actual time=309603.384..459480.863 rows=113478386 loops=1)  Hash Cond: (((q.documentinternalid)::text = (documentinformationsubject.documentinternalid)::text) AND ((r.targetinternalid)::text = (documentinformationsubject.actinternalid)::text))  ->  Hash Right Join  (cost=1341053.37..2611158.36 rows=13 width=74) (actual time=109807.980..109808.040 rows=236 loops=1)        Hash Cond: (((documentinformationsubject_2.documentinternalid)::text = (q.documentinternalid)::text) AND ((documentinformationsubject_2.actinternalid)::text = (q.actinternalid)::text))        ->  Gather  (cost=30803.54..1300908.52 rows=1 width=74) (actual time=58730.915..58737.757 rows=0 loops=1)              Workers Planned: 2              Workers Launched: 2              ->  Parallel Hash Left Join  (cost=29803.54..1299908.42 rows=1 width=74) (actual time=58723.378..58723.379 rows=0 loops=3)                    Hash Cond: ((documentinformationsubject_2.otherentityinternalid)::text = (agencyid.entityinternalid)::text)                    ->  Parallel Hash Left Join  (cost=28118.13..1298223.00 rows=1 width=111) (actual time=58713.650..58713.652 rows=0 loops=3)                          Hash Cond: ((documentinformationsubject_2.otherentityinternalid)::text = (agencyname.entityinternalid)::text)                          ->  Parallel Seq Scan on documentinformationsubject documentinformationsubject_2  (cost=0.00..1268800.85 rows=1 width=111) (actual time=58544.391..58544.391 rows=0 loops=3)                                Filter: ((participationtypecode)::text = 'AUT'::text)                                Rows Removed by Filter: 2815562                          ->  Parallel Hash  (cost=24733.28..24733.28 rows=166628 width=37) (actual time=125.611..125.611 rows=133303 loops=3)                                Buckets: 65536  Batches: 16  Memory Usage: 2336kB                                ->  Parallel Seq Scan on bestname agencyname  (cost=0.00..24733.28 rows=166628 width=37) (actual time=0.009..60.685 rows=133303 loops=3)                    ->  Parallel Hash  (cost=1434.07..1434.07 rows=20107 width=37) (actual time=9.329..9.329 rows=11393 loops=3)                          Buckets: 65536  Batches: 1  Memory Usage: 2976kB                          ->  Parallel Seq Scan on entity_id agencyid  (cost=0.00..1434.07 rows=20107 width=37) (actual time=0.008..5.224 rows=11393 loops=3)        ->  Hash  (cost=1310249.63..1310249.63 rows=13 width=111) (actual time=51077.049..51077.049 rows=236 loops=1)              Buckets: 1024  Batches: 1  Memory Usage: 41kB              ->  Hash Right Join  (cost=829388.20..1310249.63 rows=13 width=111) (actual time=45607.852..51076.967 rows=236 loops=1)                    Hash Cond: ((an.actinternalid)::text = (q.actinternalid)::text)                    ->  Seq Scan on act_id an  (cost=0.00..425941.04 rows=14645404 width=37) (actual time=1.212..10883.350 rows=14676871 loops=1)                    ->  Hash  (cost=829388.19..829388.19 rows=1 width=111) (actual time=38246.715..38246.715 rows=236 loops=1)                          Buckets: 1024  Batches: 1  Memory Usage: 41kB                          ->  Gather  (cost=381928.46..829388.19 rows=1 width=111) (actual time=31274.733..38246.640 rows=236 loops=1)                                Workers Planned: 2                                Workers Launched: 2                                ->  Parallel Hash Join  (cost=380928.46..828388.09 rows=1 width=111) (actual time=31347.260..38241.812 rows=79 loops=3)                                      Hash Cond: ((q.actinternalid)::text = (r.sourceinternalid)::text)                                      ->  Parallel Seq Scan on documentinformation q  (cost=0.00..447271.93 rows=50050 width=74) (actual time=13304.439..20265.733 rows=87921 loops=3)                                            Filter: (((classcode)::text = 'CNTRCT'::text) AND ((moodcode)::text = 'EVN'::text) AND ((code_codesystem)::text = '2.16.840.1.113883.3.26.1.1'::text))                                            Rows Removed by Filter: 1540625                                      ->  Parallel Hash  (cost=380928.44..380928.44 rows=1 width=74) (actual time=17954.106..17954.106 rows=79 loops=3)                                            Buckets: 1024  Batches: 1  Memory Usage: 104kB                                            ->  Parallel Seq Scan on actrelationship r  (cost=0.00..380928.44 rows=1 width=74) (actual time=7489.704..17953.959 rows=79 loops=3)                                                  Filter: ((typecode)::text = 'SUBJ'::text)                                                  Rows Removed by Filter: 3433326  ->  Hash  (cost=2861845.87..2861845.87 rows=34619 width=74) (actual time=199792.446..199792.446 rows=113478127 loops=1)        Buckets: 65536 (originally 65536)  Batches: 131072 (originally 2)  Memory Usage: 189207kB        ->  Gather Merge  (cost=2845073.40..2861845.87 rows=34619 width=74) (actual time=107620.262..156256.432 rows=113478127 loops=1)              Workers Planned: 2              Workers Launched: 2              ->  Merge Left Join  (cost=2844073.37..2856849.96 rows=14425 width=74) (actual time=107570.719..126113.792 rows=37826042 loops=3)                    Merge Cond: (((documentinformationsubject.documentinternalid)::text = (documentinformationsubject_1.documentinternalid)::text) AND ((documentinformationsubject.documentid)::text = (documentinformationsubject_1.documentid)::text) AND ((documentinformationsubject.actinternalid)::text = (documentinformationsubject_1.actinternalid)::text))                    ->  Sort  (cost=1295969.26..1296005.32 rows=14425 width=111) (actual time=57700.723..58134.751 rows=231207 loops=3)                          Sort Key: documentinformationsubject.documentinternalid, documentinformationsubject.documentid, documentinformationsubject.actinternalid                          Sort Method: external merge  Disk: 26936kB                          Worker 0:  Sort Method: external merge  Disk: 27152kB                          Worker 1:  Sort Method: external merge  Disk: 28248kB                          ->  Parallel Seq Scan on documentinformationsubject  (cost=0.00..1294972.76 rows=14425 width=111) (actual time=24866.656..57424.420 rows=231207 loops=3)                                Filter: (((participationtypecode)::text = ANY ('{PPRF,PRF}'::text[])) AND ((classcode)::text = 'ACT'::text) AND ((moodcode)::text = 'DEF'::text) AND ((code_codesystem)::text = '2.16.840.1.113883.3.26.1.1'::text))                                Rows Removed by Filter: 2584355                    ->  Materialize  (cost=1548104.12..1553157.04 rows=1010585 width=111) (actual time=49869.984..54191.701 rows=38060250 loops=3)                          ->  Sort  (cost=1548104.12..1550630.58 rows=1010585 width=111) (actual time=49869.980..50832.205 rows=1031106 loops=3)                                Sort Key: documentinformationsubject_1.documentinternalid, documentinformationsubject_1.documentid, documentinformationsubject_1.actinternalid                                Sort Method: external merge  Disk: 122192kB                                Worker 0:  Sort Method: external merge  Disk: 122192kB                                Worker 1:  Sort Method: external merge  Disk: 122192kB                                ->  Seq Scan on documentinformationsubject documentinformationsubject_1  (cost=0.00..1329868.64 rows=1010585 width=111) (actual time=20366.166..47751.267 rows=1031106 loops=3)                                      Filter: ((participationtypecode)::text = 'PRD'::text)                                      Rows Removed by Filter: 7415579Planning Time: 2.523 msExecution Time: 464825.391 ms
(66 rows)

I also read the other responses. I agree that having a swap space available just in case is better than these annoying out of memory errors. And yes, I can add that memory profiler thing, if you think it would actually work. I've done it with java heap dumps, even upgrading the VM to a 32 GB VM just to crunch the heap dump. But can you tell me just a little more as to how I need to configure this thing to get the data you want without blowing up the memory and disk during this huge query?

regards,
-Gunther

Re: Out of Memory errors are frustrating as heck!

From

Tomas Vondra

Date:

18 April 2019, 15:21:28

On Wed, Apr 17, 2019 at 11:52:44PM -0400, Gunther wrote:
>Hi guys. I don't want to be pushy, but I found it strange that after 
>so much lively back and forth getting to the bottom of this, suddenly 
>my last nights follow-up remained completely without reply. I wonder 
>if it even got received. For those who read their emails with modern 
>readers (I know I too am from a time where I wrote everything in plain 
>text) I marked some important questions in bold.
>

It was received (and it's visible in the archives). It's right before
easter, so I guess some people may be already on a vaction.

As for the issue - I think the current hypothesis is that the data
distribution is skewed in some strange way, triggering some unexpected
behavior in hash join. That seems plausible, but it's really hard to
investigate without knowing anything about the data distribution :-(

It would be possible to do at least one of these two things:

(a) export pg_stats info about distribution of the join keys

The number of tables involved in the query is not that high, and this
would allo us to generate a data set approximating your data. The one
thing this can't do is showing how it's affected by WHERE conditions.

(b) export data for join keys

This is similar to (a), but it would allow filtering data by the WHERE
conditions first. The amount of data would be higher, although we only
need data from the columns used as join keys.

Of course, if those key values contain sensitive data, it may not be
possible, but perhaps you could hash it in some way.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Out of Memory errors are frustrating as heck!

From

Gaetano Mendola

Date:

19 April 2019, 20:28:38

On Thu, Apr 18, 2019 at 6:01 AM Gunther <raj@gusw.net> wrote:

Hi guys. I don't want to be pushy, but I found it strange that after so much lively back and forth getting to the bottom of this, suddenly my last nights follow-up remained completely without reply. I wonder if it even got received. For those who read their emails with modern readers (I know I too am from a time where I wrote everything in plain text) I marked some important questions in bold.

Give Tom Lane remote access to your server, around 10 years ago I had the server crashing and the fastest we could do is let him access the server.

Re: Out of Memory errors are frustrating as heck!

From

Justin Pryzby

Date:

19 April 2019, 21:01:17

On Wed, Apr 17, 2019 at 11:52:44PM -0400, Gunther wrote:
> Hi guys. I don't want to be pushy, but I found it strange that after so much

Were you able to reproduce the issue in some minimized way ?  Like after
joining fewer tables or changing to join with fewer join conditions ?

On Thu, Apr 18, 2019 at 05:21:28PM +0200, Tomas Vondra wrote:
> As for the issue - I think the current hypothesis is that the data
> distribution is skewed in some strange way, triggering some unexpected
> behavior in hash join. That seems plausible, but it's really hard to
> investigate without knowing anything about the data distribution :-(
> 
> It would be possible to do at least one of these two things:
> 
> (a) export pg_stats info about distribution of the join keys

For starts, could you send the MCVs, maybe with some variation on this query ?
https://wiki.postgresql.org/wiki/Slow_Query_Questions#Statistics:_n_distinct.2C_MCV.2C_histogram

Justin

Re: Out of Memory errors are frustrating as heck!

From

Gunther

Date:

20 April 2019, 03:34:54

On 4/19/2019 17:01, Justin Pryzby wrote:

Were you able to reproduce the issue in some minimized way ?  Like after
joining fewer tables or changing to join with fewer join conditions ?

On Thu, Apr 18, 2019 at 05:21:28PM +0200, Tomas Vondra wrote:

It would be possible to do at least one of these two things:

Thanks, and sorry for my pushyness. Yes, I have pin pointed the HashJoin, and I have created the two tables involved.

The data distribution of the join keys, they are all essentially UUIDs and essentially random.

I am sharing this data with you. However, only someone who can actually control the planner can use it to reproduce the problem. I have tried but not succeeded. But I am sure the problem is reproduced by this material.

Here is the part of the plan that generates this massive number of calls to

        ->  Hash Right Join  (cost=4255031.53..5530808.71 rows=34619 width=1197)             Hash Cond: (((q.documentinternalid)::text = (documentinformationsubject.documentinternalid)::text) AND ((r.targetinternalid)::text = (documentinformationsubject.actinternalid)::text))             ->  Hash Right Join  (cost=1341541.37..2612134.36 rows=13 width=341)                   Hash Cond: (((documentinformationsubject_2.documentinternalid)::text = (q.documentinternalid)::text) AND ((documentinformationsubject_2.actinternalid)::text = (q.actinternalid)::text))                 ... let's call this tmp_q ...             ->  Hash  (cost=2908913.87..2908913.87 rows=34619 width=930)                   ->  Gather Merge  (cost=2892141.40..2908913.87 rows=34619 width=930)                       ... let's call this tmp_r ...

This can be logically reduced to the following query

SELECT * FROM tmp_q q RIGHT OUTER JOIN tmp_r r   USING(documentInternalId, actInternalId);

with the following two tables

CREATE TABLE xtmp_q (documentinternalid        character varying(255),operationqualifiercode    character varying(512),operationqualifiername    character varying(512),actinternalid             character varying(255),approvalinternalid        character varying(255),approvalnumber            character varying(555),approvalnumbersystem      character varying(555),approvalstatecode         character varying(512),approvalstatecodesystem   character varying(512),approvaleffectivetimelow  character varying(512),approvaleffectivetimehigh character varying(512),approvalstatuscode        character varying(32),licensecode               character varying(512),agencyid                  character varying(555),agencyname                text
);

CREATE TABLE tmp_r (documentinternalid      character varying(255),is_current              character(1),documentid              character varying(555),documenttypecode        character varying(512),subjectroleinternalid   character varying(255),subjectentityinternalid character varying(255),subjectentityid         character varying(555),subjectentityidroot     character varying(555),subjectentityname       character varying,subjectentitytel        text,subjectentityemail      text,otherentityinternalid   character varying(255),confidentialitycode     character varying(512),actinternalid           character varying(255),operationcode           character varying(512),operationname           text,productitemcode         character varying(512),productinternalid       character varying(255)..
);

you can download the data here (URLs just a tiny bit obfuscated):

The small table http:// gusw dot net/tmp_q.gz

The big table is in the form of 9 parts of 20 MB each, http:// gusw dot net/tmp_r.gz.00, .01, .02, ..., .09, maybe you need only the first part.

Download as many as you have patience to grab, and then import the data like this:

\copy tmp_q from program 'zcat tmp_q.gz'
\copt tmp_r from program 'cat tmp_r.gz.* |zcat'

The only problem is that I can't test that this actually would trigger the memory problem, because I can't force the plan to use the right join, it always reverts to the left join hashing the tmp_q:

        ->  Hash Left Join  (cost=10.25..5601401.19 rows=5505039 width=12118)             Hash Cond: (((r.documentinternalid)::text = (q.documentinternalid)::text) AND ((r.actinternalid)::text = (q.actinternalid)::text))             ->  Seq Scan on tmp_r r  (cost=0.00..5560089.39 rows=5505039 width=6844)             ->  Hash  (cost=10.10..10.10 rows=10 width=6306)                   ->  Seq Scan on tmp_q q  (cost=0.00..10.10 rows=10 width=6306)

which is of course much better, but when tmp_q and tmp_r are the results of complex stuff that the planner can't estimate, then it gets it wrong, and then the issue gets triggered because we are hashing on the big tmp_r, not tmp_q.

It would be so nice if there was a way to force a specific plan for purposes of the testing. I tried giving false data in pg_class reltuples and relpages:

foo=# analyze tmp_q;
ANALYZE
foo=# analyze tmp_r;
ANALYZE
foo=# select relname, relpages, reltuples from pg_class where relname in ('tmp_q', 'tmp_r');relname | relpages |  reltuples
---------+----------+-------------tmp_r   |  5505039 | 1.13467e+08tmp_q   |        7 |         236
(2 rows)

foo=# update pg_class set (relpages, reltuples) = (5505039, 1.13467e+08) where relname = 'tmp_q';
UPDATE 1
foo=# update pg_class set (relpages, reltuples) = (7, 236) where relname = 'tmp_r';
UPDATE 1

but that didn't help. Somehow the planner outsmarts every such trick, so I can't get it to follow my right outer join plan where the big table is hashed. I am sure y'all know some way to force it.

regards,
-Gunther

Re: Out of Memory errors are frustrating as heck!

From

Jeremy Schneider

Date:

20 April 2019, 05:54:13

pg_hint_plan extension might be able to force a plan.

Also, I don’t know if perf probes & perf record/script could be useful for creating a log of all the calls to do memory allocation along with the unwound call stacks? Then analyzing that file? At least this can be done for a single process, and just while the problematic sql is running.

-Jeremy

Sent from my TI-83

On Apr 19, 2019, at 20:34, Gunther <raj@gusw.net> wrote:

On 4/19/2019 17:01, Justin Pryzby wrote:
Were you able to reproduce the issue in some minimized way ?  Like after
joining fewer tables or changing to join with fewer join conditions ?

On Thu, Apr 18, 2019 at 05:21:28PM +0200, Tomas Vondra wrote:
It would be possible to do at least one of these two things:
Thanks, and sorry for my pushyness. Yes, I have pin pointed the HashJoin, and I have created the two tables involved.
The data distribution of the join keys, they are all essentially UUIDs and essentially random.
I am sharing this data with you. However, only someone who can actually control the planner can use it to reproduce the problem. I have tried but not succeeded. But I am sure the problem is reproduced by this material.
Here is the part of the plan that generates this massive number of calls to
        ->  Hash Right Join  (cost=4255031.53..5530808.71 rows=34619 width=1197)             Hash Cond: (((q.documentinternalid)::text = (documentinformationsubject.documentinternalid)::text) AND ((r.targetinternalid)::text = (documentinformationsubject.actinternalid)::text))             ->  Hash Right Join  (cost=1341541.37..2612134.36 rows=13 width=341)                   Hash Cond: (((documentinformationsubject_2.documentinternalid)::text = (q.documentinternalid)::text) AND ((documentinformationsubject_2.actinternalid)::text = (q.actinternalid)::text))                 ... let's call this tmp_q ...             ->  Hash  (cost=2908913.87..2908913.87 rows=34619 width=930)                   ->  Gather Merge  (cost=2892141.40..2908913.87 rows=34619 width=930)                       ... let's call this tmp_r ...                
This can be logically reduced to the following query
SELECT * FROM tmp_q q RIGHT OUTER JOIN tmp_r r   USING(documentInternalId, actInternalId);
with the following two tables
CREATE TABLE xtmp_q (documentinternalid        character varying(255),operationqualifiercode    character varying(512),operationqualifiername    character varying(512),actinternalid             character varying(255),approvalinternalid        character varying(255),approvalnumber            character varying(555),approvalnumbersystem      character varying(555),approvalstatecode         character varying(512),approvalstatecodesystem   character varying(512),approvaleffectivetimelow  character varying(512),approvaleffectivetimehigh character varying(512),approvalstatuscode        character varying(32),licensecode               character varying(512),agencyid                  character varying(555),agencyname                text
);

CREATE TABLE tmp_r (documentinternalid      character varying(255),is_current              character(1),documentid              character varying(555),documenttypecode        character varying(512),subjectroleinternalid   character varying(255),subjectentityinternalid character varying(255),subjectentityid         character varying(555),subjectentityidroot     character varying(555),subjectentityname       character varying,subjectentitytel        text,subjectentityemail      text,otherentityinternalid   character varying(255),confidentialitycode     character varying(512),actinternalid           character varying(255),operationcode           character varying(512),operationname           text,productitemcode         character varying(512),productinternalid       character varying(255)..
);
you can download the data here (URLs just a tiny bit obfuscated):
The small table http:// gusw dot net/tmp_q.gz
The big table is in the form of 9 parts of 20 MB each, http:// gusw dot net/tmp_r.gz.00, .01, .02, ..., .09, maybe you need only the first part.
Download as many as you have patience to grab, and then import the data like this:
\copy tmp_q from program 'zcat tmp_q.gz'
\copt tmp_r from program 'cat tmp_r.gz.* |zcat'
The only problem is that I can't test that this actually would trigger the memory problem, because I can't force the plan to use the right join, it always reverts to the left join hashing the tmp_q:
        ->  Hash Left Join  (cost=10.25..5601401.19 rows=5505039 width=12118)             Hash Cond: (((r.documentinternalid)::text = (q.documentinternalid)::text) AND ((r.actinternalid)::text = (q.actinternalid)::text))             ->  Seq Scan on tmp_r r  (cost=0.00..5560089.39 rows=5505039 width=6844)             ->  Hash  (cost=10.10..10.10 rows=10 width=6306)                   ->  Seq Scan on tmp_q q  (cost=0.00..10.10 rows=10 width=6306)
which is of course much better, but when tmp_q and tmp_r are the results of complex stuff that the planner can't estimate, then it gets it wrong, and then the issue gets triggered because we are hashing on the big tmp_r, not tmp_q.
It would be so nice if there was a way to force a specific plan for purposes of the testing. I tried giving false data in pg_class reltuples and relpages:
foo=# analyze tmp_q;
ANALYZE
foo=# analyze tmp_r;
ANALYZE
foo=# select relname, relpages, reltuples from pg_class where relname in ('tmp_q', 'tmp_r');relname | relpages |  reltuples
---------+----------+-------------tmp_r   |  5505039 | 1.13467e+08tmp_q   |        7 |         236
(2 rows)

foo=# update pg_class set (relpages, reltuples) = (5505039, 1.13467e+08) where relname = 'tmp_q';
UPDATE 1
foo=# update pg_class set (relpages, reltuples) = (7, 236) where relname = 'tmp_r';
UPDATE 1
but that didn't help. Somehow the planner outsmarts every such trick, so I can't get it to follow my right outer join plan where the big table is hashed. I am sure y'all know some way to force it.
regards,
-Gunther

Re: Out of Memory errors are frustrating as heck!

From

Justin Pryzby

Date:

20 April 2019, 07:52:57

On Fri, Apr 19, 2019 at 11:34:54PM -0400, Gunther wrote:
> On 4/19/2019 17:01, Justin Pryzby wrote:
> >Were you able to reproduce the issue in some minimized way ?  Like after
> >joining fewer tables or changing to join with fewer join conditions ?
> >
> >On Thu, Apr 18, 2019 at 05:21:28PM +0200, Tomas Vondra wrote:
> >>It would be possible to do at least one of these two things:
> 
> Thanks, and sorry for my pushyness. Yes, I have pin pointed the HashJoin,
> and I have created the two tables involved.
> 
> The data distribution of the join keys, they are all essentially UUIDs and
> essentially random.
> 
> I am sharing this data with you. However, only someone who can actually
> control the planner can use it to reproduce the problem. I have tried but
> not succeeded. But I am sure the problem is reproduced by this material.
> 
> Here is the part of the plan that generates this massive number of calls to
> 
>         ->  Hash Right Join  (cost=4255031.53..5530808.71 rows=34619 width=1197)
>               Hash Cond: (((q.documentinternalid)::text = (documentinformationsubject.documentinternalid)::text) AND
((r.targetinternalid)::text= (documentinformationsubject.actinternalid)::text))
 
>               ->  Hash Right Join  (cost=1341541.37..2612134.36 rows=13 width=341)
>                     Hash Cond: (((documentinformationsubject_2.documentinternalid)::text =
(q.documentinternalid)::text)AND ((documentinformationsubject_2.actinternalid)::text = (q.actinternalid)::text))
 
>                   ... let's call this tmp_q ...
>               ->  Hash  (cost=2908913.87..2908913.87 rows=34619 width=930)
>                     ->  Gather Merge  (cost=2892141.40..2908913.87 rows=34619 width=930)
>                         ... let's call this tmp_r ...

Would you send basic stats for these ?
q.documentinternalid, documentinformationsubject.documentinternalid, r.targetinternalid,
documentinformationsubject.actinternalid

Like from this query
https://wiki.postgresql.org/wiki/Slow_Query_Questions#Statistics:_n_distinct.2C_MCV.2C_histogram

Justin

Re: Out of Memory errors are frustrating as heck!

From

Justin Pryzby

Date:

20 April 2019, 10:53:36

> The only problem is that I can't test that this actually would trigger the
> memory problem, because I can't force the plan to use the right join, it
> always reverts to the left join hashing the tmp_q:

I think the table on the "OUTER" side is the one which needs to be iterated
over (not hashed), in order to return each of its rows even if there are no
join partners in the other table.  In your original query, the small table was
being hashed and the large table iterated; maybe that's whats important.

> which is of course much better, but when tmp_q and tmp_r are the results of
> complex stuff that the planner can't estimate, then it gets it wrong, and
> then the issue gets triggered because we are hashing on the big tmp_r, not
> tmp_q.

I was able to get something maybe promising ?  "Batches: 65536 (originally 1)"

I didn't get "Out of memory" error yet, but did crash the server with this one:
postgres=# explain analyze WITH v AS( SELECT * FROM generate_series(1,99999999)i WHERE i%10<10 AND i%11<11 AND i%12<12
ANDi%13<13 AND i%14<14 AND i%15<15 AND i%16<16 AND i%17<17 AND i%18<18 AND i%19<19 AND i%20<20 AND i%21<21 ) SELECT *
FROMgenerate_series(1,99)k JOIN v ON k=i ;
 

Note, on pg12dev this needs to be "with v AS MATERIALIZED".

postgres=# SET work_mem='128kB';SET client_min_messages =log;SET log_statement_stats=on;explain(analyze,timing off)
WITHv AS( SELECT * FROM generate_series(1,999999)i WHERE i%10<10 AND i%11<11 AND i%12<12 AND i%13<13 AND i%14<14 AND
i%15<15AND i%16<16 AND i%17<17 AND i%18<18 AND i%19<19 AND i%20<20 AND i%21<21 ) SELECT * FROM generate_series(1,99)k
JOINv ON k=i ;
 
 Hash Join  (cost=70.04..83.84 rows=5 width=8) (actual rows=99 loops=1)
   Hash Cond: (k.k = v.i)
   CTE v
     ->  Function Scan on generate_series i  (cost=0.00..70.00 rows=1 width=4) (actual rows=999999 loops=1)
           Filter: (((i % 10) < 10) AND ((i % 11) < 11) AND ((i % 12) < 12) AND ((i % 13) < 13) AND ((i % 14) < 14) AND
((i% 15) < 15) AND ((i % 16) < 16) AND ((i % 17) < 17) AND ((i % 18) < 18) AND ((i % 19) < 19) AND ((i % 20) < 20) AND
((i% 21) < 21))
 
   ->  Function Scan on generate_series k  (cost=0.00..10.00 rows=1000 width=4) (actual rows=99 loops=1)
   ->  Hash  (cost=0.02..0.02 rows=1 width=4) (actual rows=999999 loops=1)
         Buckets: 4096 (originally 1024)  Batches: 512 (originally 1)  Memory Usage: 101kB
         ->  CTE Scan on v  (cost=0.00..0.02 rows=1 width=4) (actual rows=999999 loops=1)

Justin

Re: Out of Memory errors are frustrating as heck!

From

Justin Pryzby

Date:

20 April 2019, 19:30:09

On Sun, Apr 14, 2019 at 11:24:59PM -0400, Tom Lane wrote:
> Gunther <raj@gusw.net> writes:
> >        ExecutorState: 2234123384 total in 266261 blocks; 3782328 free (17244 chunks); 2230341056 used
> 
> Oooh, that looks like a memory leak right enough.  The ExecutorState
> should not get that big for any reasonable query.

On Tue, Apr 16, 2019 at 11:30:19AM -0400, Tom Lane wrote:
> Hmm ... this matches up with a vague thought I had that for some reason
> the hash join might be spawning a huge number of separate batches.
> Each batch would have a couple of files with associated in-memory
> state including an 8K I/O buffer, so you could account for the

On Tue, Apr 16, 2019 at 10:24:53PM -0400, Gunther wrote:
>    ->  Hash  (cost=2861845.87..2861845.87 rows=34619 width=74) (actual time=199792.446..199792.446 rows=113478127
loops=1)
>          Buckets: 65536 (originally 65536)  Batches: 131072 (originally 2)  Memory Usage: 189207kB

Is it significant that there are ~2x as many ExecutorState blocks as there are
batches ?  266261/131072 => 2.03...

If there was 1 blocks leaked when batch=2, and 2 blocks leaked when batch=4,
and 4 blocks leaked when batch=131072, then when batch=16, there'd be 64k
leaked blocks, and 131072 total blocks.

I'm guessing Tom probably already thought of this, but:
2230341056/266261 => ~8376
which is pretty close to the 8kB I/O buffer you were talking about (if the
number of same-sized buffers much greater than other allocations).

If Tom thinks (as I understand) that the issue is *not* a memory leak, but out
of control increasing of nbatches, and failure to account for their size...then
this patch might help.

The number of batches is increased to avoid exceeding work_mem.  With very low
work_mem (or very larger number of tuples hashed), it'll try to use a large
number of batches.  At some point the memory used by BatchFiles structure
(increasing by powers of two) itself exceeds work_mem.

With larger work_mem, there's less need for more batches.  So the number of
batches used for small work_mem needs to be constrained, either based on
work_mem, or at all.

With my patch, the number of batches is nonlinear WRT work_mem, and reaches a
maximum for moderately small work_mem.  The goal is to choose the optimal
number of batches to minimize the degree to which work_mem is exceeded.

Justin

Attachment

limit-hash-nbatches.patch

Re: Out of Memory errors are frustrating as heck!

From

Gunther

Date:

20 April 2019, 20:00:18

On Tue, Apr 16, 2019 at 11:46:51PM -0500, Justin Pryzby wrote:

I wonder if it'd be useful to compile with 
./configure CFLAGS=-DHJDEBUG=1

Could you try this, too ?

OK, doing it now, here is what I'm getting in the log file now. I am surprised I get so few rows here when there

2019-04-20 17:12:16.077 UTC [7093] LOG:  database system was shut down at 2019-04-20 17:12:15 UTC
2019-04-20 17:12:16.085 UTC [7091] LOG:  database system is ready to accept connections
Hashjoin 0x118e0c8: initial nbatch = 1, nbuckets = 1024
Hashjoin 0x118e0f8: initial nbatch = 1, nbuckets = 1024
Hashjoin 0x1194e78: initial nbatch = 1, nbuckets = 65536
Hashjoin 0x119b518: initial nbatch = 16, nbuckets = 65536
Hashjoin 0x1194e78: initial nbatch = 1, nbuckets = 65536
Hashjoin 0x119bb38: initial nbatch = 16, nbuckets = 65536
TopMemoryContext: 4347672 total in 9 blocks; 41784 free (19 chunks); 4305888 used HandleParallelMessages: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used pgstat TabStatusArray lookup hash table: 8192 total in 1 blocks; 1456 free (0 chunks); 6736 used TopTransactionContext: 8192 total in 1 blocks; 5416 free (2 chunks); 2776 used Operator lookup cache: 24576 total in 2 blocks; 10760 free (3 chunks); 13816 used TableSpace cache: 8192 total in 1 blocks; 2096 free (0 chunks); 6096 used Type information cache: 24352 total in 2 blocks; 2624 free (0 chunks); 21728 used RowDescriptionContext: 8192 total in 1 blocks; 6896 free (0 chunks); 1296 used MessageContext: 32768 total in 3 blocks; 13488 free (1 chunks); 19280 used Operator class cache: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used smgr relation table: 32768 total in 3 blocks; 16832 free (8 chunks); 15936 used TransactionAbortContext: 32768 total in 1 blocks; 32512 free (0 chunks); 256 used Portal hash: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used TopPortalContext: 8192 total in 1 blocks; 7664 free (0 chunks); 528 used   PortalHoldContext: 24632 total in 2 blocks; 7392 free (0 chunks); 17240 used   PortalContext: 1474560 total in 183 blocks; 6152 free (8 chunks); 1468408 used:     ExecutorState: 2234501600 total in 266274 blocks; 3696112 free (17274 chunks); 2230805488 used       HashTableContext: 32768 total in 3 blocks; 17272 free (8 chunks); 15496 used         HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashTableContext: 8192 total in 1 blocks; 7320 free (0 chunks); 872 used         HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashTableContext: 8192 total in 1 blocks; 7320 free (0 chunks); 872 used         HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashTableContext: 8192 total in 1 blocks; 7752 free (0 chunks); 440 used         HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used       HashTableContext: 8192 total in 1 blocks; 7624 free (0 chunks); 568 used         HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used       TupleSort main: 286912 total in 8 blocks; 246792 free (39 chunks); 40120 used       TupleSort main: 286912 total in 8 blocks; 246792 free (39 chunks); 40120 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashTableContext: 8454256 total in 6 blocks; 64848 free (32 chunks); 8389408 used         HashBatchContext: 177003344 total in 5387 blocks; 7936 free (0 chunks); 176995408 used       TupleSort main: 452880 total in 8 blocks; 126248 free (27 chunks); 326632 used         Caller tuples: 1048576 total in 8 blocks; 21608 free (14 chunks); 1026968 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used Relcache by OID: 16384 total in 2 blocks; 3512 free (2 chunks); 12872 used CacheMemoryContext: 1101328 total in 14 blocks; 236384 free (1 chunks); 864944 used   index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_toast_2619_index   index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: entity_id_fkidx   index info: 2048 total in 2 blocks; 696 free (1 chunks); 1352 used: entity_id_idx   index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: act_id_fkidx   ...   index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_attribute_relid_attnum_index   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_class_oid_index WAL record construction: 49768 total in 2 blocks; 6368 free (0 chunks); 43400 used PrivateRefCount: 8192 total in 1 blocks; 2624 free (0 chunks); 5568 used MdSmgr: 8192 total in 1 blocks; 6152 free (1 chunks); 2040 used LOCALLOCK hash: 16384 total in 2 blocks; 4600 free (2 chunks); 11784 used Timezones: 104120 total in 2 blocks; 2624 free (0 chunks); 101496 used ErrorContext: 8192 total in 1 blocks; 7936 free (4 chunks); 256 used
Grand total: 2430295856 bytes in 272166 blocks; 5223104 free (17571 chunks); 2425072752 used
2019-04-20 17:28:56.887 UTC [7100] ERROR:  out of memory
2019-04-20 17:28:56.887 UTC [7100] DETAIL:  Failed on request of size 32800 in memory context "HashBatchContext".
2019-04-20 17:28:56.887 UTC [7100] STATEMENT:  explain analyze SELECT * FROM reports.v_BusinessOperation;

There are amazingly few entries in the log.

(gdb) break ExecHashIncreaseNumBatches
Breakpoint 1 at 0x6b7bd6: file nodeHash.c, line 884.
(gdb) cont
Continuing.

Breakpoint 1, ExecHashIncreaseNumBatches (hashtable=0x12d0818) at nodeHash.c:884
884             int                     oldnbatch = hashtable->nbatch;
(gdb) cont
Continuing.

Breakpoint 1, ExecHashIncreaseNumBatches (hashtable=0x12d0818) at nodeHash.c:884
884             int                     oldnbatch = hashtable->nbatch;
(gdb) cont
Continuing.

Breakpoint 1, ExecHashIncreaseNumBatches (hashtable=0x12d0818) at nodeHash.c:884
884             int                     oldnbatch = hashtable->nbatch;
(gdb) cont 100
Will ignore next 99 crossings of breakpoint 1.  Continuing.

But I am surprised now to find that the behavior has changed or what?

First, weirdly enough, now I am not getting any of the HJDEBUG messages any more. And yet my resident memory has already increased to 1.1 GB.

 7100 postgres  20   0 2797036   1.1g 188064 R  17.9 14.9  14:46.00 postgres: postgres integrator [local] EXPLAIN7664 postgres  20   0 1271788  16228  14408 D  17.6  0.2   0:01.96 postgres: parallel worker for PID 71007665 postgres  20   0 1271788  16224  14404 R  17.6  0.2   0:01.95 postgres: parallel worker for PID 7100

so why is it all different now? Are we chasing a red herring?

Finally I had another stop now at the HJDEBUG line

#0  ExecHashIncreaseNumBatches (hashtable=0x12d0818) at nodeHash.c:904
#1  0x00000000006b93b1 in ExecHashTableInsert (hashtable=0x12d0818, slot=0x12bcf98, hashvalue=234960700) at nodeHash.c:1655
#2  0x00000000006bd600 in ExecHashJoinNewBatch (hjstate=0x12a4340) at nodeHashjoin.c:1051
#3  0x00000000006bc999 in ExecHashJoinImpl (pstate=0x12a4340, parallel=false) at nodeHashjoin.c:539
#4  0x00000000006bca23 in ExecHashJoin (pstate=0x12a4340) at nodeHashjoin.c:565
#5  0x00000000006a191f in ExecProcNodeInstr (node=0x12a4340) at execProcnode.c:461
#6  0x00000000006ceaad in ExecProcNode (node=0x12a4340) at ../../../src/include/executor/executor.h:247
#7  0x00000000006cebe7 in ExecSort (pstate=0x12a4230) at nodeSort.c:107
#8  0x00000000006a191f in ExecProcNodeInstr (node=0x12a4230) at execProcnode.c:461
#9  0x00000000006a18f0 in ExecProcNodeFirst (node=0x12a4230) at execProcnode.c:445
#10 0x00000000006cf25c in ExecProcNode (node=0x12a4230) at ../../../src/include/executor/executor.h:247
#11 0x00000000006cf388 in ExecUnique (pstate=0x12a4040) at nodeUnique.c:73
#12 0x00000000006a191f in ExecProcNodeInstr (node=0x12a4040) at execProcnode.c:461
#13 0x00000000006a18f0 in ExecProcNodeFirst (node=0x12a4040) at execProcnode.c:445
#14 0x000000000069728b in ExecProcNode (node=0x12a4040) at ../../../src/include/executor/executor.h:247
#15 0x0000000000699790 in ExecutePlan (estate=0x12a3da0, planstate=0x12a4040, use_parallel_mode=true, operation=CMD_SELECT,   sendTuples=true, numberTuples=0, direction=ForwardScanDirection, dest=0xe811c0 <donothingDR>, execute_once=true)   at execMain.c:1723
#16 0x0000000000697757 in standard_ExecutorRun (queryDesc=0x1404168, direction=ForwardScanDirection, count=0, execute_once=true)   at execMain.c:364
#17 0x00000000006975f4 in ExecutorRun (queryDesc=0x1404168, direction=ForwardScanDirection, count=0, execute_once=true)   at execMain.c:307
#18 0x000000000060d227 in ExplainOnePlan (plannedstmt=0x1402588, into=0x0, es=0x10b03a8,   queryString=0x10866c0 "explain analyze SELECT * FROM reports.v_BusinessOperation;", params=0x0, queryEnv=0x0,   planduration=0x7fff56a0df00) at explain.c:535

So now I am back with checking who calls AllocSetAlloc?

(gdb) break AllocSetAlloc if (int)strcmp(context->name, "ExecutorState") == 0
Breakpoint 4 at 0x9eab11: file aset.c, line 719.
(gdb) cont
Continuing.

Breakpoint 4, AllocSetAlloc (context=0x12a3c90, size=8272) at aset.c:719
719             AllocSet        set = (AllocSet) context;
(gdb) bt 5
#0  AllocSetAlloc (context=0x12a3c90, size=8272) at aset.c:719
#1  0x00000000009f2e47 in palloc (size=8272) at mcxt.c:938
#2  0x000000000082ae84 in makeBufFileCommon (nfiles=1) at buffile.c:116
#3  0x000000000082af14 in makeBufFile (firstfile=34029) at buffile.c:138
#4  0x000000000082b09f in BufFileCreateTemp (interXact=false) at buffile.c:201
(More stack frames follow...)
(gdb) bt 7
#0  AllocSetAlloc (context=0x12a3c90, size=8272) at aset.c:719
#1  0x00000000009f2e47 in palloc (size=8272) at mcxt.c:938
#2  0x000000000082ae84 in makeBufFileCommon (nfiles=1) at buffile.c:116
#3  0x000000000082af14 in makeBufFile (firstfile=34029) at buffile.c:138
#4  0x000000000082b09f in BufFileCreateTemp (interXact=false) at buffile.c:201
#5  0x00000000006bda31 in ExecHashJoinSaveTuple (tuple=0x86069b8, hashvalue=234960700, fileptr=0x8616a30) at nodeHashjoin.c:1220
#6  0x00000000006b80c0 in ExecHashIncreaseNumBatches (hashtable=0x12d0818) at nodeHash.c:1004
(More stack frames follow...)
(gdb) cont
Continuing.

Breakpoint 4, AllocSetAlloc (context=0x12a3c90, size=8) at aset.c:719
719             AllocSet        set = (AllocSet) context;
(gdb) bt 7
#0  AllocSetAlloc (context=0x12a3c90, size=8) at aset.c:719
#1  0x00000000009f2f5d in palloc0 (size=8) at mcxt.c:969
#2  0x000000000082aea2 in makeBufFileCommon (nfiles=1) at buffile.c:119
#3  0x000000000082af14 in makeBufFile (firstfile=34029) at buffile.c:138
#4  0x000000000082b09f in BufFileCreateTemp (interXact=false) at buffile.c:201
#5  0x00000000006bda31 in ExecHashJoinSaveTuple (tuple=0x86069b8, hashvalue=234960700, fileptr=0x8616a30) at nodeHashjoin.c:1220
#6  0x00000000006b80c0 in ExecHashIncreaseNumBatches (hashtable=0x12d0818) at nodeHash.c:1004
(More stack frames follow...)
(gdb) cont
Continuing.

Breakpoint 4, AllocSetAlloc (context=0x12a3c90, size=4) at aset.c:719
719             AllocSet        set = (AllocSet) context;
(gdb) bt 7
#0  AllocSetAlloc (context=0x12a3c90, size=4) at aset.c:719
#1  0x00000000009f2e47 in palloc (size=4) at mcxt.c:938
#2  0x000000000082af22 in makeBufFile (firstfile=34029) at buffile.c:140
#3  0x000000000082b09f in BufFileCreateTemp (interXact=false) at buffile.c:201
#4  0x00000000006bda31 in ExecHashJoinSaveTuple (tuple=0x86069b8, hashvalue=234960700, fileptr=0x8616a30) at nodeHashjoin.c:1220
#5  0x00000000006b80c0 in ExecHashIncreaseNumBatches (hashtable=0x12d0818) at nodeHash.c:1004
#6  0x00000000006b93b1 in ExecHashTableInsert (hashtable=0x12d0818, slot=0x12bcf98, hashvalue=234960700) at nodeHash.c:1655
(More stack frames follow...)
(gdb) cont
Continuing.

Breakpoint 4, AllocSetAlloc (context=0x12a3c90, size=375) at aset.c:719
719             AllocSet        set = (AllocSet) context;
(gdb) bt 7
#0  AllocSetAlloc (context=0x12a3c90, size=375) at aset.c:719
#1  0x00000000009f2e47 in palloc (size=375) at mcxt.c:938
#2  0x00000000006bdbec in ExecHashJoinGetSavedTuple (hjstate=0x12a4340, file=0x13ca418, hashvalue=0x7fff56a0da54, tupleSlot=0x12bcf98)   at nodeHashjoin.c:1277
#3  0x00000000006bd61f in ExecHashJoinNewBatch (hjstate=0x12a4340) at nodeHashjoin.c:1042
#4  0x00000000006bc999 in ExecHashJoinImpl (pstate=0x12a4340, parallel=false) at nodeHashjoin.c:539
#5  0x00000000006bca23 in ExecHashJoin (pstate=0x12a4340) at nodeHashjoin.c:565
#6  0x00000000006a191f in ExecProcNodeInstr (node=0x12a4340) at execProcnode.c:461
(More stack frames follow...)

Now I am adding a breakpoint in AllocSetFree and I see that this gets called right after AllocSetAlloc with the same context, and presumably the pointer previously allocated.

(gdb) finish
Run till exit from #0  AllocSetAlloc (context=0x12a3c90, size=375) at aset.c:719
0x00000000009f2e47 in palloc (size=375) at mcxt.c:938
938             ret = context->methods->alloc(context, size);
Value returned is $1 = (void *) 0x11bd858
(gdb) cont
Continuing.

Breakpoint 5, AllocSetFree (context=0x12a3c90, pointer=0x11bf748) at aset.c:992
992             AllocSet        set = (AllocSet) context;
(gdb) cont
Continuing.

Breakpoint 4, AllocSetAlloc (context=0x12a3c90, size=375) at aset.c:719
719             AllocSet        set = (AllocSet) context;
(gdb) finish
Run till exit from #0  AllocSetAlloc (context=0x12a3c90, size=375) at aset.c:719
0x00000000009f2e47 in palloc (size=375) at mcxt.c:938
938             ret = context->methods->alloc(context, size);
Value returned is $2 = (void *) 0x11bf748
(gdb) cont
Continuing.

Breakpoint 5, AllocSetFree (context=0x12a3c90, pointer=0x11bd858) at aset.c:992
992             AllocSet        set = (AllocSet) context;

See, now that pointer allocated before is being freed. I can already see how I would write a memory leak detection tool. I would keep a cache of a fixed size, say one page, of recently allocated pointers, and when free is called, it would remove the pointer from the cache. Then I would only log the allocated pointer once it has to be evicted from the cache because another one needs to be added, so I don't fill my log file with a bunch of memory allocations that get freed relatively quickly.

This is very confused. I just stepped over

Breakpoint 2, ExecHashIncreaseNumBatches (hashtable=0x12d0818) at nodeHash.c:904
904             printf("Hashjoin %p: increasing nbatch to %d because space = %zu\n",
(gdb) next
908             oldcxt = MemoryContextSwitchTo(hashtable->hashCxt);
(gdb) call MemoryContextStats(TopPortalContext)

and checked my log file and there was nothing before the call MemoryContextStats(TopPortalContext) so I don't understand where this printf stuff is ending up. Clearly *some* of it is in the log, but none of that "increasing nbatch" stuff is. I think there is something wrong with that HJDEBUG stuff. Oops, my calling that MemoryContextStats really seems to help force this buffer to be flushed, because now I got more:

Hashjoin 0x118e4c8: initial nbatch = 1, nbuckets = 1024
Hashjoin 0x118e4f8: initial nbatch = 1, nbuckets = 1024
Hashjoin 0x1195278: initial nbatch = 1, nbuckets = 65536
Hashjoin 0x119b918: initial nbatch = 16, nbuckets = 65536
Hashjoin 0x1195278: initial nbatch = 1, nbuckets = 65536
Hashjoin 0x119b918: initial nbatch = 16, nbuckets = 65536
...
Hashjoin 0x13a8e68: initial nbatch = 16, nbuckets = 8192
Hashjoin 0x13a8e68: increasing nbatch to 32 because space = 4128933
Hashjoin 0x13a8e68: freed 148 of 10584 tuples, space now 4071106
Hashjoin 0x13a8e68: increasing nbatch to 64 because space = 4128826
Hashjoin 0x13a8e68: freed 544 of 10584 tuples, space now 3916296
Hashjoin 0x13a8e68: increasing nbatch to 128 because space = 4128846
Hashjoin 0x13a8e68: freed 10419 of 10585 tuples, space now 65570
Hashjoin 0x13a8e68: increasing nbatch to 256 because space = 4128829
Hashjoin 0x13a8e68: freed 10308 of 10734 tuples, space now 161815
Hashjoin 0x13a8e68: increasing nbatch to 512 because space = 4128908
Hashjoin 0x13a8e68: freed 398 of 10379 tuples, space now 3977787
Hashjoin 0x13a8e68: increasing nbatch to 1024 because space = 4129008
Hashjoin 0x13a8e68: freed 296 of 10360 tuples, space now 4013423
Hashjoin 0x13a8e68: increasing nbatch to 2048 because space = 4129133
Hashjoin 0x13a8e68: freed 154 of 10354 tuples, space now 4068786
Hashjoin 0x13a8e68: increasing nbatch to 4096 because space = 4129035
Hashjoin 0x13a8e68: freed 10167 of 10351 tuples, space now 72849
Hashjoin 0x242c9b0: initial nbatch = 1, nbuckets = 1024
Hashjoin 0x2443ee0: initial nbatch = 1, nbuckets = 1024
Hashjoin 0x2443aa0: initial nbatch = 1, nbuckets = 1024
Hashjoin 0x2443440: initial nbatch = 1, nbuckets = 65536
Hashjoin 0x2443330: initial nbatch = 16, nbuckets = 65536
Hashjoin 0x13a8e68: increasing nbatch to 8192 because space = 4128997
Hashjoin 0x12d0818: freed 10555 of 10560 tuples, space now 1983
Hashjoin 0x12d0818: increasing nbatch to 16384 because space = 4128957
Hashjoin 0x12d0818: freed 10697 of 10764 tuples, space now 25956TopPortalContext: 8192 total in 1 blocks; 7664 free (0 chunks); 528 used
...

And you see here clearly that there is a problem with this printf stdout buffer.

Now finally I am at this stage of the breakpoint firing rapidly

(gdb) cont 100
Will ignore next 99 crossings of breakpoint 1.  Continuing.

Breakpoint 1, ExecHashIncreaseNumBatches (hashtable=0x12d0818) at nodeHash.c:884
884             int                     oldnbatch = hashtable->nbatch;
(gdb) bt 7
#0  ExecHashIncreaseNumBatches (hashtable=0x12d0818) at nodeHash.c:884
#1  0x00000000006b93b1 in ExecHashTableInsert (hashtable=0x12d0818, slot=0x12bcf98, hashvalue=2161368) at nodeHash.c:1655
#2  0x00000000006bd600 in ExecHashJoinNewBatch (hjstate=0x12a4340) at nodeHashjoin.c:1051
#3  0x00000000006bc999 in ExecHashJoinImpl (pstate=0x12a4340, parallel=false) at nodeHashjoin.c:539
#4  0x00000000006bca23 in ExecHashJoin (pstate=0x12a4340) at nodeHashjoin.c:565
#5  0x00000000006a191f in ExecProcNodeInstr (node=0x12a4340) at execProcnode.c:461
#6  0x00000000006ceaad in ExecProcNode (node=0x12a4340) at ../../../src/include/executor/executor.h:247
(More stack frames follow...)

so it's not a red herring after all. But nothing new in the logfile from that HJDEBUG. Trying to flush stdout with this call MemoryContextStats(TopPortalContext)

...   ExecutorState: 782712360 total in 92954 blocks; 3626888 free (3126 chunks); 779085472 used

and nothing heard from HJDEBUG. So you really can't rely on that HJDEBUG stuff. It's not working.

I will summarize that the memory problem with the rapid firing on ExecHashIncreaseNumBatches is still occurring, confirmed as I reported earlier, it wasn't some red herring but that the HJDEBUG stuff doesn't work. Also, that there might be benefit in creating a little resident memory leak detector that keeps track of a single page of AllocSetAlloc pointers cancelled out by matching AllocSetFree and reporting the overflow with a quick little stack trace.

On 4/20/2019 6:53, Justin Pryzby wrote:

The only problem is that I can't test that this actually would trigger the
memory problem, because I can't force the plan to use the right join, it
always reverts to the left join hashing the tmp_q:

I think the table on the "OUTER" side is the one which needs to be iterated
over (not hashed), in order to return each of its rows even if there are no
join partners in the other table.  In your original query, the small table was
being hashed and the large table iterated; maybe that's whats important.

May be so. Trying to wrap my head around the RIGHT vs. LEFT outer join and why there even is a difference though.

       ->  Hash Right Join  (cost=4255031.53..5530808.71 rows=34619 width=1197)             Hash Cond: (((q.documentinternalid)::text = (documentinformationsubject.documentinternalid)::text) AND ((r.targetinternalid)::text = (documentinformationsubject.actinternalid)::text))             ->  Hash Right Join  (cost=1341541.37..2612134.36 rows=13 width=341)                   Hash Cond: (((documentinformationsubject_2.documentinternalid)::text = (q.documentinternalid)::text) AND ((documentinformationsubject_2.actinternalid)::text = (q.actinternalid)::text))             ... from the TINY table tmp_q, the estimate of 13 rows ain't bad             ->  Hash  (cost=2908913.87..2908913.87 rows=34619 width=930)                  ->  Gather Merge  (cost=2892141.40..2908913.87 rows=34619 width=930)                  ....  from the HUGE tmp_r table, with 100 million rows (estimate of 34619 is grossly wrong, but how could it know? ...

Now in a right join, we include all rows from the right table, and only those from the left table that match the join key. I wonder why not transform all of those to left joins then?

Why are we not hashing only the "optional" side?

The plan here seems to tell me without a doubt that it is hashing the big table, 100 million rows get hashed into the hash join and then we iterate over the tiny table and then ... now my mind boggles and I just don't know why there are right joins at all.

But I see the plan is running the Hash index of 100 million rows.

which is of course much better, but when tmp_q and tmp_r are the results of
complex stuff that the planner can't estimate, then it gets it wrong, and
then the issue gets triggered because we are hashing on the big tmp_r, not
tmp_q.

I was able to get something maybe promising ?  "Batches: 65536 (originally 1)"

I didn't get "Out of memory" error yet, but did crash the server with this one:
postgres=# explain analyze WITH v AS( SELECT * FROM generate_series(1,99999999)i WHERE i%10<10 AND i%11<11 AND i%12<12 AND i%13<13 AND i%14<14 AND i%15<15 AND i%16<16 AND i%17<17 AND i%18<18 AND i%19<19 AND i%20<20 AND i%21<21 ) SELECT * FROM generate_series(1,99)k JOIN v ON k=i ;

Note, on pg12dev this needs to be "with v AS MATERIALIZED".

postgres=# SET work_mem='128kB';SET client_min_messages =log;SET log_statement_stats=on;explain(analyze,timing off) WITH v AS( SELECT * FROM generate_series(1,999999)i WHERE i%10<10 AND i%11<11 AND i%12<12 AND i%13<13 AND i%14<14 AND i%15<15 AND i%16<16 AND i%17<17 AND i%18<18 AND i%19<19 AND i%20<20 AND i%21<21 ) SELECT * FROM generate_series(1,99)k JOIN v ON k=i ;Hash Join  (cost=70.04..83.84 rows=5 width=8) (actual rows=99 loops=1)  Hash Cond: (k.k = v.i)  CTE v    ->  Function Scan on generate_series i  (cost=0.00..70.00 rows=1 width=4) (actual rows=999999 loops=1)          Filter: (((i % 10) < 10) AND ((i % 11) < 11) AND ((i % 12) < 12) AND ((i % 13) < 13) AND ((i % 14) < 14) AND ((i % 15) < 15) AND ((i % 16) < 16) AND ((i % 17) < 17) AND ((i % 18) < 18) AND ((i % 19) < 19) AND ((i % 20) < 20) AND ((i % 21) < 21))  ->  Function Scan on generate_series k  (cost=0.00..10.00 rows=1000 width=4) (actual rows=99 loops=1)  ->  Hash  (cost=0.02..0.02 rows=1 width=4) (actual rows=999999 loops=1)        Buckets: 4096 (originally 1024)  Batches: 512 (originally 1)  Memory Usage: 101kB        ->  CTE Scan on v  (cost=0.00..0.02 rows=1 width=4) (actual rows=999999 loops=1)

Yes I thought that with CTEs and functions one might be able to generate a test case, but still not seing how you can trick the planner into this peculiar jointype JOIN_RIGHT and whether that is requisite for triggering the problem.

Finally, I have tried to make a pstate pretty printer in explain.c:

void
DumpPlanState(PlanState *pstate) {       ExplainState *es = NewExplainState();       ExplainNode(pstate, NIL, NULL, NULL, es);       puts(es->str->data);       pfree(es->str->data);
}

but that didn't work, because unfortunately that ExplainNode function is destructive. It would be so nice to refactor this explain code such that there would be a completely conservative function that simply dumps the present pstate with all the information about its estimate and actual situation, how many iterations it has already accomplished, how many it estimates to still have to do, whether its original estimate panned out or not, etc. This would be so tremendously useful for runtime debugging of queries. I think the utility of this can hardly be overstated. I mean even for end-user applications of some data warehouse, where you could probe a long running query every 5 seconds as to where the execution is. Man, I could not think of any more low hanging fruit useful feature. I am sure that if PostgreSQL was originally written in Java, this feature would naturally exist already.

regards and Happy Easter,
-Gunther

Re: Out of Memory errors are frustrating as heck!

From

Tomas Vondra

Date:

20 April 2019, 20:01:34

On Fri, Apr 19, 2019 at 11:34:54PM -0400, Gunther wrote:
>
> ...
>
>It would be so nice if there was a way to force a specific plan for 
>purposes of the testing.  I tried giving false data in pg_class 
>reltuples and relpages:
>
>foo=# analyze tmp_q;
>ANALYZE
>foo=# analyze tmp_r;
>ANALYZE
>foo=# select relname, relpages, reltuples from pg_class where relname in ('tmp_q', 'tmp_r');
> relname | relpages |  reltuples
>---------+----------+-------------
> tmp_r   |  5505039 | 1.13467e+08
> tmp_q   |        7 |         236
>(2 rows)
>
>foo=# update pg_class set (relpages, reltuples) = (5505039, 1.13467e+08) where relname = 'tmp_q';
>UPDATE 1
>foo=# update pg_class set (relpages, reltuples) = (7, 236) where relname = 'tmp_r';
>UPDATE 1
>
>but that didn't help. Somehow the planner outsmarts every such trick, 
>so I can't get it to follow my right outer join plan where the big 
>table is hashed.  I am sure y'all know some way to force it.
>

That does not work, because the planner does not actually use these values
directly - it only computes the density from them, and then multiplies
that to the current number of pages in the file. That behaves much nicer
when the table grows/shrinks between refreshes of the pg_class values.

So what you need to do is tweak these values to skew the density in a way
that then results in the desired esimate when multiplied with the actual
number of pages. For me, this did the trick:

  update pg_class set (relpages, reltuples) = (1000000, 1)
  where relname = 'tmp_r';

  update pg_class set (relpages, reltuples) = (1, 1000000)
  where relname = 'tmp_q';

after which I get a plan like this:

   Hash Right Join
     Hash Cond: (...)
     ->  Seq Scan on tmp_q q
     ->  Hash
           ->  Seq Scan on tmp_r r

As for the issue, I have a theory that I think would explain the issues.
It is related to the number of batches, as others speculated over here.
It's not a memory leak, though, it's just that each batch requires a lot
of extra memory and we don't account for that.

The trouble is - each batch is represented by BufFile, which is a whopping
8272 bytes, because it includes PGAlignedBlock. Now, with 131072 batches,
that's a nice 1GB of memory right there. And we don't actually account for
this memory in hashjoin code, so it's not counted against work_mem and we
just increase the number of batches.

Attached are two patches, that should help us to confirm that's actually
what's happening when running the query on actual data. The first patch
moves the BufFile stuff into a separate memory context, to make it more
obvious where the memory went. It also adds a buch of logging into the
ExecHashIncreaseNumBatches() function.

The second patch makes sure all the BufFiles are allocated right when
increasing the number of batches - otherwise we allocate them only when we
actually find a row for that batch, and I suspect the sample data shared
on this thread are somewhat correlated (I see long runs of the same UUID
value). That might slow down the memory growth. Of course, the real data
probably don't have such correlation, resulting in faster failures.

With the patch, I see stuff like this with 256k batches:

  ExecutorState: 65536 total in 4 blocks; 28136 free (4 chunks); 37400 used
    HashTableContext: 8192 total in 1 blocks; 7624 free (0 chunks); 568 used
      hash batch files: 4404002656 total in 524302 blocks; 8387928 free (20 chunks); 4395614728 used

so it's conceivable it's the root cause.

As for a fix, I'm not sure. I'm pretty sure we need to consider the amount
of memory for BufFile(s) when increasing the number of batches. But we
can't just stop incrementing the batches, because that would mean the
current batch may easily get bigger than work_mem :-(

I think we might cap the number of batches kept in memory, and at some
point start spilling data into an "overflow batch." So for example we'd
allow 32k batches, and then instead of increasing nbatch to 64k, we'd
create a single "overflow batch" representing batches 32k - 64k. After
processing the first 32k batches, we'd close the files and reuse the
memory for the next 32k batches. We'd read the overflow batch, split it
into the 32k batches, and just process them as usual. Of course, there
might be multiple rounds of this, for example we might end up with 32k
concurrent batches but 128k virtual ones, which means we'd do 4 rounds of
this dance.

It's a bit inefficient, but situations like this should be rather rare,
and it's more graceful than just crashing with OOM.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

On Sat, Apr 20, 2019 at 10:36:50PM +0200, Tomas Vondra wrote:
>On Sat, Apr 20, 2019 at 04:26:34PM -0400, Tom Lane wrote:
>>Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>>>Considering how rare this issue likely is, we need to be looking for a
>>>solution that does not break the common case.
>>
>>Agreed.  What I think we need to focus on next is why the code keeps
>>increasing the number of batches.  It seems like there must be an undue
>>amount of data all falling into the same bucket ... but if it were simply
>>a matter of a lot of duplicate hash keys, the growEnabled shutoff
>>heuristic ought to trigger.
>>
>
>I think it's really a matter of underestimate, which convinces the planner
>to hash the larger table. In this case, the table is 42GB, so it's
>possible it actually works as expected. With work_mem = 4MB I've seen 32k
>batches, and that's not that far off, I'd day. Maybe there are more common
>values, but it does not seem like a very contrived data set.
>

Actually, I might have spoken too soon. I've dne some stats on the sample
data. There are 113478127 rows in total, and while most UUIDs are unique,
there are UUIDs that represent ~10% of the data. So maybe there really is
something broken in disabling the growth.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

stats.txt

Re: Out of Memory errors are frustrating as heck!

From

Justin Pryzby

Date:

20 April 2019, 21:45:41

On Sat, Apr 20, 2019 at 04:46:03PM -0400, Tom Lane wrote:
> Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> > I think it's really a matter of underestimate, which convinces the planner
> > to hash the larger table. In this case, the table is 42GB, so it's
> > possible it actually works as expected. With work_mem = 4MB I've seen 32k
> > batches, and that's not that far off, I'd day. Maybe there are more common
> > values, but it does not seem like a very contrived data set.
> 
> Maybe we just need to account for the per-batch buffers while estimating
> the amount of memory used during planning.  That would force this case
> into a mergejoin instead, given that work_mem is set so small.

Do you mean by adding disable_cost if work_mem is so small that it's estimated
to be exceeded ?

Justin

Re: Out of Memory errors are frustrating as heck!

From

Tom Lane

Date:

20 April 2019, 22:20:15

Justin Pryzby <pryzby@telsasoft.com> writes:
> On Sat, Apr 20, 2019 at 04:46:03PM -0400, Tom Lane wrote:
>> Maybe we just need to account for the per-batch buffers while estimating
>> the amount of memory used during planning.  That would force this case
>> into a mergejoin instead, given that work_mem is set so small.

> Do you mean by adding disable_cost if work_mem is so small that it's estimated
> to be exceeded ?

No, my point is that ExecChooseHashTableSize fails to consider the
I/O buffers at all while estimating hash table size.  It's not
immediately obvious how to factor that in, but we should.

If Tomas is right that there's also an underestimate of the number
of rows here, that might not solve Gunther's immediate problem; but
it seems like a clear costing oversight.

There's also the angle that the runtime code acts as though increasing
the number of batches is free, while it clearly isn't when you think
about the I/O buffers.  So at some point we should probably stop
increasing the number of batches on the grounds of needing too many
buffers.

            regards, tom lane

Re: Out of Memory errors are frustrating as heck!

From

Tomas Vondra

Date:

20 April 2019, 23:46:35

On Sat, Apr 20, 2019 at 06:20:15PM -0400, Tom Lane wrote:
>Justin Pryzby <pryzby@telsasoft.com> writes:
>> On Sat, Apr 20, 2019 at 04:46:03PM -0400, Tom Lane wrote:
>>> Maybe we just need to account for the per-batch buffers while estimating
>>> the amount of memory used during planning.  That would force this case
>>> into a mergejoin instead, given that work_mem is set so small.
>
>> Do you mean by adding disable_cost if work_mem is so small that it's estimated
>> to be exceeded ?
>
>No, my point is that ExecChooseHashTableSize fails to consider the
>I/O buffers at all while estimating hash table size.  It's not
>immediately obvious how to factor that in, but we should.
>
>If Tomas is right that there's also an underestimate of the number
>of rows here, that might not solve Gunther's immediate problem; but
>it seems like a clear costing oversight.
>
>There's also the angle that the runtime code acts as though increasing
>the number of batches is free, while it clearly isn't when you think
>about the I/O buffers.  So at some point we should probably stop
>increasing the number of batches on the grounds of needing too many
>buffers.

Yes. I think it might be partially due to the cost being hidden elsewhere.
The hashjoin code only really deals with array of pointers to BufFile, not
with the BufFiles. And might have looked insignificant for common cases,
but clearly for these corner cases it matters quite a bit.

So yes, ExecChooseHashTableSize() needs to consider this memory and check
if doubling the number of batches has any chance of actually improving
things, because at some point the BufFile memory starts to dominate and
would just force us to do more and more batches.

But I think we should also consider this before even creating the hash
join path - see if the expected number of batches has any chance of
fitting into work_mem, and if not then just not create the path at all.
Just like we do for hash aggregate, for example. It's not going to solve
cases like this (with underestimates), but it seems reasonable. Although,
maybe we won't actually use such paths, because merge join will win thanks
to being automatically cheaper? Not sure.

Also, I wonder if we really need 8kB buffers here. Would it make sense to
allow smaller buffers in some cases? Say, 1kB. It's not going to save us,
but it's still 8x better than now.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Out of Memory errors are frustrating as heck!

From

Gunther

Date:

21 April 2019, 00:33:46

On 4/20/2019 16:01, Tomas Vondra wrote:

For me, this did the trick:
update pg_class set (relpages, reltuples) = (1000000, 1) where relname = 'tmp_r';
update pg_class set (relpages, reltuples) = (1, 1000000) where relname = 'tmp_q';

YES! For me too. My EXPLAIN ANALYZE actually succeeded.

 Hash Right Join  (cost=11009552.27..11377073.28 rows=11 width=4271) (actual time=511580.110..1058354.140 rows=113478386 loops=1)  Hash Cond: (((q.documentinternalid)::text = (r.documentinternalid)::text) AND ((q.actinternalid)::text = (r.actinternalid)::text))  ->  Seq Scan on tmp_q q  (cost=0.00..210021.00 rows=21000000 width=3417) (actual time=1.148..1.387 rows=236 loops=1)  ->  Hash  (cost=11009552.11..11009552.11 rows=11 width=928) (actual time=511577.002..511577.002 rows=113478127 loops=1)        Buckets: 16384 (originally 1024)  Batches: 131072 (originally 1)  Memory Usage: 679961kB        ->  Seq Scan on tmp_r r  (cost=0.00..11009552.11 rows=11 width=928) (actual time=4.077..344558.954 rows=113478127 loops=1)Planning Time: 0.725 msExecution Time: 1064128.721 ms

But it used a lot of resident memory, and now it seems like I actually have a leak! Because after the command returned as shown above, the memory is still allocated:

 PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND7100 postgres  20   0 2164012   1.1g 251364 S   0.0 14.5  23:27.23 postgres: postgres integrator [local] idle

and let's do the memory map dump:

2019-04-20 22:09:52.522 UTC [7100] LOG:  duration: 1064132.171 ms  statement: explain analyze       SELECT *         FROM tmp_q q         RIGHT OUTER JOIN tmp_r r           USING(documentInternalId, actInternalId);
TopMemoryContext: 153312 total in 8 blocks; 48168 free (70 chunks); 105144 used HandleParallelMessages: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used Operator lookup cache: 24576 total in 2 blocks; 10760 free (3 chunks); 13816 used TableSpace cache: 8192 total in 1 blocks; 2096 free (0 chunks); 6096 used Type information cache: 24352 total in 2 blocks; 2624 free (0 chunks); 21728 used RowDescriptionContext: 8192 total in 1 blocks; 6896 free (0 chunks); 1296 used MessageContext: 8192 total in 1 blocks; 6896 free (1 chunks); 1296 used Operator class cache: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used smgr relation table: 32768 total in 3 blocks; 16832 free (8 chunks); 15936 used TransactionAbortContext: 32768 total in 1 blocks; 32512 free (0 chunks); 256 used Portal hash: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used TopPortalContext: 8192 total in 1 blocks; 7936 free (1 chunks); 256 used Relcache by OID: 16384 total in 2 blocks; 3512 free (2 chunks); 12872 used CacheMemoryContext: 1154080 total in 20 blocks; 151784 free (1 chunks); 1002296 used   index info: 2048 total in 2 blocks; 648 free (2 chunks); 1400 used: pg_class_tblspc_relfilenode_index   index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_toast_2619_index   index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: entity_id_fkidx   index info: 2048 total in 2 blocks; 696 free (1 chunks); 1352 used: entity_id_idx   ...   index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_attribute_relid_attnum_index   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_class_oid_index WAL record construction: 49768 total in 2 blocks; 6368 free (0 chunks); 43400 used PrivateRefCount: 8192 total in 1 blocks; 2624 free (0 chunks); 5568 used MdSmgr: 8192 total in 1 blocks; 4992 free (6 chunks); 3200 used LOCALLOCK hash: 16384 total in 2 blocks; 4600 free (2 chunks); 11784 used Timezones: 104120 total in 2 blocks; 2624 free (0 chunks); 101496 used ErrorContext: 8192 total in 1 blocks; 7936 free (3 chunks); 256 used
Grand total: 2082624 bytes in 240 blocks; 382760 free (175 chunks); 1699864 used

strange, it shows no leak here. Now I run this test again, to see if the memory grows further in top? This time I also leave the DISTINCT step in the query. I am trying to hit the out of memory situation. Well, I clearly saw memory growing now:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND7100 postgres  20   0 2601900   1.5g 251976 R  97.7 19.9  36:32.23 postgres: postgres integrator [local] EXPLAIN

TopMemoryContext: 2250520 total in 9 blocks; 45384 free (56 chunks); 2205136 used pgstat TabStatusArray lookup hash table: 8192 total in 1 blocks; 1456 free (0 chunks); 6736 used TopTransactionContext: 8192 total in 1 blocks; 7528 free (1 chunks); 664 used HandleParallelMessages: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used Operator lookup cache: 24576 total in 2 blocks; 10760 free (3 chunks); 13816 used TableSpace cache: 8192 total in 1 blocks; 2096 free (0 chunks); 6096 used Type information cache: 24352 total in 2 blocks; 2624 free (0 chunks); 21728 used RowDescriptionContext: 8192 total in 1 blocks; 6896 free (0 chunks); 1296 used MessageContext: 65536 total in 4 blocks; 28664 free (9 chunks); 36872 used Operator class cache: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used smgr relation table: 32768 total in 3 blocks; 16832 free (8 chunks); 15936 used TransactionAbortContext: 32768 total in 1 blocks; 32512 free (0 chunks); 256 used Portal hash: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used TopPortalContext: 8192 total in 1 blocks; 7664 free (0 chunks); 528 used   PortalHoldContext: 24632 total in 2 blocks; 7392 free (0 chunks); 17240 used   PortalContext: 147456 total in 21 blocks; 10400 free (7 chunks); 137056 used:     ExecutorState: 489605432 total in 57794 blocks; 5522192 free (129776 chunks); 484083240 used       HashTableContext: 2162800 total in 6 blocks; 64848 free (35 chunks); 2097952 used         HashBatchContext: 706576176 total in 21503 blocks; 7936 free (0 chunks); 706568240 used       TupleSort main: 452880 total in 8 blocks; 125880 free (29 chunks); 327000 used         Caller tuples: 4194304 total in 10 blocks; 452280 free (20 chunks); 3742024 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used Relcache by OID: 16384 total in 2 blocks; 3512 free (2 chunks); 12872 used CacheMemoryContext: 1154080 total in 20 blocks; 149992 free (1 chunks); 1004088 used   index info: 2048 total in 2 blocks; 648 free (2 chunks); 1400 used: pg_class_tblspc_relfilenode_index   index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_toast_2619_index
...   index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_attribute_relid_attnum_index   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_class_oid_index WAL record construction: 49768 total in 2 blocks; 6368 free (0 chunks); 43400 used PrivateRefCount: 8192 total in 1 blocks; 2624 free (0 chunks); 5568 used MdSmgr: 8192 total in 1 blocks; 4992 free (6 chunks); 3200 used LOCALLOCK hash: 16384 total in 2 blocks; 4600 free (2 chunks); 11784 used Timezones: 104120 total in 2 blocks; 2624 free (0 chunks); 101496 used ErrorContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
Grand total: 1207458200 bytes in 79595 blocks; 6639272 free (130033 chunks); 1200818928 used

but Executor state is only 489 MB, so it is in the area of slow but massive growth. 
     ExecutorState: 489605432 total in 57794 blocks; 5522192 free (129776 chunks); 484083240 used       HashTableContext: 2162800 total in 6 blocks; 64848 free (35 chunks); 2097952 used         HashBatchContext: 706576176 total in 21503 blocks; 7936 free (0 chunks); 706568240 used       TupleSort main: 452880 total in 8 blocks; 125880 free (29 chunks); 327000 used         Caller tuples: 4194304 total in 10 blocks; 452280 free (20 chunks); 3742024 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used

now I see if I can run it to completion anyway, and if there will then be a new bottom of memory. Now the newly allocated memory seems to have been released, so we stick to the 1.1G baseline we started out with.

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND7100 postgres  20   0 2164012   1.1g 251976 D   6.3 14.5  55:26.82 postgres: postgres integrator [local] EXPLAIN

on the other hand, the sort step is not yet finished.

Also, I think while we might have focused in on a peculiar planning situation where a very unfortunate plan is chosen which stresses the memory situation, the real reason for the final out of memory situation has not yet been determined. Remember, I have seen 3 stages in my original query:

Steady state, sort-merge join active high CPU% memory at or below 100 kB
Slow massive growth from over 200 kB to 1.5 GB or 1.8 GB
Explosive growth within a second to over 2.2 GB

It might well be that my initial query would have succeeded just fine despite the unfortunate plan with the big memory consumption on the oddly planned hash join, were it not for that third phase of explosive growth. And we haven't been able to catch this, because it happens too quickly.

It seems to me that some better memory tracing would be necessary. Looking at src/backend/utils/memdebug.c, mentions Valgrind. But to me, Valgrind would be a huge toolbox to just look after one thing. I wonder if we could not make a much simpler memory leak debugger tool. One that is fast, yet doesn't provide too much output to overwhelm the log destination file system (and waste too much time). There are already Debug macros there which, if enabled, just create an absolutely crazy amount of undecipherable log file content, because ALL backend processes would spit out this blabber. So I already stopped that by adding a variable that must be set to 1 (using the debugger on exactly one process for exactly the time desired):

int _alloc_info = 0;
#ifdef HAVE_ALLOCINFO
#define AllocFreeInfo(_cxt, _chunk) \           if(_alloc_info) \                       fprintf(stderr, "AllocFree: %s: %p, %zu\n", \                               (_cxt)->header.name, (_chunk), (_chunk)->size)
#define AllocAllocInfo(_cxt, _chunk) \               if(_alloc_info) \                       fprintf(stderr, "AllocAlloc: %s: %p, %zu\n", \                               (_cxt)->header.name, (_chunk), (_chunk)->size)
#else
#define AllocFreeInfo(_cxt, _chunk)
#define AllocAllocInfo(_cxt, _chunk)
#endif

But now I am thinking that should be the hook to use a limited cache where we can cancel out AllocSetAlloc with their AllocSetFree calls that follow relatively rapidly, which apparently is the majority of the log chatter created.

The memory debugger would allocate a single fixed memory chunk like 8 or 16 kB as a cache per each memory context that is actually traced. In each we would record the memory allocation in the shortest possible way. With everything compressed. Instead of pointer, references to the memory we would store whatever memory chunk index, a very brief backtrace would be stored in a compressed form, by instruction pointer ($rip) of the caller and then variable length differences to the $rip of the caller next up. These could even be stored in an index of the 100 most common caller chains to compress this data some more, while minimizing the cost in searching. Now each allocated chunk would be kept in this table and when it fills up, the oldest allocated chunk removed and written to the log file. When freed before being evicted from the cache, the chunk gets silently removed. When a chunk is freed that is no longer in the cache, the free event is recorded in the log. That way only those chunks get written to the log files that have endured beyond the capacity of the cache. Hopefully that would be the chunks most likely involved in the memory leak. Once the log has been created, it can be loaded into PostgreSQL table itself, and analyzed to find the chunks that never get freed and from the compressed backtrace figure out where they have been allocated.

BTW, my explain analyze is still running. That Sort - Unique step is taking forever on this data.

OK, now I will try the various patches that people sent.

-Gunther

Re: Out of Memory errors are frustrating as heck!

From

Tomas Vondra

Date:

21 April 2019, 01:14:01

On Sat, Apr 20, 2019 at 08:33:46PM -0400, Gunther wrote:
>On 4/20/2019 16:01, Tomas Vondra wrote:
>>For me, this did the trick:
>> update pg_class set (relpages, reltuples) = (1000000, 1) where 
>>relname = 'tmp_r';
>> update pg_class set (relpages, reltuples) = (1, 1000000) where 
>>relname = 'tmp_q';
>>
>YES! For me too. My EXPLAIN ANALYZE actually succeeded.
>
> Hash Right Join  (cost=11009552.27..11377073.28 rows=11 width=4271) (actual time=511580.110..1058354.140
rows=113478386loops=1)
 
>   Hash Cond: (((q.documentinternalid)::text = (r.documentinternalid)::text) AND ((q.actinternalid)::text =
(r.actinternalid)::text))
>   ->  Seq Scan on tmp_q q  (cost=0.00..210021.00 rows=21000000 width=3417) (actual time=1.148..1.387 rows=236
loops=1)
>   ->  Hash  (cost=11009552.11..11009552.11 rows=11 width=928) (actual time=511577.002..511577.002 rows=113478127
loops=1)
>         Buckets: 16384 (originally 1024)  Batches: 131072 (originally 1)  Memory Usage: 679961kB
>         ->  Seq Scan on tmp_r r  (cost=0.00..11009552.11 rows=11 width=928) (actual time=4.077..344558.954
rows=113478127loops=1)
 
> Planning Time: 0.725 ms
> Execution Time: 1064128.721 ms
>
>But it used a lot of resident memory, and now it seems like I actually 
>have a leak! Because after the command returned as shown above, the 
>memory is still allocated:
>
>  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
> 7100 postgres  20   0 2164012   1.1g 251364 S   0.0 14.5  23:27.23 postgres: postgres integrator [local] idle
>
>and let's do the memory map dump:
>
>2019-04-20 22:09:52.522 UTC [7100] LOG:  duration: 1064132.171 ms  statement: explain analyze
>        SELECT *
>          FROM tmp_q q
>          RIGHT OUTER JOIN tmp_r r
>            USING(documentInternalId, actInternalId);
>TopMemoryContext: 153312 total in 8 blocks; 48168 free (70 chunks); 105144 used
>  HandleParallelMessages: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
>  Operator lookup cache: 24576 total in 2 blocks; 10760 free (3 chunks); 13816 used
>  TableSpace cache: 8192 total in 1 blocks; 2096 free (0 chunks); 6096 used
>  Type information cache: 24352 total in 2 blocks; 2624 free (0 chunks); 21728 used
>  RowDescriptionContext: 8192 total in 1 blocks; 6896 free (0 chunks); 1296 used
>  MessageContext: 8192 total in 1 blocks; 6896 free (1 chunks); 1296 used
>  Operator class cache: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used
>  smgr relation table: 32768 total in 3 blocks; 16832 free (8 chunks); 15936 used
>  TransactionAbortContext: 32768 total in 1 blocks; 32512 free (0 chunks); 256 used
>  Portal hash: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used
>  TopPortalContext: 8192 total in 1 blocks; 7936 free (1 chunks); 256 used
>  Relcache by OID: 16384 total in 2 blocks; 3512 free (2 chunks); 12872 used
>  CacheMemoryContext: 1154080 total in 20 blocks; 151784 free (1 chunks); 1002296 used
>    index info: 2048 total in 2 blocks; 648 free (2 chunks); 1400 used: pg_class_tblspc_relfilenode_index
>    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_toast_2619_index
>    index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: entity_id_fkidx
>    index info: 2048 total in 2 blocks; 696 free (1 chunks); 1352 used: entity_id_idx
>    ...
>    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_attribute_relid_attnum_index
>    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_class_oid_index
>  WAL record construction: 49768 total in 2 blocks; 6368 free (0 chunks); 43400 used
>  PrivateRefCount: 8192 total in 1 blocks; 2624 free (0 chunks); 5568 used
>  MdSmgr: 8192 total in 1 blocks; 4992 free (6 chunks); 3200 used
>  LOCALLOCK hash: 16384 total in 2 blocks; 4600 free (2 chunks); 11784 used
>  Timezones: 104120 total in 2 blocks; 2624 free (0 chunks); 101496 used
>  ErrorContext: 8192 total in 1 blocks; 7936 free (3 chunks); 256 used
>Grand total: 2082624 bytes in 240 blocks; 382760 free (175 chunks); 1699864 used
>
>strange, it shows no leak here. Now I run this test again, to see if 
>the memory grows further in top? This time I also leave the DISTINCT 
>step in the query. I am trying to hit the out of memory situation. 
>Well, I clearly saw memory growing now:
>

Unfortunately, interpreting RES is way more complicated. The trouble is
PostgreSQL does not get memory from kernel directly, it gets it through
glibc. So when we do free(), it's not guaranteed kernel gets it.

Also, I think glibc has multiple ways of getting memory from the kernel.
It can either to mmap or sbrk, and AFAIK it's easy to cause "islands" that
make it impossible to undo sbrk after freeing memory.

Memory leaks in PostgreSQL are usually about allocating memory in the
wrong context, and so are visible in MemoryContextStats. Permanent leaks
that don't show there are quite rare.

>
>Also, I think while we might have focused in on a peculiar planning 
>situation where a very unfortunate plan is chosen which stresses the 
>memory situation, the real reason for the final out of memory 
>situation has not yet been determined. Remember, I have seen 3 stages 
>in my original query:
>
>1. Steady state, sort-merge join active high CPU% memory at or below 100 kB
>2. Slow massive growth from over 200 kB to 1.5 GB or 1.8 GB
>3. Explosive growth within a second to over 2.2 GB
>
>It might well be that my initial query would have succeeded just fine 
>despite the unfortunate plan with the big memory consumption on the 
>oddly planned hash join, were it not for that third phase of explosive 
>growth. And we haven't been able to catch this, because it happens too 
>quickly.
>
>It seems to me that some better memory tracing would be necessary. 
>Looking at src/backend/utils/memdebug.c, mentions Valgrind. But to me, 
>Valgrind would be a huge toolbox to just look after one thing. I 
>wonder if we could not make a much simpler memory leak debugger tool.  
>One that is fast,  yet doesn't provide too much output to overwhelm 
>the log destination file system (and waste too much time). There are 
>already Debug macros there which, if enabled, just create an 
>absolutely crazy amount of undecipherable log file content, because 
>ALL backend processes would spit out this blabber. So I already 
>stopped that by adding a variable that must be set to 1 (using the 
>debugger on exactly one process for exactly the time desired):
>
>int _alloc_info = 0;
>#ifdef HAVE_ALLOCINFO
>#define AllocFreeInfo(_cxt, _chunk) \
>            if(_alloc_info) \
>                        fprintf(stderr, "AllocFree: %s: %p, %zu\n", \
>                                (_cxt)->header.name, (_chunk), (_chunk)->size)
>#define AllocAllocInfo(_cxt, _chunk) \
>                if(_alloc_info) \
>                        fprintf(stderr, "AllocAlloc: %s: %p, %zu\n", \
>                                (_cxt)->header.name, (_chunk), (_chunk)->size)
>#else
>#define AllocFreeInfo(_cxt, _chunk)
>#define AllocAllocInfo(_cxt, _chunk)
>#endif
>
>But now I am thinking that should be the hook to use a limited cache 
>where we can cancel out AllocSetAlloc with their AllocSetFree calls 
>that follow relatively rapidly, which apparently is the majority of 
>the log chatter created.
>
>The memory debugger would allocate a single fixed memory chunk like 8 
>or 16 kB as a cache per each memory context that is actually traced. 
>In each we would record the memory allocation in the shortest possible 
>way. With everything compressed. Instead of pointer, references to the 
>memory we would store whatever memory chunk index, a very brief 
>backtrace would be stored in a compressed form, by instruction pointer 
>($rip) of the caller and then variable length differences to the $rip 
>of the caller next up. These could even be stored in an index of the 
>100 most common caller chains to compress this data some more, while 
>minimizing the cost in searching. Now each allocated chunk would be 
>kept in this table and when it fills up, the oldest allocated chunk
>removed and written to the log file. When freed before being evicted 
>from the cache, the chunk gets silently removed. When a chunk is freed 
>that is no longer in the cache, the free event is recorded in the log. 
>That way only those chunks get written to the log files that have 
>endured beyond the capacity of the cache. Hopefully that would be the 
>chunks most likely involved in the memory leak. Once the log has been 
>created, it can be loaded into PostgreSQL table itself, and analyzed 
>to find the chunks that never get freed and from the compressed 
>backtrace figure out where they have been allocated.
>
>BTW, my explain analyze is still running. That Sort - Unique step is 
>taking forever on this data.
>
>OK, now I will try the various patches that people sent.
>

Maybe. But before wasting any more time on the memory leak investigation,
I suggest you first try the patch moving the BufFile allocations to a
separate context. That'll either confirm or disprove the theory.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Out of Memory errors are frustrating as heck!

From

Gunther

Date:

21 April 2019, 05:03:50

On 4/20/2019 21:14, Tomas Vondra wrote:

Maybe. But before wasting any more time on the memory leak investigation,
I suggest you first try the patch moving the BufFile allocations to a
separate context. That'll either confirm or disprove the theory.

OK, fair enough. So, first patch 0001-* applied, recompiled and

2019-04-21 04:08:04.364 UTC [11304] LOG:  server process (PID 11313) was terminated by signal 11: Segmentation fault
2019-04-21 04:08:04.364 UTC [11304] DETAIL:  Failed process was running: explain analyze select * from reports.v_BusinessOperation;
2019-04-21 04:08:04.364 UTC [11304] LOG:  terminating any other active server processes
2019-04-21 04:08:04.368 UTC [11319] FATAL:  the database system is in recovery mode
2019-04-21 04:08:04.368 UTC [11315] WARNING:  terminating connection because of crash of another server process

SIGSEGV ... and with the core dump that I have I can tell you where:

Core was generated by `postgres: postgres integrator'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00000000009f300c in palloc (size=8272) at mcxt.c:936
936             context->isReset = false;
(gdb) bt
#0  0x00000000009f300c in palloc (size=8272) at mcxt.c:936
#1  0x000000000082b068 in makeBufFileCommon (nfiles=1) at buffile.c:116
#2  0x000000000082b0f8 in makeBufFile (firstfile=73) at buffile.c:138
#3  0x000000000082b283 in BufFileCreateTemp (interXact=false) at buffile.c:201
#4  0x00000000006bdc15 in ExecHashJoinSaveTuple (tuple=0x1c5a468, hashvalue=3834163156, fileptr=0x18a3730) at nodeHashjoin.c:1227
#5  0x00000000006b9568 in ExecHashTableInsert (hashtable=0x188fb88, slot=0x1877a18, hashvalue=3834163156) at nodeHash.c:1701
#6  0x00000000006b6c39 in MultiExecPrivateHash (node=0x1862168) at nodeHash.c:186
#7  0x00000000006b6aec in MultiExecHash (node=0x1862168) at nodeHash.c:114
#8  0x00000000006a19cc in MultiExecProcNode (node=0x1862168) at execProcnode.c:501
#9  0x00000000006bc5d2 in ExecHashJoinImpl (pstate=0x17b90e0, parallel=false) at nodeHashjoin.c:290
...
(gdb) info frame
Stack level 0, frame at 0x7fffd5d4dc80:rip = 0x9f300c in palloc (mcxt.c:936); saved rip = 0x82b068called by frame at 0x7fffd5d4dcb0source language c.Arglist at 0x7fffd5d4dc70, args: size=8272Locals at 0x7fffd5d4dc70, Previous frame's sp is 0x7fffd5d4dc80Saved registers: rbx at 0x7fffd5d4dc60, rbp at 0x7fffd5d4dc70, r12 at 0x7fffd5d4dc68, rip at 0x7fffd5d4dc78

and I have confirmed that this is while working the main T_HashJoin with jointype JOIN_RIGHT.

So now I am assuming that perhaps you want both of these patches applied. So applied it, and retried and boom, same thing same place.

turns out the MemoryContext is NULL:

(gdb) p context
$1 = (MemoryContext) 0x0

all patches applied cleanly (with the -p1 option). I see no .rej file, but also no .orig file, not sure why that version of patch didn't create them. But I paid attention and know that there was no error.

-Gunther

Re: Out of Memory errors are frustrating as heck!

From

Gunther

Date:

21 April 2019, 05:31:06

I am now running Justin's patch after undoing Tomas' patches and any of my own hacks (which would not have interfered with Tomas' patch)

On 4/20/2019 15:30, Justin Pryzby wrote:

With my patch, the number of batches is nonlinear WRT work_mem, and reaches a
maximum for moderately small work_mem.  The goal is to choose the optimal
number of batches to minimize the degree to which work_mem is exceeded.

Now I seem to be in that slow massive growth phase or maybe still in an earlier step, but I can see the top RES behavior different already. The size lingers around 400 MB. But then it's growing too, at high CPU%, goes past 700, 800, 900 MB now 1.5 GB, 1.7 GB, 1.8 GB, 1.9 GB, 2.0 GB, 2.1, and still 98% CPU. 2.4 GB, wow, it has never been that big ... and BOOM!

TopMemoryContext: 120544 total in 7 blocks; 9760 free (7 chunks); 110784 used HandleParallelMessages: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used TableSpace cache: 8192 total in 1 blocks; 2096 free (0 chunks); 6096 used Type information cache: 24352 total in 2 blocks; 2624 free (0 chunks); 21728 used pgstat TabStatusArray lookup hash table: 8192 total in 1 blocks; 416 free (0 chunks); 7776 used TopTransactionContext: 8192 total in 1 blocks; 6680 free (0 chunks); 1512 used RowDescriptionContext: 8192 total in 1 blocks; 6896 free (0 chunks); 1296 used MessageContext: 32768 total in 3 blocks; 13488 free (10 chunks); 19280 used Operator class cache: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used smgr relation table: 32768 total in 3 blocks; 16832 free (8 chunks); 15936 used TransactionAbortContext: 32768 total in 1 blocks; 32512 free (0 chunks); 256 used Portal hash: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used TopPortalContext: 8192 total in 1 blocks; 7664 free (0 chunks); 528 used   PortalHoldContext: 24632 total in 2 blocks; 7392 free (0 chunks); 17240 used   PortalContext: 1482752 total in 184 blocks; 11216 free (9 chunks); 1471536 used:     ExecutorState: 2361536 total in 27 blocks; 1827536 free (3163 chunks); 534000 used       TupleSort main: 3957712 total in 22 blocks; 246792 free (39 chunks); 3710920 used       TupleSort main: 4219912 total in 23 blocks; 246792 free (39 chunks); 3973120 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashTableContext: 8192 total in 1 blocks; 7336 free (6 chunks); 856 used         HashBatchContext: 2523874568 total in 76816 blocks; 7936 free (0 chunks); 2523866632 used       TupleSort main: 41016 total in 3 blocks; 6504 free (6 chunks); 34512 used         Caller tuples: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used Relcache by OID: 16384 total in 2 blocks; 3512 free (2 chunks); 12872 used CacheMemoryContext: 1101328 total in 14 blocks; 288672 free (1 chunks); 812656 used   index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_toast_2619_index   index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: entity_id_fkidx
...   index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_attribute_relid_attnum_index   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_class_oid_index WAL record construction: 49768 total in 2 blocks; 6368 free (0 chunks); 43400 used PrivateRefCount: 8192 total in 1 blocks; 2624 free (0 chunks); 5568 used MdSmgr: 8192 total in 1 blocks; 6176 free (1 chunks); 2016 used LOCALLOCK hash: 16384 total in 2 blocks; 4600 free (2 chunks); 11784 used Timezones: 104120 total in 2 blocks; 2624 free (0 chunks); 101496 used ErrorContext: 8192 total in 1 blocks; 7936 free (4 chunks); 256 used
Grand total: 2538218304 bytes in 77339 blocks; 3075256 free (3372 chunks); 2535143048 used
2019-04-21 05:27:07.731 UTC [968] ERROR:  out of memory
2019-04-21 05:27:07.731 UTC [968] DETAIL:  Failed on request of size 32800 in memory context "HashBatchContext".
2019-04-21 05:27:07.731 UTC [968] STATEMENT:  explain analyze select * from reports.v_BusinessOperation;

so we're ending up with the same problem.

No cigar. But lots of admiration and gratitude for all your attempts to pinpoint this.

Also, again, if anyone (of the trusted people) wants access to hack directly, I can provide.

regards,
-Gunther

Re: Out of Memory errors are frustrating as heck!

From

Justin Pryzby

Date:

21 April 2019, 08:08:22

On Sun, Apr 21, 2019 at 01:03:50AM -0400, Gunther wrote:
> On 4/20/2019 21:14, Tomas Vondra wrote:
> >Maybe. But before wasting any more time on the memory leak investigation,
> >I suggest you first try the patch moving the BufFile allocations to a
> >separate context. That'll either confirm or disprove the theory.
> 
> OK, fair enough. So, first patch 0001-* applied, recompiled and
> 
> 2019-04-21 04:08:04.364 UTC [11304] LOG:  server process (PID 11313) was terminated by signal 11: Segmentation fault
...
> turns out the MemoryContext is NULL:
> 
> (gdb) p context
> $1 = (MemoryContext) 0x0

I updated Tomas' patch to unconditionally set the context.
(Note, oldctx vs oldcxt is fairly subtle but I think deliberate?)

Justin

Attachment

v2-0001-move-BufFile-stuff-into-separate-context.patch

Re: Out of Memory errors are frustrating as heck!

From

Justin Pryzby

Date:

21 April 2019, 10:19:09

I was able to reproduce in a somewhat contrived way:

sh -c 'ulimit -v 1024000 ; /usr/local/pgsql/bin/postgres -D ./pg12dev5 -cport=1234' &

postgres=# SET work_mem='64kB';SET client_min_messages =debug1;SET log_statement_stats=on;explain(analyze) WITH v AS
MATERIALIZED(SELECT * FROM generate_series(1,9999999)i WHERE i%10<10 AND i%11<11 AND i%12<12 AND i%13<13 AND i%14<14
ANDi%15<15 AND i%16<16 AND i%17<17 AND i%18<18 AND i%19<19 AND i%20<20 AND i%21<21 ) SELECT * FROM
generate_series(1,99)kJOIN v ON k=i;
 

        HashTableContext: 8192 total in 1 blocks; 7696 free (7 chunks); 496 used
          hash batch files: 852065104 total in 101318 blocks; 951896 free (20 chunks); 851113208 used
          HashBatchContext: 73888 total in 4 blocks; 24280 free (6 chunks); 49608 used

2019-04-21 04:11:02.521 CDT [4156] ERROR:  out of memory
2019-04-21 04:11:02.521 CDT [4156] DETAIL:  Failed on request of size 8264 in memory context "hash batch files".

Re: Out of Memory errors are frustrating as heck!

From

Tomas Vondra

Date:

21 April 2019, 11:46:18

On Sun, Apr 21, 2019 at 03:08:22AM -0500, Justin Pryzby wrote:
>On Sun, Apr 21, 2019 at 01:03:50AM -0400, Gunther wrote:
>> On 4/20/2019 21:14, Tomas Vondra wrote:
>> >Maybe. But before wasting any more time on the memory leak investigation,
>> >I suggest you first try the patch moving the BufFile allocations to a
>> >separate context. That'll either confirm or disprove the theory.
>>
>> OK, fair enough. So, first patch 0001-* applied, recompiled and
>>
>> 2019-04-21 04:08:04.364 UTC [11304] LOG:  server process (PID 11313) was terminated by signal 11: Segmentation
fault
>...
>> turns out the MemoryContext is NULL:
>>
>> (gdb) p context
>> $1 = (MemoryContext) 0x0
>
>I updated Tomas' patch to unconditionally set the context.
>(Note, oldctx vs oldcxt is fairly subtle but I think deliberate?)
>

I don't follow - is there a typo confusing oldctx vs. oldcxt? I don't
think so, but I might have missed something. (I always struggle with which
spelling is the right one).

I think the bug is actually such simpler - the memory context was created
only in ExecuteIncreaseNumBatches() when starting with nbatch=1. But when
the initial nbatch value was higher (i.e. when starting with 2 or more
batches), it was left NULL. That was OK for testing with the contrived
data set, but it may easily break on other examples.

So here is an updated patch - hopefully this version works. I don't have
time to do much more testing now, though. But it compiles.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: Out of Memory errors are frustrating as heck!

From

Jeff Janes

Date:

21 April 2019, 13:05:16

On Sat, Apr 20, 2019 at 4:26 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> Considering how rare this issue likely is, we need to be looking for a
> solution that does not break the common case.

Agreed. What I think we need to focus on next is why the code keeps
increasing the number of batches. It seems like there must be an undue
amount of data all falling into the same bucket ... but if it were simply
a matter of a lot of duplicate hash keys, the growEnabled shutoff
heuristic ought to trigger.

The growEnabled stuff only prevents infinite loops. It doesn't prevent extreme silliness.

If a single 32 bit hash value has enough tuples by itself to not fit in work_mem, then it will keep splitting until that value is in a batch by itself before shutting off (or at least until the split-point bit of whatever else is in the that bucket happens to be the same value as the split-point-bit of the degenerate one, so by luck nothing or everything gets moved)

Probabilistically we keep splitting until every batch, other than the one containing the degenerate value, has about one tuple in it.

Cheers,

Jeff

Re: Out of Memory errors are frustrating as heck!

From

Tom Lane

Date:

21 April 2019, 14:36:43

Jeff Janes <jeff.janes@gmail.com> writes:
> The growEnabled stuff only prevents infinite loops.  It doesn't prevent
> extreme silliness.

> If a single 32 bit hash value has enough tuples by itself to not fit in
> work_mem, then it will keep splitting until that value is in a batch by
> itself before shutting off

Right, that's the code's intention.  If that's not good enough for this
case, we'll need to understand the details a bit better before we can
design a better(?) heuristic.

I suspect, however, that we might be better off just taking the existence
of the I/O buffers into account somehow while deciding whether it's worth
growing further.  That is, I'm imagining adding a second independent
reason for shutting off growEnabled, along the lines of "increasing
nbatch any further will require an unreasonable amount of buffer memory".
The question then becomes how to define "unreasonable".

            regards, tom lane

Re: Out of Memory errors are frustrating as heck!

From

Tomas Vondra

Date:

21 April 2019, 16:15:25

On Sun, Apr 21, 2019 at 10:36:43AM -0400, Tom Lane wrote:
>Jeff Janes <jeff.janes@gmail.com> writes:
>> The growEnabled stuff only prevents infinite loops.  It doesn't prevent
>> extreme silliness.
>
>> If a single 32 bit hash value has enough tuples by itself to not fit in
>> work_mem, then it will keep splitting until that value is in a batch by
>> itself before shutting off
>
>Right, that's the code's intention.  If that's not good enough for this
>case, we'll need to understand the details a bit better before we can
>design a better(?) heuristic.
>

I think we only disable growing when there are no other values in the
batch, but that seems rather easy to defeat - all you need is a single
tuple with a hash that falls into the same batch, and it's over. Maybe
we should make this a bit less accurate - say, if less than 5% memory
gets freed, don't add more batches.

>I suspect, however, that we might be better off just taking the existence
>of the I/O buffers into account somehow while deciding whether it's worth
>growing further.  That is, I'm imagining adding a second independent
>reason for shutting off growEnabled, along the lines of "increasing
>nbatch any further will require an unreasonable amount of buffer memory".
>The question then becomes how to define "unreasonable".
>

I think the question the code needs to be asking is "If we double the
number of batches, does the amount of memory we need drop?" And the
memory needs to account both for the buffers and per-batch data.

I don't think we can just stop increasing the number of batches when the
memory for BufFile exceeds work_mem, because that entirely ignores the
fact that by doing that we force the system to keep the per-batch stuff
in memory (and that can be almost arbitrary amount).

What I think we should be doing instead is instead make the threshold
dynamic - instead of just checking at work_mem, we need to increment the
number of batches when the total amount of memory exceeds

    Max(work_mem, 3 * memory_for_buffiles)

This is based on the observation that by increasing the number of
batches, we double memory_for_buffiles and split the per-batch data in
half. By adding more batches, we'd actually increase the amount of
memory used.

Of course, this just stops enforcing work_mem at some point, but it at
least attempts to minimize the amount of memory used.

An alternative would be spilling the extra tuples into a special
overflow file, as I explained earlier. That would actually enforce
work_mem I think.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Out of Memory errors are frustrating as heck!

From

Justin Pryzby

Date:

21 April 2019, 16:40:22

On Sun, Apr 21, 2019 at 10:36:43AM -0400, Tom Lane wrote:
> Jeff Janes <jeff.janes@gmail.com> writes:
> > The growEnabled stuff only prevents infinite loops.  It doesn't prevent
> > extreme silliness.
> 
> > If a single 32 bit hash value has enough tuples by itself to not fit in
> > work_mem, then it will keep splitting until that value is in a batch by
> > itself before shutting off
> 
> I suspect, however, that we might be better off just taking the existence
> of the I/O buffers into account somehow while deciding whether it's worth
> growing further.  That is, I'm imagining adding a second independent
> reason for shutting off growEnabled, along the lines of "increasing
> nbatch any further will require an unreasonable amount of buffer memory".
> The question then becomes how to define "unreasonable".

On Sun, Apr 21, 2019 at 06:15:25PM +0200, Tomas Vondra wrote:
> I think the question the code needs to be asking is "If we double the
> number of batches, does the amount of memory we need drop?" And the
> memory needs to account both for the buffers and per-batch data.
> 
> I don't think we can just stop increasing the number of batches when the
> memory for BufFile exceeds work_mem, because that entirely ignores the
> fact that by doing that we force the system to keep the per-batch stuff
> in memory (and that can be almost arbitrary amount).
...
> Of course, this just stops enforcing work_mem at some point, but it at
> least attempts to minimize the amount of memory used.

This patch defines reasonable as "additional BatchFiles will not themselves
exceed work_mem; OR, exceeded work_mem already but additional BatchFiles are
going to save us RAM"...

I think the first condition is insensitive and not too important to get right,
it only allows work_mem to be exceeded by 2x, which maybe already happens for
multiple reasons, related to this thread and otherwise.  It'd be fine to slap
on a factor of /2 or /4 or /8 there too. 

The current patch doesn't unset growEnabled, since there's no point at which
the hashtable should grow without bound: if hash tables are *already* exceeding
work_mem by 2x as big, nbatches should be doubled.

Justin

Attachment

v1-0001-account-for-size-of-BatchFile-structure-in-hashJo.patch

Re: Out of Memory errors are frustrating as heck!

From

Tom Lane

Date:

21 April 2019, 23:07:53

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> On Sun, Apr 21, 2019 at 10:36:43AM -0400, Tom Lane wrote:
>> Jeff Janes <jeff.janes@gmail.com> writes:
>>> If a single 32 bit hash value has enough tuples by itself to not fit in
>>> work_mem, then it will keep splitting until that value is in a batch by
>>> itself before shutting off

>> Right, that's the code's intention.  If that's not good enough for this
>> case, we'll need to understand the details a bit better before we can
>> design a better(?) heuristic.

> I think we only disable growing when there are no other values in the
> batch, but that seems rather easy to defeat - all you need is a single
> tuple with a hash that falls into the same batch, and it's over. Maybe
> we should make this a bit less accurate - say, if less than 5% memory
> gets freed, don't add more batches.

Yeah, something like that, but it's hard to design it without seeing some
concrete misbehaving examples.

I think though that this is somewhat independent of the problem that we're
not including the I/O buffers in our reasoning about memory consumption.

> An alternative would be spilling the extra tuples into a special
> overflow file, as I explained earlier. That would actually enforce
> work_mem I think.

Well, no, it won't.  If you have umpteen gigabytes of RHS tuples with the
exact same hash code, nothing we can do here is going to prevent you from
having to process those in a single table load.  (If it's a plain inner
join, maybe you could break that into subsections anyway ... but that
won't work for left or full joins where you need per-tuple match status.)

I think our ambition here should just be to not have the code go crazy
trying to keep its memory consumption under work_mem when it's ultimately
going to fail to do so anyhow.

            regards, tom lane

Re: Out of Memory errors are frustrating as heck!

From

Gunther

Date:

21 April 2019, 23:25:15

After applying Tomas' corrected patch 0001, and routing HJDEBUG messages to stderr:

integrator=# set enable_nestloop to off;
SET
integrator=# explain analyze select * from reports.v_BusinessOperation;
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 16 to 32
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 32 to 64
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 64 to 128
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 128 to 256
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 256 to 512
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 512 to 1024
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 1024 to 2048
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 2048 to 4096
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 4096 to 8192
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 8192 to 16384
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 16384 to 32768
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 32768 to 65536
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 65536 to 131072
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 131072 to 262144
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 262144 to 524288
ERROR:  out of memory
DETAIL:  Failed on request of size 32800 in memory context "HashBatchContext".

Now

TopMemoryContext: 4347672 total in 9 blocks; 41688 free (18 chunks); 4305984 used HandleParallelMessages: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used TableSpace cache: 8192 total in 1 blocks; 2096 free (0 chunks); 6096 used Type information cache: 24352 total in 2 blocks; 2624 free (0 chunks); 21728 used pgstat TabStatusArray lookup hash table: 8192 total in 1 blocks; 416 free (0 chunks); 7776 used TopTransactionContext: 8192 total in 1 blocks; 5416 free (2 chunks); 2776 used RowDescriptionContext: 8192 total in 1 blocks; 6896 free (0 chunks); 1296 used MessageContext: 32768 total in 3 blocks; 13488 free (10 chunks); 19280 used Operator class cache: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used smgr relation table: 32768 total in 3 blocks; 16832 free (8 chunks); 15936 used TransactionAbortContext: 32768 total in 1 blocks; 32512 free (0 chunks); 256 used Portal hash: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used TopPortalContext: 8192 total in 1 blocks; 7664 free (0 chunks); 528 used   PortalHoldContext: 24632 total in 2 blocks; 7392 free (0 chunks); 17240 used   PortalContext: 1482752 total in 184 blocks; 11216 free (8 chunks); 1471536 used:     ExecutorState: 2449896 total in 16 blocks; 1795000 free (3158 chunks); 654896 used       HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashTableContext: 16384 total in 2 blocks; 3008 free (6 chunks); 13376 used         HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashTableContext: 8192 total in 1 blocks; 7320 free (0 chunks); 872 used         HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashTableContext: 8192 total in 1 blocks; 7320 free (0 chunks); 872 used         HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashTableContext: 8192 total in 1 blocks; 7752 free (0 chunks); 440 used         HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used       HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashTableContext: 8192 total in 1 blocks; 7624 free (0 chunks); 568 used         HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used       TupleSort main: 1073512 total in 11 blocks; 246792 free (39 chunks); 826720 used       TupleSort main: 286912 total in 8 blocks; 246792 free (39 chunks); 40120 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashBatchFiles: 2242545904 total in 266270 blocks; 3996232 free (14164 chunks); 2238549672 used       HashTableContext: 8192 total in 1 blocks; 7624 free (5 chunks); 568 used         HashBatchContext: 168165080 total in 5118 blocks; 7936 free (0 chunks); 168157144 used       TupleSort main: 452880 total in 8 blocks; 126248 free (27 chunks); 326632 used         Caller tuples: 1048576 total in 8 blocks; 21608 free (14 chunks); 1026968 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used Relcache by OID: 16384 total in 2 blocks; 3512 free (2 chunks); 12872 used CacheMemoryContext: 1101328 total in 14 blocks; 288672 free (1 chunks); 812656 used   index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_toast_2619_index   index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: entity_id_fkidx
...   index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_attribute_relid_attnum_index   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_class_oid_index WAL record construction: 49768 total in 2 blocks; 6368 free (0 chunks); 43400 used PrivateRefCount: 8192 total in 1 blocks; 2624 free (0 chunks); 5568 used MdSmgr: 8192 total in 1 blocks; 6560 free (1 chunks); 1632 used LOCALLOCK hash: 16384 total in 2 blocks; 4600 free (2 chunks); 11784 used Timezones: 104120 total in 2 blocks; 2624 free (0 chunks); 101496 used ErrorContext: 8192 total in 1 blocks; 7936 free (5 chunks); 256 used
Grand total: 2424300520 bytes in 271910 blocks; 7332360 free (17596 chunks); 2416968160 used
2019-04-21 19:50:21.338 UTC [6974] ERROR:  out of memory
2019-04-21 19:50:21.338 UTC [6974] DETAIL:  Failed on request of size 32800 in memory context "HashBatchContext".
2019-04-21 19:50:21.338 UTC [6974] STATEMENT:  explain analyze select * from reports.v_BusinessOperation;

Next I'll apply Tomas' corrected 0002 patch on top of this and see ...

-Gunther

Re: Out of Memory errors are frustrating as heck!

From

Tomas Vondra

Date:

21 April 2019, 23:51:02

On Sun, Apr 21, 2019 at 07:25:15PM -0400, Gunther wrote:
>   After applying Tomas' corrected patch 0001, and routing HJDEBUG messages
>   to stderr:
>
> integrator=# set enable_nestloop to off;
> SET
> integrator=# explain analyze select * from reports.v_BusinessOperation;
>
> ...
> WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 131072 to 262144
> WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 262144 to 524288
> ERROR:  out of memory
> DETAIL:  Failed on request of size 32800 in memory context "HashBatchContext".
>
>   Now 
>
> TopMemoryContext: 4347672 total in 9 blocks; 41688 free (18 chunks); 4305984 used
> ...
>   Portal hash: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used
>   TopPortalContext: 8192 total in 1 blocks; 7664 free (0 chunks); 528 used
>     PortalHoldContext: 24632 total in 2 blocks; 7392 free (0 chunks); 17240 used
>     PortalContext: 1482752 total in 184 blocks; 11216 free (8 chunks); 1471536 used:
>       ExecutorState: 2449896 total in 16 blocks; 1795000 free (3158 chunks); 654896 used
>         TupleSort main: 286912 total in 8 blocks; 246792 free (39 chunks); 40120 used
>         ...
>         ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
>         HashBatchFiles: 2242545904 total in 266270 blocks; 3996232 free (14164 chunks); 2238549672 used
>         HashTableContext: 8192 total in 1 blocks; 7624 free (5 chunks); 568 used
>           HashBatchContext: 168165080 total in 5118 blocks; 7936 free (0 chunks); 168157144 used
>         TupleSort main: 452880 total in 8 blocks; 126248 free (27 chunks); 326632 used
>           Caller tuples: 1048576 total in 8 blocks; 21608 free (14 chunks); 1026968 used
> ...
> Grand total: 2424300520 bytes in 271910 blocks; 7332360 free (17596 chunks); 2416968160 used
>

IMO this pretty much proves that the memory allocated for BufFile really
is the root cause of the issues with this query. 524288 batches means
up to 1048576 BufFiles, which is a bit more than ~8GB of RAM. However
those for the inner relation were not allycated yet, so at this point
only about 4GB might be allocated. And it seems ~1/2 of them did not
receive any tuples, so only about 2GB got allocated so far.

The second batch will probably make it fail much sooner, because it
allocates the BufFile stuff eagerly (both for inner and outer side).

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Out of Memory errors are frustrating as heck!

From

Gunther

Date:

21 April 2019, 23:58:49

After applying Tomas' patch 0002 as corrected, over 0001, same thing:

integrator=# set enable_nestloop to off;
SET
integrator=# explain analyze select * from reports.v_BusinessOperation;
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 16 to 32
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 32 to 64
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 64 to 128
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 128 to 256
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 256 to 512
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 512 to 1024
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 1024 to 2048
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 2048 to 4096
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 4096 to 8192
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 8192 to 16384
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 16384 to 32768
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 32768 to 65536
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 65536 to 131072
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 131072 to 262144
ERROR:  out of memory
DETAIL:  Failed on request of size 8272 in memory context "HashBatchFiles".

And from the log:

2019-04-21 23:29:33.497 UTC [8890] LOG:  database system was shut down at 2019-04-21 23:29:33 UTC
2019-04-21 23:29:33.507 UTC [8888] LOG:  database system is ready to accept connections
Hashjoin 0x1732b88: initial nbatch = 16, nbuckets = 8192
2019-04-21 23:31:54.447 UTC [8896] WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 16 to 32
2019-04-21 23:31:54.447 UTC [8896] LOG:  ExecHashIncreaseNumBatches ======= context stats start =======
2019-04-21 23:31:54.447 UTC [8896] STATEMENT:  explain analyze select * from reports.v_BusinessOperation;
TopMemoryContext: 120544 total in 7 blocks; 10016 free (6 chunks); 110528 used HandleParallelMessages: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used TableSpace cache: 8192 total in 1 blocks; 2096 free (0 chunks); 6096 used Type information cache: 24352 total in 2 blocks; 2624 free (0 chunks); 21728 used pgstat TabStatusArray lookup hash table: 8192 total in 1 blocks; 416 free (0 chunks); 7776 used TopTransactionContext: 8192 total in 1 blocks; 6680 free (0 chunks); 1512 used RowDescriptionContext: 8192 total in 1 blocks; 6896 free (0 chunks); 1296 used MessageContext: 32768 total in 3 blocks; 13488 free (10 chunks); 19280 used Operator class cache: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used smgr relation table: 32768 total in 3 blocks; 16832 free (8 chunks); 15936 used TransactionAbortContext: 32768 total in 1 blocks; 32512 free (0 chunks); 256 used Portal hash: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used TopPortalContext: 8192 total in 1 blocks; 7664 free (0 chunks); 528 used   PortalHoldContext: 24632 total in 2 blocks; 7392 free (0 chunks); 17240 used   PortalContext: 1482752 total in 184 blocks; 11216 free (8 chunks); 1471536 used:     ExecutorState: 647368 total in 10 blocks; 197536 free (13 chunks); 449832 used       TupleSort main: 4219912 total in 23 blocks; 246792 free (39 chunks); 3973120 used       TupleSort main: 4219912 total in 23 blocks; 246792 free (39 chunks); 3973120 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashBatchFiles: 258032 total in 31 blocks; 6208 free (0 chunks); 251824 used       HashTableContext: 8192 total in 1 blocks; 7624 free (5 chunks); 568 used         HashBatchContext: 4279352 total in 130 blocks; 7936 free (0 chunks); 4271416 used       TupleSort main: 41016 total in 3 blocks; 6504 free (6 chunks); 34512 used         Caller tuples: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
...       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used Relcache by OID: 16384 total in 2 blocks; 3512 free (2 chunks); 12872 used CacheMemoryContext: 1101328 total in 14 blocks; 288672 free (1 chunks); 812656 used   index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_toast_2619_index   index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: entity_id_fkidx
...   index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_attribute_relid_attnum_index   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_class_oid_index WAL record construction: 49768 total in 2 blocks; 6368 free (0 chunks); 43400 used PrivateRefCount: 8192 total in 1 blocks; 2624 free (0 chunks); 5568 used MdSmgr: 8192 total in 1 blocks; 6560 free (1 chunks); 1632 used LOCALLOCK hash: 16384 total in 2 blocks; 4600 free (2 chunks); 11784 used Timezones: 104120 total in 2 blocks; 2624 free (0 chunks); 101496 used ErrorContext: 8192 total in 1 blocks; 7936 free (5 chunks); 256 used
Grand total: 17429152 bytes in 668 blocks; 1452392 free (220 chunks); 15976760 used
Hashjoin 0x1732b88: increasing nbatch to 32 because space = 4128933
Hashjoin 0x1732b88: freed 148 of 10584 tuples, space now 4071106
2019-04-21 23:31:54.450 UTC [8896] LOG:  ExecHashIncreaseNumBatches ======= context stats end =======
2019-04-21 23:31:54.450 UTC [8896] STATEMENT:  explain analyze select * from reports.v_BusinessOperation;
TopMemoryContext: 120544 total in 7 blocks; 9760 free (7 chunks); 110784 used HandleParallelMessages: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used TableSpace cache: 8192 total in 1 blocks; 2096 free (0 chunks); 6096 used Type information cache: 24352 total in 2 blocks; 2624 free (0 chunks); 21728 used pgstat TabStatusArray lookup hash table: 8192 total in 1 blocks; 416 free (0 chunks); 7776 used TopTransactionContext: 8192 total in 1 blocks; 6680 free (0 chunks); 1512 used RowDescriptionContext: 8192 total in 1 blocks; 6896 free (0 chunks); 1296 used MessageContext: 32768 total in 3 blocks; 13488 free (10 chunks); 19280 used Operator class cache: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used smgr relation table: 32768 total in 3 blocks; 16832 free (8 chunks); 15936 used TransactionAbortContext: 32768 total in 1 blocks; 32512 free (0 chunks); 256 used Portal hash: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used TopPortalContext: 8192 total in 1 blocks; 7664 free (0 chunks); 528 used   PortalHoldContext: 24632 total in 2 blocks; 7392 free (0 chunks); 17240 used   PortalContext: 1482752 total in 184 blocks; 11216 free (8 chunks); 1471536 used:     ExecutorState: 647368 total in 10 blocks; 197536 free (13 chunks); 449832 used       TupleSort main: 4219912 total in 23 blocks; 246792 free (39 chunks); 3973120 used       TupleSort main: 4219912 total in 23 blocks; 246792 free (39 chunks); 3973120 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashBatchFiles: 524528 total in 63 blocks; 4416 free (2 chunks); 520112 used       HashTableContext: 8192 total in 1 blocks; 7624 free (5 chunks); 568 used         HashBatchContext: 4213640 total in 128 blocks; 7936 free (0 chunks); 4205704 used       TupleSort main: 41016 total in 3 blocks; 6504 free (6 chunks); 34512 used         Caller tuples: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
...       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used Relcache by OID: 16384 total in 2 blocks; 3512 free (2 chunks); 12872 used CacheMemoryContext: 1101328 total in 14 blocks; 288672 free (1 chunks); 812656 used   index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_toast_2619_index   index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: entity_id_fkidx
...   index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_attribute_relid_attnum_index   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_class_oid_index WAL record construction: 49768 total in 2 blocks; 6368 free (0 chunks); 43400 used PrivateRefCount: 8192 total in 1 blocks; 2624 free (0 chunks); 5568 used MdSmgr: 8192 total in 1 blocks; 6560 free (1 chunks); 1632 used LOCALLOCK hash: 16384 total in 2 blocks; 4600 free (2 chunks); 11784 used Timezones: 104120 total in 2 blocks; 2624 free (0 chunks); 101496 used ErrorContext: 8192 total in 1 blocks; 7936 free (5 chunks); 256 used
Grand total: 17629936 bytes in 698 blocks; 1450344 free (223 chunks); 16179592 used
2019-04-21 23:31:54.452 UTC [8896] WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 32 to 64
2019-04-21 23:31:54.452 UTC [8896] LOG:  ExecHashIncreaseNumBatches ======= context stats start =======
2019-04-21 23:31:54.452 UTC [8896] STATEMENT:  explain analyze select * from reports.v_BusinessOperation;
TopMemoryContext: 120544 total in 7 blocks; 9760 free (7 chunks); 110784 used HandleParallelMessages: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used TableSpace cache: 8192 total in 1 blocks; 2096 free (0 chunks); 6096 used Type information cache: 24352 total in 2 blocks; 2624 free (0 chunks); 21728 used pgstat TabStatusArray lookup hash table: 8192 total in 1 blocks; 416 free (0 chunks); 7776 used TopTransactionContext: 8192 total in 1 blocks; 6680 free (0 chunks); 1512 used RowDescriptionContext: 8192 total in 1 blocks; 6896 free (0 chunks); 1296 used MessageContext: 32768 total in 3 blocks; 13488 free (10 chunks); 19280 used Operator class cache: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used smgr relation table: 32768 total in 3 blocks; 16832 free (8 chunks); 15936 used TransactionAbortContext: 32768 total in 1 blocks; 32512 free (0 chunks); 256 used Portal hash: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used TopPortalContext: 8192 total in 1 blocks; 7664 free (0 chunks); 528 used   PortalHoldContext: 24632 total in 2 blocks; 7392 free (0 chunks); 17240 used   PortalContext: 1482752 total in 184 blocks; 11216 free (8 chunks); 1471536 used:     ExecutorState: 647368 total in 10 blocks; 197536 free (13 chunks); 449832 used       TupleSort main: 4219912 total in 23 blocks; 246792 free (39 chunks); 3973120 used       TupleSort main: 4219912 total in 23 blocks; 246792 free (39 chunks); 3973120 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashBatchFiles: 524528 total in 63 blocks; 4416 free (2 chunks); 520112 used       HashTableContext: 8192 total in 1 blocks; 7624 free (5 chunks); 568 used         HashBatchContext: 4279352 total in 130 blocks; 7936 free (0 chunks); 4271416 used       TupleSort main: 41016 total in 3 blocks; 6504 free (6 chunks); 34512 used         Caller tuples: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used

...       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used Relcache by OID: 16384 total in 2 blocks; 3512 free (2 chunks); 12872 used CacheMemoryContext: 1101328 total in 14 blocks; 288672 free (1 chunks); 812656 used   index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_toast_2619_index   index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: entity_id_fkidx
...   index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_attribute_relid_attnum_index   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_class_oid_index WAL record construction: 49768 total in 2 blocks; 6368 free (0 chunks); 43400 used PrivateRefCount: 8192 total in 1 blocks; 2624 free (0 chunks); 5568 used MdSmgr: 8192 total in 1 blocks; 6560 free (1 chunks); 1632 used LOCALLOCK hash: 16384 total in 2 blocks; 4600 free (2 chunks); 11784 used Timezones: 104120 total in 2 blocks; 2624 free (0 chunks); 101496 used ErrorContext: 8192 total in 1 blocks; 7936 free (5 chunks); 256 used
Grand total: 17695648 bytes in 700 blocks; 1450344 free (223 chunks); 16245304 used
Hashjoin 0x1732b88: increasing nbatch to 64 because space = 4128826
Hashjoin 0x1732b88: freed 544 of 10584 tuples, space now 3916296
2019-04-21 23:31:54.456 UTC [8896] LOG:  ExecHashIncreaseNumBatches ======= context stats end =======
2019-04-21 23:31:54.456 UTC [8896] STATEMENT:  explain analyze select * from reports.v_BusinessOperation;
TopMemoryContext: 120544 total in 7 blocks; 8224 free (7 chunks); 112320 used HandleParallelMessages: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used TableSpace cache: 8192 total in 1 blocks; 2096 free (0 chunks); 6096 used Type information cache: 24352 total in 2 blocks; 2624 free (0 chunks); 21728 used pgstat TabStatusArray lookup hash table: 8192 total in 1 blocks; 416 free (0 chunks); 7776 used TopTransactionContext: 8192 total in 1 blocks; 6680 free (0 chunks); 1512 used RowDescriptionContext: 8192 total in 1 blocks; 6896 free (0 chunks); 1296 used MessageContext: 32768 total in 3 blocks; 13488 free (10 chunks); 19280 used Operator class cache: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used smgr relation table: 32768 total in 3 blocks; 16832 free (8 chunks); 15936 used TransactionAbortContext: 32768 total in 1 blocks; 32512 free (0 chunks); 256 used Portal hash: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used TopPortalContext: 8192 total in 1 blocks; 7664 free (0 chunks); 528 used   PortalHoldContext: 24632 total in 2 blocks; 7392 free (0 chunks); 17240 used   PortalContext: 1482752 total in 184 blocks; 11216 free (8 chunks); 1471536 used:     ExecutorState: 647368 total in 10 blocks; 197536 free (13 chunks); 449832 used       TupleSort main: 4219912 total in 23 blocks; 246792 free (39 chunks); 3973120 used       TupleSort main: 4219912 total in 23 blocks; 246792 free (39 chunks); 3973120 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashBatchFiles: 1057520 total in 127 blocks; 832 free (4 chunks); 1056688 used       HashTableContext: 8192 total in 1 blocks; 7624 free (5 chunks); 568 used         HashBatchContext: 4049360 total in 123 blocks; 7936 free (0 chunks); 4041424 used       TupleSort main: 41016 total in 3 blocks; 6504 free (6 chunks); 34512 used         Caller tuples: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
...       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used Relcache by OID: 16384 total in 2 blocks; 3512 free (2 chunks); 12872 used CacheMemoryContext: 1101328 total in 14 blocks; 288672 free (1 chunks); 812656 used   index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_toast_2619_index   index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: entity_id_fkidx
...   index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_attribute_relid_attnum_index   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_class_oid_index WAL record construction: 49768 total in 2 blocks; 6368 free (0 chunks); 43400 used PrivateRefCount: 8192 total in 1 blocks; 2624 free (0 chunks); 5568 used MdSmgr: 8192 total in 1 blocks; 6560 free (1 chunks); 1632 used LOCALLOCK hash: 16384 total in 2 blocks; 4600 free (2 chunks); 11784 used Timezones: 104120 total in 2 blocks; 2624 free (0 chunks); 101496 used ErrorContext: 8192 total in 1 blocks; 7936 free (5 chunks); 256 used
Grand total: 17998648 bytes in 757 blocks; 1445224 free (225 chunks); 16553424 used
2019-04-21 23:31:54.459 UTC [8896] WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 64 to 128
2019-04-21 23:31:54.459 UTC [8896] LOG:  ExecHashIncreaseNumBatches ======= context stats start =======
2019-04-21 23:31:54.459 UTC [8896] STATEMENT:  explain analyze select * from reports.v_BusinessOperation;
TopMemoryContext: 120544 total in 7 blocks; 8224 free (7 chunks); 112320 used HandleParallelMessages: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used TableSpace cache: 8192 total in 1 blocks; 2096 free (0 chunks); 6096 used Type information cache: 24352 total in 2 blocks; 2624 free (0 chunks); 21728 used pgstat TabStatusArray lookup hash table: 8192 total in 1 blocks; 416 free (0 chunks); 7776 used TopTransactionContext: 8192 total in 1 blocks; 6680 free (0 chunks); 1512 used RowDescriptionContext: 8192 total in 1 blocks; 6896 free (0 chunks); 1296 used MessageContext: 32768 total in 3 blocks; 13488 free (10 chunks); 19280 used Operator class cache: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used smgr relation table: 32768 total in 3 blocks; 16832 free (8 chunks); 15936 used TransactionAbortContext: 32768 total in 1 blocks; 32512 free (0 chunks); 256 used Portal hash: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used TopPortalContext: 8192 total in 1 blocks; 7664 free (0 chunks); 528 used   PortalHoldContext: 24632 total in 2 blocks; 7392 free (0 chunks); 17240 used   PortalContext: 1482752 total in 184 blocks; 11216 free (8 chunks); 1471536 used:     ExecutorState: 647368 total in 10 blocks; 197536 free (13 chunks); 449832 used       TupleSort main: 4219912 total in 23 blocks; 246792 free (39 chunks); 3973120 used       TupleSort main: 4219912 total in 23 blocks; 246792 free (39 chunks); 3973120 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashBatchFiles: 1057520 total in 127 blocks; 832 free (4 chunks); 1056688 used       HashTableContext: 8192 total in 1 blocks; 7624 free (5 chunks); 568 used         HashBatchContext: 4279352 total in 130 blocks; 7936 free (0 chunks); 4271416 used       TupleSort main: 41016 total in 3 blocks; 6504 free (6 chunks); 34512 used         Caller tuples: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HandleParallelMessages: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used TableSpace cache: 8192 total in 1 blocks; 2096 free (0 chunks); 6096 used Type information cache: 24352 total in 2 blocks; 2624 free (0 chunks); 21728 used pgstat TabStatusArray lookup hash table: 8192 total in 1 blocks; 416 free (0 chunks); 7776 used TopTransactionContext: 8192 total in 1 blocks; 6680 free (0 chunks); 1512 used RowDescriptionContext: 8192 total in 1 blocks; 6896 free (0 chunks); 1296 used MessageContext: 32768 total in 3 blocks; 13488 free (10 chunks); 19280 used Operator class cache: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used smgr relation table: 32768 total in 3 blocks; 16832 free (8 chunks); 15936 used TransactionAbortContext: 32768 total in 1 blocks; 32512 free (0 chunks); 256 used Portal hash: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used TopPortalContext: 8192 total in 1 blocks; 7664 free (0 chunks); 528 used   PortalHoldContext: 24632 total in 2 blocks; 7392 free (0 chunks); 17240 used   PortalContext: 1482752 total in 184 blocks; 11216 free (8 chunks); 1471536 used:     ExecutorState: 647368 total in 10 blocks; 197536 free (13 chunks); 449832 used       TupleSort main: 4219912 total in 23 blocks; 246792 free (39 chunks); 3973120 used       TupleSort main: 4219912 total in 23 blocks; 246792 free (39 chunks); 3973120 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashBatchFiles: 1057520 total in 127 blocks; 832 free (4 chunks); 1056688 used       HashTableContext: 8192 total in 1 blocks; 7624 free (5 chunks); 568 used         HashBatchContext: 4279352 total in 130 blocks; 7936 free (0 chunks); 4271416 used       TupleSort main: 41016 total in 3 blocks; 6504 free (6 chunks); 34512 used         Caller tuples: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
...       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used Relcache by OID: 16384 total in 2 blocks; 3512 free (2 chunks); 12872 used CacheMemoryContext: 1101328 total in 14 blocks; 288672 free (1 chunks); 812656 used   index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_toast_2619_index   index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: entity_id_fkidx
...   index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_attribute_relid_attnum_index   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_class_oid_index WAL record construction: 49768 total in 2 blocks; 6368 free (0 chunks); 43400 used PrivateRefCount: 8192 total in 1 blocks; 2624 free (0 chunks); 5568 used MdSmgr: 8192 total in 1 blocks; 6560 free (1 chunks); 1632 used LOCALLOCK hash: 16384 total in 2 blocks; 4600 free (2 chunks); 11784 used Timezones: 104120 total in 2 blocks; 2624 free (0 chunks); 101496 used ErrorContext: 8192 total in 1 blocks; 7936 free (5 chunks); 256 used
Grand total: 18228640 bytes in 764 blocks; 1445224 free (225 chunks); 16783416 used
Hashjoin 0x1732b88: increasing nbatch to 128 because space = 4128846
Hashjoin 0x1732b88: freed 10419 of 10585 tuples, space now 65570
2019-04-21 23:31:54.466 UTC [8896] LOG:  ExecHashIncreaseNumBatches ======= context stats end =======
2019-04-21 23:31:54.466 UTC [8896] STATEMENT:  explain analyze select * from reports.v_BusinessOperation;
TopMemoryContext: 120544 total in 7 blocks; 6176 free (8 chunks); 114368 used HandleParallelMessages: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used TableSpace cache: 8192 total in 1 blocks; 2096 free (0 chunks); 6096 used Type information cache: 24352 total in 2 blocks; 2624 free (0 chunks); 21728 used pgstat TabStatusArray lookup hash table: 8192 total in 1 blocks; 416 free (0 chunks); 7776 used TopTransactionContext: 8192 total in 1 blocks; 6680 free (0 chunks); 1512 used RowDescriptionContext: 8192 total in 1 blocks; 6896 free (0 chunks); 1296 used MessageContext: 32768 total in 3 blocks; 13488 free (10 chunks); 19280 used Operator class cache: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used smgr relation table: 32768 total in 3 blocks; 16832 free (8 chunks); 15936 used TransactionAbortContext: 32768 total in 1 blocks; 32512 free (0 chunks); 256 used Portal hash: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used TopPortalContext: 8192 total in 1 blocks; 7664 free (0 chunks); 528 used   PortalHoldContext: 24632 total in 2 blocks; 7392 free (0 chunks); 17240 used
...
I notice now you have tons of these memory map dumps, tell me what you want to see and I will grep it out for you.
I guess the Hashjoin related things .... OK, last one before the out of memory now, and then I give you some grepped stuff...
...
TopMemoryContext: 4347672 total in 9 blocks; 41608 free (18 chunks); 4306064 used HandleParallelMessages: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used TableSpace cache: 8192 total in 1 blocks; 2096 free (0 chunks); 6096 used Type information cache: 24352 total in 2 blocks; 2624 free (0 chunks); 21728 used pgstat TabStatusArray lookup hash table: 8192 total in 1 blocks; 416 free (0 chunks); 7776 used TopTransactionContext: 8192 total in 1 blocks; 5416 free (2 chunks); 2776 used RowDescriptionContext: 8192 total in 1 blocks; 6896 free (0 chunks); 1296 used MessageContext: 32768 total in 3 blocks; 13488 free (10 chunks); 19280 used Operator class cache: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used smgr relation table: 32768 total in 3 blocks; 16832 free (8 chunks); 15936 used TransactionAbortContext: 32768 total in 1 blocks; 32512 free (0 chunks); 256 used Portal hash: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used TopPortalContext: 8192 total in 1 blocks; 7664 free (0 chunks); 528 used   PortalHoldContext: 24632 total in 2 blocks; 7392 free (0 chunks); 17240 used   PortalContext: 1482752 total in 184 blocks; 11216 free (8 chunks); 1471536 used:     ExecutorState: 2449896 total in 16 blocks; 1794968 free (3158 chunks); 654928 used       HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashTableContext: 32768 total in 3 blocks; 17304 free (9 chunks); 15464 used         HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashTableContext: 8192 total in 1 blocks; 7320 free (0 chunks); 872 used         HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashTableContext: 8192 total in 1 blocks; 7320 free (0 chunks); 872 used         HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashTableContext: 8192 total in 1 blocks; 7752 free (0 chunks); 440 used         HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used       HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashTableContext: 8192 total in 1 blocks; 7624 free (0 chunks); 568 used         HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used       TupleSort main: 286912 total in 8 blocks; 246792 free (39 chunks); 40120 used       TupleSort main: 286912 total in 8 blocks; 246792 free (39 chunks); 40120 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashBatchFiles: 2468537520 total in 293910 blocks; 2669512 free (14 chunks); 2465868008 used       HashTableContext: 8192 total in 1 blocks; 7624 free (5 chunks); 568 used         HashBatchContext: 4279352 total in 130 blocks; 7936 free (0 chunks); 4271416 used       TupleSort main: 256272 total in 6 blocks; 36424 free (15 chunks); 219848 used         Caller tuples: 2097152 total in 9 blocks; 929696 free (17 chunks); 1167456 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
...       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used Relcache by OID: 16384 total in 2 blocks; 3512 free (2 chunks); 12872 used CacheMemoryContext: 1101328 total in 14 blocks; 288672 free (1 chunks); 812656 used   index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_toast_2619_index   index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: entity_id_fkidx
...   index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_attribute_relid_attnum_index   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_class_oid_index WAL record construction: 49768 total in 2 blocks; 6368 free (0 chunks); 43400 used PrivateRefCount: 8192 total in 1 blocks; 2624 free (0 chunks); 5568 used MdSmgr: 8192 total in 1 blocks; 6560 free (1 chunks); 1632 used LOCALLOCK hash: 16384 total in 2 blocks; 4600 free (2 chunks); 11784 used Timezones: 104120 total in 2 blocks; 2624 free (0 chunks); 101496 used ErrorContext: 8192 total in 1 blocks; 7936 free (5 chunks); 256 used
Grand total: 2486488160 bytes in 294559 blocks; 6838088 free (3440 chunks); 2479650072 used
2019-04-21 23:40:23.118 UTC [8896] ERROR:  out of memory
2019-04-21 23:40:23.118 UTC [8896] DETAIL:  Failed on request of size 8272 in memory context "HashBatchFiles".
2019-04-21 23:40:23.118 UTC [8896] STATEMENT:  explain analyze select * from reports.v_BusinessOperation;
2019-04-21 23:40:23.119 UTC [8896] LOG:  could not open directory "base/pgsql_tmp/pgsql_tmp8896.2.sharedfileset": Cannot allocate memory

ok now here comes a summary grepped out:

grep 'Hash'

Hashjoin 0x1732b88: initial nbatch = 16, nbuckets = 8192
2019-04-21 23:31:54.447 UTC [8896] WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 16 to 32
2019-04-21 23:31:54.447 UTC [8896] LOG:  ExecHashIncreaseNumBatches ======= context stats start =======       HashBatchFiles: 258032 total in 31 blocks; 6208 free (0 chunks); 251824 used       HashTableContext: 8192 total in 1 blocks; 7624 free (5 chunks); 568 used         HashBatchContext: 4279352 total in 130 blocks; 7936 free (0 chunks); 4271416 used
Hashjoin 0x1732b88: increasing nbatch to 32 because space = 4128933
Hashjoin 0x1732b88: freed 148 of 10584 tuples, space now 4071106
2019-04-21 23:31:54.450 UTC [8896] LOG:  ExecHashIncreaseNumBatches ======= context stats end =======       HashBatchFiles: 524528 total in 63 blocks; 4416 free (2 chunks); 520112 used       HashTableContext: 8192 total in 1 blocks; 7624 free (5 chunks); 568 used         HashBatchContext: 4213640 total in 128 blocks; 7936 free (0 chunks); 4205704 used
2019-04-21 23:31:54.452 UTC [8896] WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 32 to 64
2019-04-21 23:31:54.452 UTC [8896] LOG:  ExecHashIncreaseNumBatches ======= context stats start =======       HashBatchFiles: 524528 total in 63 blocks; 4416 free (2 chunks); 520112 used       HashTableContext: 8192 total in 1 blocks; 7624 free (5 chunks); 568 used         HashBatchContext: 4279352 total in 130 blocks; 7936 free (0 chunks); 4271416 used
Hashjoin 0x1732b88: increasing nbatch to 64 because space = 4128826
Hashjoin 0x1732b88: freed 544 of 10584 tuples, space now 3916296
2019-04-21 23:31:54.456 UTC [8896] LOG:  ExecHashIncreaseNumBatches ======= context stats end =======       HashBatchFiles: 1057520 total in 127 blocks; 832 free (4 chunks); 1056688 used       HashTableContext: 8192 total in 1 blocks; 7624 free (5 chunks); 568 used         HashBatchContext: 4049360 total in 123 blocks; 7936 free (0 chunks); 4041424 used
2019-04-21 23:31:54.459 UTC [8896] WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 64 to 128
2019-04-21 23:31:54.459 UTC [8896] LOG:  ExecHashIncreaseNumBatches ======= context stats start =======       HashBatchFiles: 1057520 total in 127 blocks; 832 free (4 chunks); 1056688 used       HashTableContext: 8192 total in 1 blocks; 7624 free (5 chunks); 568 used         HashBatchContext: 4279352 total in 130 blocks; 7936 free (0 chunks); 4271416 used
Hashjoin 0x1732b88: increasing nbatch to 128 because space = 4128846
Hashjoin 0x1732b88: freed 10419 of 10585 tuples, space now 65570
2019-04-21 23:31:54.466 UTC [8896] LOG:  ExecHashIncreaseNumBatches ======= context stats end =======       HashBatchFiles: 2148080 total in 257 blocks; 18160 free (6 chunks); 2129920 used       HashTableContext: 8192 total in 1 blocks; 7624 free (5 chunks); 568 used         HashBatchContext: 172352 total in 5 blocks; 7936 free (0 chunks); 164416 used
2019-04-21 23:32:07.174 UTC [8896] WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 128 to 256
2019-04-21 23:32:07.174 UTC [8896] LOG:  ExecHashIncreaseNumBatches ======= context stats start =======       HashBatchFiles: 2148080 total in 257 blocks; 18160 free (6 chunks); 2129920 used       HashTableContext: 8192 total in 1 blocks; 7624 free (5 chunks); 568 used         HashBatchContext: 4312208 total in 131 blocks; 7936 free (0 chunks); 4304272 used
Hashjoin 0x1732b88: increasing nbatch to 256 because space = 4128829
Hashjoin 0x1732b88: freed 10308 of 10734 tuples, space now 161815
2019-04-21 23:32:07.183 UTC [8896] LOG:  ExecHashIncreaseNumBatches ======= context stats end =======       HashBatchFiles: 4312816 total in 514 blocks; 36552 free (8 chunks); 4276264 used       HashTableContext: 8192 total in 1 blocks; 7624 free (5 chunks); 568 used         HashBatchContext: 270920 total in 8 blocks; 7936 free (0 chunks); 262984 used
2019-04-21 23:32:18.865 UTC [8896] WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 256 to 512
2019-04-21 23:32:18.865 UTC [8896] LOG:  ExecHashIncreaseNumBatches ======= context stats start =======       HashBatchFiles: 4312816 total in 514 blocks; 36552 free (8 chunks); 4276264 used       HashTableContext: 8192 total in 1 blocks; 7624 free (5 chunks); 568 used         HashBatchContext: 4279352 total in 130 blocks; 7936 free (0 chunks); 4271416 used
Hashjoin 0x1732b88: increasing nbatch to 512 because space = 4128908
Hashjoin 0x1732b88: freed 398 of 10379 tuples, space now 3977787
2019-04-21 23:32:18.877 UTC [8896] LOG:  ExecHashIncreaseNumBatches ======= context stats end =======       HashBatchFiles: 8642288 total in 1027 blocks; 73376 free (10 chunks); 8568912 used       HashTableContext: 8192 total in 1 blocks; 7624 free (5 chunks); 568 used         HashBatchContext: 4147928 total in 126 blocks; 7936 free (0 chunks); 4139992 used
2019-04-21 23:32:18.880 UTC [8896] WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 512 to 1024
2019-04-21 23:32:18.880 UTC [8896] LOG:  ExecHashIncreaseNumBatches ======= context stats start =======       HashBatchFiles: 8642288 total in 1027 blocks; 73376 free (10 chunks); 8568912 used       HashTableContext: 8192 total in 1 blocks; 7624 free (5 chunks); 568 used         HashBatchContext: 4279352 total in 130 blocks; 7936 free (0 chunks); 4271416 used
Hashjoin 0x1732b88: increasing nbatch to 1024 because space = 4129008
Hashjoin 0x1732b88: freed 296 of 10360 tuples, space now 4013423
2019-04-21 23:32:18.903 UTC [8896] LOG:  ExecHashIncreaseNumBatches ======= context stats end =======       HashBatchFiles: 17301232 total in 2052 blocks; 147064 free (12 chunks); 17154168 used       HashTableContext: 8192 total in 1 blocks; 7624 free (5 chunks); 568 used         HashBatchContext: 4180784 total in 127 blocks; 7936 free (0 chunks); 4172848 used
2019-04-21 23:32:18.906 UTC [8896] WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 1024 to 2048
2019-04-21 23:32:18.906 UTC [8896] LOG:  ExecHashIncreaseNumBatches ======= context stats start =======       HashBatchFiles: 17301232 total in 2052 blocks; 147064 free (12 chunks); 17154168 used       HashTableContext: 8192 total in 1 blocks; 7624 free (5 chunks); 568 used         HashBatchContext: 4279352 total in 130 blocks; 7936 free (0 chunks); 4271416 used
Hashjoin 0x1732b88: increasing nbatch to 2048 because space = 4129133
Hashjoin 0x1732b88: freed 154 of 10354 tuples, space now 4068786
2019-04-21 23:32:18.946 UTC [8896] LOG:  ExecHashIncreaseNumBatches ======= context stats end =======       HashBatchFiles: 34389856 total in 4102 blocks; 65176 free (14 chunks); 34324680 used       HashTableContext: 8192 total in 1 blocks; 7624 free (5 chunks); 568 used         HashBatchContext: 4213640 total in 128 blocks; 7936 free (0 chunks); 4205704 used
2019-04-21 23:32:18.949 UTC [8896] WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 2048 to 4096
2019-04-21 23:32:18.949 UTC [8896] LOG:  ExecHashIncreaseNumBatches ======= context stats start =======       HashBatchFiles: 34389856 total in 4102 blocks; 65176 free (14 chunks); 34324680 used       HashTableContext: 8192 total in 1 blocks; 7624 free (5 chunks); 568 used         HashBatchContext: 4279352 total in 130 blocks; 7936 free (0 chunks); 4271416 used
Hashjoin 0x1732b88: increasing nbatch to 4096 because space = 4129035
Hashjoin 0x1732b88: freed 10167 of 10351 tuples, space now 72849
2019-04-21 23:32:19.032 UTC [8896] LOG:  ExecHashIncreaseNumBatches ======= context stats end =======       HashBatchFiles: 68796256 total in 8199 blocks; 130672 free (14 chunks); 68665584 used       HashTableContext: 8192 total in 1 blocks; 7624 free (5 chunks); 568 used         HashBatchContext: 172352 total in 5 blocks; 7936 free (0 chunks); 164416 used
Hashjoin 0x1b3d4c8: initial nbatch = 1, nbuckets = 1024
Hashjoin 0x1b3bc08: initial nbatch = 1, nbuckets = 1024
Hashjoin 0x1b3b5d8: initial nbatch = 1, nbuckets = 1024
Hashjoin 0x15499a8: initial nbatch = 1, nbuckets = 1024
Hashjoin 0x1549978: initial nbatch = 1, nbuckets = 1024
Hashjoin 0x1b3cc88: initial nbatch = 1, nbuckets = 65536
Hashjoin 0x1553638: initial nbatch = 1, nbuckets = 65536
Hashjoin 0x1553638: initial nbatch = 1, nbuckets = 65536
Hashjoin 0x1553d38: initial nbatch = 16, nbuckets = 65536
Hashjoin 0x1b3ad98: initial nbatch = 16, nbuckets = 65536
Hashjoin 0x15538d8: initial nbatch = 16, nbuckets = 65536
2019-04-21 23:40:06.495 UTC [8896] WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 4096 to 8192
2019-04-21 23:40:06.495 UTC [8896] LOG:  ExecHashIncreaseNumBatches ======= context stats start =======       HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashTableContext: 32768 total in 3 blocks; 17304 free (9 chunks); 15464 used         HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashTableContext: 8192 total in 1 blocks; 7320 free (0 chunks); 872 used         HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashTableContext: 8192 total in 1 blocks; 7320 free (0 chunks); 872 used         HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashTableContext: 8192 total in 1 blocks; 7752 free (0 chunks); 440 used         HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used       HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashTableContext: 8192 total in 1 blocks; 7624 free (0 chunks); 568 used         HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used       HashBatchFiles: 68663008 total in 8183 blocks; 131440 free (46 chunks); 68531568 used       HashTableContext: 8192 total in 1 blocks; 7624 free (5 chunks); 568 used         HashBatchContext: 4279352 total in 130 blocks; 7936 free (0 chunks); 4271416 used
Hashjoin 0x1732b88: increasing nbatch to 8192 because space = 4128997
Hashjoin 0x1732b88: freed 10555 of 10560 tuples, space now 1983
2019-04-21 23:40:06.680 UTC [8896] LOG:  ExecHashIncreaseNumBatches ======= context stats end =======       HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashTableContext: 32768 total in 3 blocks; 17304 free (9 chunks); 15464 used         HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashTableContext: 8192 total in 1 blocks; 7320 free (0 chunks); 872 used         HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashTableContext: 8192 total in 1 blocks; 7320 free (0 chunks); 872 used         HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashTableContext: 8192 total in 1 blocks; 7752 free (0 chunks); 440 used         HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used       HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashTableContext: 8192 total in 1 blocks; 7624 free (0 chunks); 568 used         HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used       HashBatchFiles: 137609056 total in 16392 blocks; 261704 free (14 chunks); 137347352 used       HashTableContext: 8192 total in 1 blocks; 7624 free (5 chunks); 568 used         HashBatchContext: 106640 total in 3 blocks; 7936 free (0 chunks); 98704 used
2019-04-21 23:40:06.883 UTC [8896] WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 8192 to 16384
2019-04-21 23:40:06.883 UTC [8896] LOG:  ExecHashIncreaseNumBatches ======= context stats start =======       HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashTableContext: 32768 total in 3 blocks; 17304 free (9 chunks); 15464 used         HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashTableContext: 8192 total in 1 blocks; 7320 free (0 chunks); 872 used         HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashTableContext: 8192 total in 1 blocks; 7320 free (0 chunks); 872 used         HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashTableContext: 8192 total in 1 blocks; 7752 free (0 chunks); 440 used         HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used       HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashTableContext: 8192 total in 1 blocks; 7624 free (0 chunks); 568 used         HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used       HashBatchFiles: 137592400 total in 16390 blocks; 261800 free (18 chunks); 137330600 used       HashTableContext: 8192 total in 1 blocks; 7624 free (5 chunks); 568 used         HashBatchContext: 4246496 total in 129 blocks; 7936 free (0 chunks); 4238560 used
Hashjoin 0x1732b88: increasing nbatch to 16384 because space = 4128957
Hashjoin 0x1732b88: freed 10697 of 10764 tuples, space now 25956
2019-04-21 23:40:07.268 UTC [8896] LOG:  ExecHashIncreaseNumBatches ======= context stats end =======       HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashTableContext: 32768 total in 3 blocks; 17304 free (9 chunks); 15464 used         HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashTableContext: 8192 total in 1 blocks; 7320 free (0 chunks); 872 used         HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashTableContext: 8192 total in 1 blocks; 7320 free (0 chunks); 872 used         HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashTableContext: 8192 total in 1 blocks; 7752 free (0 chunks); 440 used         HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used       HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashTableContext: 8192 total in 1 blocks; 7624 free (0 chunks); 568 used         HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used       HashBatchFiles: 275234656 total in 32777 blocks; 523808 free (14 chunks); 274710848 used       HashTableContext: 8192 total in 1 blocks; 7624 free (5 chunks); 568 used         HashBatchContext: 106640 total in 3 blocks; 7936 free (0 chunks); 98704 used
2019-04-21 23:40:09.096 UTC [8896] WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 16384 to 32768
2019-04-21 23:40:09.096 UTC [8896] LOG:  ExecHashIncreaseNumBatches ======= context stats start =======       HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashTableContext: 32768 total in 3 blocks; 17304 free (9 chunks); 15464 used         HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashTableContext: 8192 total in 1 blocks; 7320 free (0 chunks); 872 used         HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashTableContext: 8192 total in 1 blocks; 7320 free (0 chunks); 872 used         HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashTableContext: 8192 total in 1 blocks; 7752 free (0 chunks); 440 used         HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used       HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashTableContext: 8192 total in 1 blocks; 7624 free (0 chunks); 568 used         HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used       HashBatchFiles: 275018128 total in 32751 blocks; 525056 free (66 chunks); 274493072 used       HashTableContext: 8192 total in 1 blocks; 7624 free (5 chunks); 568 used         HashBatchContext: 4279352 total in 130 blocks; 7936 free (0 chunks); 4271416 used
Hashjoin 0x1732b88: increasing nbatch to 32768 because space = 4128890
Hashjoin 0x1732b88: freed 8 of 10809 tuples, space now 4125769
2019-04-21 23:40:10.050 UTC [8896] LOG:  ExecHashIncreaseNumBatches ======= context stats end =======       HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashTableContext: 32768 total in 3 blocks; 17304 free (9 chunks); 15464 used         HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashTableContext: 8192 total in 1 blocks; 7320 free (0 chunks); 872 used         HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashTableContext: 8192 total in 1 blocks; 7320 free (0 chunks); 872 used         HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashTableContext: 8192 total in 1 blocks; 7752 free (0 chunks); 440 used         HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used       HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashTableContext: 8192 total in 1 blocks; 7624 free (0 chunks); 568 used         HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used       HashBatchFiles: 550485856 total in 65546 blocks; 1048056 free (14 chunks); 549437800 used       HashTableContext: 8192 total in 1 blocks; 7624 free (5 chunks); 568 used         HashBatchContext: 4279352 total in 130 blocks; 7936 free (0 chunks); 4271416 used
2019-04-21 23:40:10.060 UTC [8896] WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 32768 to 65536
2019-04-21 23:40:10.060 UTC [8896] LOG:  ExecHashIncreaseNumBatches ======= context stats start =======       HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashTableContext: 32768 total in 3 blocks; 17304 free (9 chunks); 15464 used         HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashTableContext: 8192 total in 1 blocks; 7320 free (0 chunks); 872 used         HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashTableContext: 8192 total in 1 blocks; 7320 free (0 chunks); 872 used         HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashTableContext: 8192 total in 1 blocks; 7752 free (0 chunks); 440 used         HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used       HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashTableContext: 8192 total in 1 blocks; 7624 free (0 chunks); 568 used         HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used       HashBatchFiles: 550485856 total in 65546 blocks; 1048056 free (14 chunks); 549437800 used       HashTableContext: 8192 total in 1 blocks; 7624 free (5 chunks); 568 used         HashBatchContext: 4279352 total in 130 blocks; 7936 free (0 chunks); 4271416 used
Hashjoin 0x1732b88: increasing nbatch to 65536 because space = 4128825
Hashjoin 0x1732b88: freed 20 of 10809 tuples, space now 4121380
2019-04-21 23:40:12.686 UTC [8896] LOG:  ExecHashIncreaseNumBatches ======= context stats end =======       HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashTableContext: 32768 total in 3 blocks; 17304 free (9 chunks); 15464 used         HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashTableContext: 8192 total in 1 blocks; 7320 free (0 chunks); 872 used         HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashTableContext: 8192 total in 1 blocks; 7320 free (0 chunks); 872 used         HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashTableContext: 8192 total in 1 blocks; 7752 free (0 chunks); 440 used         HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used       HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashTableContext: 8192 total in 1 blocks; 7624 free (0 chunks); 568 used         HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used       HashBatchFiles: 1100988256 total in 131083 blocks; 2096592 free (14 chunks); 1098891664 used       HashTableContext: 8192 total in 1 blocks; 7624 free (5 chunks); 568 used         HashBatchContext: 4246496 total in 129 blocks; 7936 free (0 chunks); 4238560 used
2019-04-21 23:40:12.703 UTC [8896] WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 65536 to 131072
2019-04-21 23:40:12.703 UTC [8896] LOG:  ExecHashIncreaseNumBatches ======= context stats start =======       HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashTableContext: 32768 total in 3 blocks; 17304 free (9 chunks); 15464 used         HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashTableContext: 8192 total in 1 blocks; 7320 free (0 chunks); 872 used         HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashTableContext: 8192 total in 1 blocks; 7320 free (0 chunks); 872 used         HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashTableContext: 8192 total in 1 blocks; 7752 free (0 chunks); 440 used         HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used       HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashTableContext: 8192 total in 1 blocks; 7624 free (0 chunks); 568 used         HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used       HashBatchFiles: 1100988256 total in 131083 blocks; 2096592 free (14 chunks); 1098891664 used       HashTableContext: 8192 total in 1 blocks; 7624 free (5 chunks); 568 used         HashBatchContext: 4279352 total in 130 blocks; 7936 free (0 chunks); 4271416 used
Hashjoin 0x1732b88: increasing nbatch to 131072 because space = 4129020
Hashjoin 0x1732b88: freed 2 of 10809 tuples, space now 4128291
2019-04-21 23:40:20.571 UTC [8896] LOG:  ExecHashIncreaseNumBatches ======= context stats end =======       HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashTableContext: 32768 total in 3 blocks; 17304 free (9 chunks); 15464 used         HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashTableContext: 8192 total in 1 blocks; 7320 free (0 chunks); 872 used         HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashTableContext: 8192 total in 1 blocks; 7320 free (0 chunks); 872 used         HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashTableContext: 8192 total in 1 blocks; 7752 free (0 chunks); 440 used         HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used       HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashTableContext: 8192 total in 1 blocks; 7624 free (0 chunks); 568 used         HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used       HashBatchFiles: 2201993056 total in 262156 blocks; 4193704 free (14 chunks); 2197799352 used       HashTableContext: 8192 total in 1 blocks; 7624 free (5 chunks); 568 used         HashBatchContext: 4279352 total in 130 blocks; 7936 free (0 chunks); 4271416 used
2019-04-21 23:40:20.602 UTC [8896] WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 131072 to 262144
2019-04-21 23:40:20.602 UTC [8896] LOG:  ExecHashIncreaseNumBatches ======= context stats start =======       HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashTableContext: 32768 total in 3 blocks; 17304 free (9 chunks); 15464 used         HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashTableContext: 8192 total in 1 blocks; 7320 free (0 chunks); 872 used         HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashTableContext: 8192 total in 1 blocks; 7320 free (0 chunks); 872 used         HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashTableContext: 8192 total in 1 blocks; 7752 free (0 chunks); 440 used         HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used       HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashTableContext: 8192 total in 1 blocks; 7624 free (0 chunks); 568 used         HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used       HashBatchFiles: 2201993056 total in 262156 blocks; 4193704 free (14 chunks); 2197799352 used       HashTableContext: 8192 total in 1 blocks; 7624 free (5 chunks); 568 used         HashBatchContext: 4279352 total in 130 blocks; 7936 free (0 chunks); 4271416 used
Hashjoin 0x1732b88: increasing nbatch to 262144 because space = 4129055       HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashTableContext: 32768 total in 3 blocks; 17304 free (9 chunks); 15464 used         HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashTableContext: 8192 total in 1 blocks; 7320 free (0 chunks); 872 used         HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashTableContext: 8192 total in 1 blocks; 7320 free (0 chunks); 872 used         HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashTableContext: 8192 total in 1 blocks; 7752 free (0 chunks); 440 used         HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used       HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       HashTableContext: 8192 total in 1 blocks; 7624 free (0 chunks); 568 used         HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used       HashBatchFiles: 2468537520 total in 293910 blocks; 2669512 free (14 chunks); 2465868008 used       HashTableContext: 8192 total in 1 blocks; 7624 free (5 chunks); 568 used         HashBatchContext: 4279352 total in 130 blocks; 7936 free (0 chunks); 4271416 used
2019-04-21 23:40:23.118 UTC [8896] DETAIL:  Failed on request of size 8272 in memory context "HashBatchFiles".
(END)

-Gunther

Re: Out of Memory errors are frustrating as heck!

From

Gunther

Date:

22 April 2019, 00:55:03

Now to Justin's patch. First undo Tomas' patch and apply:

$ mv src/include/executor/hashjoin.h.orig src/include/executor/hashjoin.h
$ mv src/backend/executor/nodeHash.c.orig src/backend/executor/nodeHash.c
$ mv src/backend/executor/nodeHashjoin.c.orig src/backend/executor/nodeHashjoin.c
$ patch -p1 <../limit-hash-nbatches-v2.patch
patching file src/backend/executor/nodeHash.c
Hunk #1 succeeded at 570 (offset -3 lines).
Hunk #2 succeeded at 917 (offset -3 lines).
Hunk #3 succeeded at 930 (offset -3 lines).
Hunk #4 succeeded at 1037 (offset -3 lines).
Hunk #5 succeeded at 1658 (offset -4 lines).

$ make 
$ make install
$ pg_ctl -c restart

and go ...

lots of CPU% again and very limited memory use as of yet.

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
11054 postgres  20   0 1302772  90316  58004 R  94.4  1.1   4:38.05 postgres: postgres integrator [local] EXPLAIN
11055 postgres  20   0 1280532  68076  57168 R  97.7  0.9   2:03.54 postgres: parallel worker for PID 11054
11056 postgres  20   0 1280532  67964  57124 S   0.0  0.9   2:08.28 postgres: parallel worker for PID 11054

that's a pretty decent sign so far. Slight increase ... but still relatively steady ...

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
11054 postgres  20   0 1379704 167140  58004 R  95.0  2.1   5:56.28 postgres: postgres integrator [local] EXPLAIN
11055 postgres  20   0 1280532  68076  57168 S  25.6  0.9   2:36.59 postgres: parallel worker for PID 11054
11056 postgres  20   0 1280532  67964  57124 R  61.8  0.9   2:29.65 postgres: parallel worker for PID 11054

aaand break out ...

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
11119 postgres  20   0 1271660   1.0g   1.0g D   0.0 13.4   0:03.10 postgres: parallel worker for PID 11054
11054 postgres  20   0 1380940   1.0g 950508 D   0.0 13.4   6:56.09 postgres: postgres integrator [local] EXPLAIN
11118 postgres  20   0 1271660 884540 882724 D   0.0 11.2   0:02.84 postgres: parallel worker for PID 11054

and crash:

foo=# explain analyze select * from reports.v_BusinessOperation;
server closed the connection unexpectedly       This probably means the server terminated abnormally       before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!>

what happened? ouch, no space left on root device, too much logging? Maybe the core dump ... Log file content is simple:

2019-04-22 00:07:56.104 UTC [11048] LOG:  database system was shut down at 2019-04-22 00:07:55 UTC
2019-04-22 00:07:56.108 UTC [11046] LOG:  database system is ready to accept connections
Hashjoin 0x2122458: initial nbatch = 16, nbuckets = 8192
Hashjoin 0x2122458: increasing nbatch to 32 because space = 4128933
Hashjoin 0x2122458: freed 148 of 10584 tuples, space now 4071106
Hashjoin 0x2122458: increasing nbatch to 64 because space = 4128826
Hashjoin 0x2122458: freed 544 of 10584 tuples, space now 3916296
Hashjoin 0x2122458: increasing nbatch to 128 because space = 4128846
Hashjoin 0x2122458: freed 10419 of 10585 tuples, space now 65570
Hashjoin 0x2122458: increasing nbatch to 256 because space = 4128829
Hashjoin 0x2122458: freed 10308 of 10734 tuples, space now 161815
Hashjoin 0x2122458: increasing nbatch to 512 because space = 4128908
Hashjoin 0x2122458: freed 398 of 10379 tuples, space now 3977787
Hashjoin 0x3ac9918: initial nbatch = 1, nbuckets = 1024
Hashjoin 0x3ac91a8: initial nbatch = 1, nbuckets = 1024
Hashjoin 0x3ac93c8: initial nbatch = 1, nbuckets = 1024
Hashjoin 0x1f41018: initial nbatch = 1, nbuckets = 1024
Hashjoin 0x1f41048: initial nbatch = 1, nbuckets = 1024
2019-04-22 00:16:55.273 UTC [11046] LOG:  background worker "parallel worker" (PID 11119) was terminated by signal 11: Segmentation fault
2019-04-22 00:16:55.273 UTC [11046] DETAIL:  Failed process was running: explain analyze select * from reports.v_BusinessOperation;
2019-04-22 00:16:55.273 UTC [11046] LOG:  terminating any other active server processes
2019-04-22 00:16:55.274 UTC [11058] WARNING:  terminating connection because of crash of another server process
2019-04-22 00:16:55.274 UTC [11058] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2019-04-22 00:16:55.274 UTC [11058] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2019-04-22 00:16:55.277 UTC [11052] LOG:  could not write temporary statistics file "pg_stat/db_16903.tmp": No space left on device
2019-04-22 00:16:55.278 UTC [11052] LOG:  could not close temporary statistics file "pg_stat/db_0.tmp": No space left on device
2019-04-22 00:16:55.278 UTC [11052] LOG:  could not close temporary statistics file "pg_stat/global.tmp": No space left on device
2019-04-22 00:16:55.315 UTC [11046] LOG:  all server processes terminated; reinitializing
2019-04-22 00:16:55.425 UTC [11123] LOG:  database system was interrupted; last known up at 2019-04-22 00:12:56 UTC
2019-04-22 00:16:55.426 UTC [11124] FATAL:  the database system is in recovery mode
2019-04-22 00:16:55.545 UTC [11123] LOG:  database system was not properly shut down; automatic recovery in progress
2019-04-22 00:16:55.549 UTC [11123] LOG:  redo starts at 3D2/C44FDCF8

ok it is all because it dumped 3 core dumps, glad that I captured the top lines of the backend and its 2 workers

-rw------- 1 postgres postgres 1075843072 Apr 22 00:16 core.11054 -- backend
-rw------- 1 postgres postgres 894640128 Apr 22 00:16 core.11118 -- worker 1
-rw------- 1 postgres postgres 1079103488 Apr 22 00:16 core.11119 -- worker 2

And the melt down starts with "parallel worker" (PID 11119) receiving SIGSEGV.

So let's get gdb to the task to see what's up:

$ gdb -c data/core.11119 postgresql-11.2/src/backend/postgres
...
Reading symbols from postgresql-11.2/src/backend/postgres...done.
BFD: Warning: /var/lib/pgsql/data/core.11119 is truncated: expected core file size >= 1127112704, found: 1079103488.
[New LWP 11119]
Cannot access memory at address 0x7ff8d25dc108
Cannot access memory at address 0x7ff8d25dc100
Failed to read a valid object file image from memory.
Core was generated by `postgres: parallel worker for'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00000000006bd792 in ExecParallelHashJoinNewBatch (   hjstate=<error reading variable: Cannot access memory at address 0x7ffd45fa9c38>) at nodeHashjoin.c:1127
1127                            {
(gdb) bt 8
#0  0x00000000006bd792 in ExecParallelHashJoinNewBatch (   hjstate=<error reading variable: Cannot access memory at address 0x7ffd45fa9c38>) at nodeHashjoin.c:1127
Backtrace stopped: Cannot access memory at address 0x7ffd45fa9c88
(gdb) info frame
Stack level 0, frame at 0x7ffd45fa9c90:rip = 0x6bd792 in ExecParallelHashJoinNewBatch (nodeHashjoin.c:1127); saved rip = <not saved>Outermost frame: Cannot access memory at address 0x7ffd45fa9c88source language c.Arglist at 0x7ffd45fa9c80, args: hjstate=<error reading variable: Cannot access memory at address 0x7ffd45fa9c38>Locals at 0x7ffd45fa9c80, Previous frame's sp is 0x7ffd45fa9c90
Cannot access memory at address 0x7ffd45fa9c80
(gdb) list
1122                            SharedTuplestoreAccessor *inner_tuples;
1123                            Barrier    *batch_barrier =
1124                            &hashtable->batches[batchno].shared->batch_barrier;
1125
1126                            switch (BarrierAttach(batch_barrier))
1127                            {
1128                                    case PHJ_BATCH_ELECTING:
1129
1130                                            /* One backend allocates the hash table. */
1131                                            if (BarrierArriveAndWait(batch_barrier,

unfortunately this core file is truncated because of the file system running out of space. Let's see the others.

$ gdb -c data/core.11118 postgresql-11.2/src/backend/postgres
...
Reading symbols from postgresql-11.2/src/backend/postgres...done.
BFD: Warning: /var/lib/pgsql/data/core.11118 is truncated: expected core file size >= 1127112704, found: 894640128.
[New LWP 11118]
Cannot access memory at address 0x7ff8d25dc108
Cannot access memory at address 0x7ff8d25dc100
Failed to read a valid object file image from memory.
Core was generated by `postgres: parallel worker for'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00000000006bd792 in ExecParallelHashJoinNewBatch (   hjstate=<error reading variable: Cannot access memory at address 0x7ffd45fa9c38>) at nodeHashjoin.c:1127
1127                            {
(gdb) bt 5
#0  0x00000000006bd792 in ExecParallelHashJoinNewBatch (   hjstate=<error reading variable: Cannot access memory at address 0x7ffd45fa9c38>) at nodeHashjoin.c:1127
Backtrace stopped: Cannot access memory at address 0x7ffd45fa9c88
(gdb) list
1122                            SharedTuplestoreAccessor *inner_tuples;
1123                            Barrier    *batch_barrier =
1124                            &hashtable->batches[batchno].shared->batch_barrier;
1125
1126                            switch (BarrierAttach(batch_barrier))
1127                            {
1128                                    case PHJ_BATCH_ELECTING:
1129
1130                                            /* One backend allocates the hash table. */
1131                                            if (BarrierArriveAndWait(batch_barrier,

strange, that one must have died very similar, same place, also truncated.

$ gdb -c data/core.11054 postgresql-11.2/src/backend/postgres
...
Reading symbols from postgresql-11.2/src/backend/postgres...done.
BFD: Warning: /var/lib/pgsql/data/core.11054 is truncated: expected core file size >= 1238786048, found: 1075843072.
[New LWP 11054]
Cannot access memory at address 0x7ff8d25dc108
Cannot access memory at address 0x7ff8d25dc100
Failed to read a valid object file image from memory.
Core was generated by `postgres: postgres integrator'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00000000006bd792 in ExecParallelHashJoinNewBatch (   hjstate=<error reading variable: Cannot access memory at address 0x7ffd45fa9498>) at nodeHashjoin.c:1127
1127                            {
(

I don't understand why all of them are at the same location. Doesn't make any sense to me.

But I'll leave it at that right now.

-Gunther

Re: Out of Memory errors are frustrating as heck!

From

Tomas Vondra

Date:

22 April 2019, 03:09:27

On Sun, Apr 21, 2019 at 11:40:22AM -0500, Justin Pryzby wrote:
>On Sun, Apr 21, 2019 at 10:36:43AM -0400, Tom Lane wrote:
>> Jeff Janes <jeff.janes@gmail.com> writes:
>> > The growEnabled stuff only prevents infinite loops.  It doesn't prevent
>> > extreme silliness.
>>
>> > If a single 32 bit hash value has enough tuples by itself to not fit in
>> > work_mem, then it will keep splitting until that value is in a batch by
>> > itself before shutting off
>>
>> I suspect, however, that we might be better off just taking the existence
>> of the I/O buffers into account somehow while deciding whether it's worth
>> growing further.  That is, I'm imagining adding a second independent
>> reason for shutting off growEnabled, along the lines of "increasing
>> nbatch any further will require an unreasonable amount of buffer memory".
>> The question then becomes how to define "unreasonable".
>
>On Sun, Apr 21, 2019 at 06:15:25PM +0200, Tomas Vondra wrote:
>> I think the question the code needs to be asking is "If we double the
>> number of batches, does the amount of memory we need drop?" And the
>> memory needs to account both for the buffers and per-batch data.
>>
>> I don't think we can just stop increasing the number of batches when the
>> memory for BufFile exceeds work_mem, because that entirely ignores the
>> fact that by doing that we force the system to keep the per-batch stuff
>> in memory (and that can be almost arbitrary amount).
>...
>> Of course, this just stops enforcing work_mem at some point, but it at
>> least attempts to minimize the amount of memory used.
>
>This patch defines reasonable as "additional BatchFiles will not themselves
>exceed work_mem; OR, exceeded work_mem already but additional BatchFiles are
>going to save us RAM"...
>

OK.

>I think the first condition is insensitive and not too important to get right,
>it only allows work_mem to be exceeded by 2x, which maybe already happens for
>multiple reasons, related to this thread and otherwise.  It'd be fine to slap
>on a factor of /2 or /4 or /8 there too.
>

TBH I'm not quite sure I understand all the conditions in the patch - it
seems unnecessarily complicated. And I don't think it actually minimizes
the amount of memory used for hash table + buffers, because it keeps the
same spaceAllowed (which pushes nbatches up). At some point it actually
makes to bump spaceAllowed and make larger batches instead of adding
more batches, and the patch does not seem to do that.

Also, the patch does this:

    if (hashtable->nbatch*sizeof(PGAlignedBlock) < hashtable->spaceAllowed)
    {
        ExecHashIncreaseNumBatches(hashtable);
    }
    else if (hashtable->spaceUsed/2 >= hashtable->spaceAllowed)
    {
        /* Exceeded spaceAllowed by 2x, so we'll save RAM by allowing nbatches to increase */
        /* I think this branch would be hit almost same as below branch */
        ExecHashIncreaseNumBatches(hashtable);
    }
    ...

but the reasoning for the second branch seems wrong, because

    (spaceUsed/2 >= spaceAllowed)

is not enough to guarantee that we actually save memory by doubling the
number of batches. To do that, we need to make sure that

   (spaceUsed/2 >= hashtable->nbatch * sizeof(PGAlignedBlock))

But that may not be true - it certainly is not guaranteed by not getting
into the first branch.

Consider an ideal example with uniform distribution:

    create table small (id bigint, val text);
    create table large (id bigint, val text);

    insert into large select 1000000000 * random(), md5(i::text)
    from generate_series(1, 700000000) s(i);

    insert into small select 1000000000 * random(), md5(i::text)
    from generate_series(1, 10000) s(i);

    vacuum analyze large;
    vacuum analyze small;

    update pg_class set (relpages, reltuples) = (1000000, 1)
    where relname = 'large';

    update pg_class set (relpages, reltuples) = (1, 1000000)
    where relname = 'small';

    set work_mem = '1MB';

    explain analyze select * from small join large using (id);

A log after each call to ExecHashIncreaseNumBatches says this:

    nbatch=2 spaceUsed=463200 spaceAllowed=1048576 BufFile=16384
    nbatch=4 spaceUsed=463120 spaceAllowed=1048576 BufFile=32768
    nbatch=8 spaceUsed=457120 spaceAllowed=1048576 BufFile=65536
    nbatch=16 spaceUsed=458320 spaceAllowed=1048576 BufFile=131072
    nbatch=32 spaceUsed=457120 spaceAllowed=1048576 BufFile=262144
    nbatch=64 spaceUsed=459200 spaceAllowed=1048576 BufFile=524288
    nbatch=128 spaceUsed=455600 spaceAllowed=1048576 BufFile=1048576
    nbatch=256 spaceUsed=525120 spaceAllowed=1048576 BufFile=2097152
    nbatch=256 spaceUsed=2097200 spaceAllowed=1048576 BufFile=2097152
    nbatch=512 spaceUsed=2097200 spaceAllowed=1048576 BufFile=4194304
    nbatch=1024 spaceUsed=2097200 spaceAllowed=1048576 BufFile=8388608
    nbatch=2048 spaceUsed=2097200 spaceAllowed=1048576 BufFile=16777216
    nbatch=4096 spaceUsed=2097200 spaceAllowed=1048576 BufFile=33554432
    nbatch=8192 spaceUsed=2097200 spaceAllowed=1048576 BufFile=67108864
    nbatch=16384 spaceUsed=2097200 spaceAllowed=1048576 BufFile=134217728

So we've succeeded in keeping spaceUsed below 2*spaceAllowed (which
seems rather confusing, BTW), but we've allocated 128MB for BufFile. So
about 130MB in total. With 16k batches.

What I think might work better is the attached v2 of the patch, with a
single top-level condition, comparing the combined memory usage
(spaceUsed + BufFile) against spaceAllowed. But it also tweaks
spaceAllowed once the size needed for BufFile gets over work_mem/3.

And it behaves like this:

    nbatch=2 spaceUsed=458640 spaceAllowed=1048576 BufFile=16384
    nbatch=4 spaceUsed=455040 spaceAllowed=1048576 BufFile=32768
    nbatch=8 spaceUsed=440160 spaceAllowed=1048576 BufFile=65536
    nbatch=16 spaceUsed=426560 spaceAllowed=1048576 BufFile=131072
    nbatch=32 spaceUsed=393200 spaceAllowed=1048576 BufFile=262144
    nbatch=64 spaceUsed=329120 spaceAllowed=1572864 BufFile=524288
    nbatch=128 spaceUsed=455600 spaceAllowed=3145728 BufFile=1048576
    nbatch=256 spaceUsed=987440 spaceAllowed=6291456 BufFile=2097152
    nbatch=512 spaceUsed=2040560 spaceAllowed=12582912 BufFile=4194304
    nbatch=1024 spaceUsed=4114640 spaceAllowed=25165824 BufFile=8388608
    nbatch=2048 spaceUsed=8302880 spaceAllowed=50331648 BufFile=16777216

So we end up with just 2k batches, using ~24MB of memory in total.
That's because the spaceAllowed limit was bumped up instead of adding
more and more batches.

>The current patch doesn't unset growEnabled, since there's no point at which
>the hashtable should grow without bound: if hash tables are *already* exceeding
>work_mem by 2x as big, nbatches should be doubled.
>

Not sure. I guess it might be useful to re-evaluate the flag after a
while - not necessarily by actually enabling it right away, but just
checking if it'd move any tuples. Just disabling it once may be an issue
when the input data is somehow correlated, which seems to be one of the
issues with the data set discussed in this thread.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

v2-0001-account-for-size-of-BatchFile-structure-in-hashJo.patch

Re: Out of Memory errors are frustrating as heck!

From

Gaetano Mendola

Date:

22 April 2019, 08:07:52

Batch splitting shouldn't be followed by a hash function change?

On Mon, Apr 22, 2019, 05:09 Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

On Sun, Apr 21, 2019 at 11:40:22AM -0500, Justin Pryzby wrote:
>On Sun, Apr 21, 2019 at 10:36:43AM -0400, Tom Lane wrote:
>> Jeff Janes <jeff.janes@gmail.com> writes:
>> > The growEnabled stuff only prevents infinite loops. It doesn't prevent
>> > extreme silliness.
>>
>> > If a single 32 bit hash value has enough tuples by itself to not fit in
>> > work_mem, then it will keep splitting until that value is in a batch by
>> > itself before shutting off
>>
>> I suspect, however, that we might be better off just taking the existence
>> of the I/O buffers into account somehow while deciding whether it's worth
>> growing further. That is, I'm imagining adding a second independent
>> reason for shutting off growEnabled, along the lines of "increasing
>> nbatch any further will require an unreasonable amount of buffer memory".
>> The question then becomes how to define "unreasonable".
>
>On Sun, Apr 21, 2019 at 06:15:25PM +0200, Tomas Vondra wrote:
>> I think the question the code needs to be asking is "If we double the
>> number of batches, does the amount of memory we need drop?" And the
>> memory needs to account both for the buffers and per-batch data.
>>
>> I don't think we can just stop increasing the number of batches when the
>> memory for BufFile exceeds work_mem, because that entirely ignores the
>> fact that by doing that we force the system to keep the per-batch stuff
>> in memory (and that can be almost arbitrary amount).
>...
>> Of course, this just stops enforcing work_mem at some point, but it at
>> least attempts to minimize the amount of memory used.
>
>This patch defines reasonable as "additional BatchFiles will not themselves
>exceed work_mem; OR, exceeded work_mem already but additional BatchFiles are
>going to save us RAM"...
>

OK.

>I think the first condition is insensitive and not too important to get right,
>it only allows work_mem to be exceeded by 2x, which maybe already happens for
>multiple reasons, related to this thread and otherwise. It'd be fine to slap
>on a factor of /2 or /4 or /8 there too.
>

TBH I'm not quite sure I understand all the conditions in the patch - it
seems unnecessarily complicated. And I don't think it actually minimizes
the amount of memory used for hash table + buffers, because it keeps the
same spaceAllowed (which pushes nbatches up). At some point it actually
makes to bump spaceAllowed and make larger batches instead of adding
more batches, and the patch does not seem to do that.

Also, the patch does this:

if (hashtable->nbatch*sizeof(PGAlignedBlock) < hashtable->spaceAllowed)
{
ExecHashIncreaseNumBatches(hashtable);
}
else if (hashtable->spaceUsed/2 >= hashtable->spaceAllowed)
{
/* Exceeded spaceAllowed by 2x, so we'll save RAM by allowing nbatches to increase */
/* I think this branch would be hit almost same as below branch */
ExecHashIncreaseNumBatches(hashtable);
}
...

but the reasoning for the second branch seems wrong, because

(spaceUsed/2 >= spaceAllowed)

is not enough to guarantee that we actually save memory by doubling the
number of batches. To do that, we need to make sure that

(spaceUsed/2 >= hashtable->nbatch * sizeof(PGAlignedBlock))

But that may not be true - it certainly is not guaranteed by not getting
into the first branch.

Consider an ideal example with uniform distribution:

create table small (id bigint, val text);
create table large (id bigint, val text);

insert into large select 1000000000 * random(), md5(i::text)
from generate_series(1, 700000000) s(i);

insert into small select 1000000000 * random(), md5(i::text)
from generate_series(1, 10000) s(i);

vacuum analyze large;
vacuum analyze small;

update pg_class set (relpages, reltuples) = (1000000, 1)
where relname = 'large';

update pg_class set (relpages, reltuples) = (1, 1000000)
where relname = 'small';

set work_mem = '1MB';

explain analyze select * from small join large using (id);

A log after each call to ExecHashIncreaseNumBatches says this:

nbatch=2 spaceUsed=463200 spaceAllowed=1048576 BufFile=16384
nbatch=4 spaceUsed=463120 spaceAllowed=1048576 BufFile=32768
nbatch=8 spaceUsed=457120 spaceAllowed=1048576 BufFile=65536
nbatch=16 spaceUsed=458320 spaceAllowed=1048576 BufFile=131072
nbatch=32 spaceUsed=457120 spaceAllowed=1048576 BufFile=262144
nbatch=64 spaceUsed=459200 spaceAllowed=1048576 BufFile=524288
nbatch=128 spaceUsed=455600 spaceAllowed=1048576 BufFile=1048576
nbatch=256 spaceUsed=525120 spaceAllowed=1048576 BufFile=2097152
nbatch=256 spaceUsed=2097200 spaceAllowed=1048576 BufFile=2097152
nbatch=512 spaceUsed=2097200 spaceAllowed=1048576 BufFile=4194304
nbatch=1024 spaceUsed=2097200 spaceAllowed=1048576 BufFile=8388608
nbatch=2048 spaceUsed=2097200 spaceAllowed=1048576 BufFile=16777216
nbatch=4096 spaceUsed=2097200 spaceAllowed=1048576 BufFile=33554432
nbatch=8192 spaceUsed=2097200 spaceAllowed=1048576 BufFile=67108864
nbatch=16384 spaceUsed=2097200 spaceAllowed=1048576 BufFile=134217728

So we've succeeded in keeping spaceUsed below 2*spaceAllowed (which
seems rather confusing, BTW), but we've allocated 128MB for BufFile. So
about 130MB in total. With 16k batches.

What I think might work better is the attached v2 of the patch, with a
single top-level condition, comparing the combined memory usage
(spaceUsed + BufFile) against spaceAllowed. But it also tweaks
spaceAllowed once the size needed for BufFile gets over work_mem/3.

And it behaves like this:

nbatch=2 spaceUsed=458640 spaceAllowed=1048576 BufFile=16384
nbatch=4 spaceUsed=455040 spaceAllowed=1048576 BufFile=32768
nbatch=8 spaceUsed=440160 spaceAllowed=1048576 BufFile=65536
nbatch=16 spaceUsed=426560 spaceAllowed=1048576 BufFile=131072
nbatch=32 spaceUsed=393200 spaceAllowed=1048576 BufFile=262144
nbatch=64 spaceUsed=329120 spaceAllowed=1572864 BufFile=524288
nbatch=128 spaceUsed=455600 spaceAllowed=3145728 BufFile=1048576
nbatch=256 spaceUsed=987440 spaceAllowed=6291456 BufFile=2097152
nbatch=512 spaceUsed=2040560 spaceAllowed=12582912 BufFile=4194304
nbatch=1024 spaceUsed=4114640 spaceAllowed=25165824 BufFile=8388608
nbatch=2048 spaceUsed=8302880 spaceAllowed=50331648 BufFile=16777216

So we end up with just 2k batches, using ~24MB of memory in total.
That's because the spaceAllowed limit was bumped up instead of adding
more and more batches.

>The current patch doesn't unset growEnabled, since there's no point at which
>the hashtable should grow without bound: if hash tables are *already* exceeding
>work_mem by 2x as big, nbatches should be doubled.
>

Not sure. I guess it might be useful to re-evaluate the flag after a
while - not necessarily by actually enabling it right away, but just
checking if it'd move any tuples. Just disabling it once may be an issue
when the input data is somehow correlated, which seems to be one of the
issues with the data set discussed in this thread.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Out of Memory errors are frustrating as heck!

From

Tomas Vondra

Date:

22 April 2019, 13:04:15

On Mon, Apr 22, 2019 at 10:07:52AM +0200, Gaetano Mendola wrote:
>   Batch splitting shouldn't be followed by a hash function change? 

What would be the value? That can help with hash collisions, but that's
not the issue with the data sets discussed in this thread. The issue
reported originally is about underestimates, and the sample data set has
a large number of duplicate values (a single value representing ~10% of
the data set). Neither of those issues is about hash collisions.

The data set I used to demonstrate how the algorithms work is pretty
perfect, with uniform distribution and no hash collisions.

Furthermore, I don't think we can just change the hash function, for a
couple of technical reasons.

Firstly, it's not like we totally redistribute the whole dataset from N
old batches to (2*N) new ones. By using the same 32-bit hash value and
cconsidering one extra bit, the tuples either stay in the same batch
(when the new bit is 0) or move to a single new batch (when it's 1). So
each batch is split in 1/2. By changing the hash function this would no
longer be true, and we'd redistribute pretty much the whole data set.

The other issue is even more significant - we don't redistribute the
tuples immediately. We only redistribute the current batch, but leave
the other batches alone and handle them when we actually get to them.
This is possible, because the tuples never move backwards - when
splitting batch K, the tuples either stay in K or move to 2K. Or
something like that, I'm too lazy to recall the exact formula now.

And if I recall correctly, I think we can increment the number of
batches while already performing the join, after some rows were already
processed. That would probably be no longer true if we just switched the
hash function, because it might move rows backwards (to the already
processed region).


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Out of Memory errors are frustrating as heck!

From

Jeff Janes

Date:

22 April 2019, 17:15:10

On Sat, Apr 20, 2019 at 4:48 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Gunther <raj@gusw.net> writes:
> and checked my log file and there was nothing before the call
> MemoryContextStats(TopPortalContext) so I don't understand where this
> printf stuff is ending up.

It's going to stdout, which is likely block-buffered whereas stderr
is line-buffered, so data from the latter will show up in your log
file much sooner. You might consider adding something to startup
to switch stdout to line buffering.

Is there a reason to not just elog the HJDEBUG stuff? With some of the other DEBUG defines, we will probably be using them before the logging system is set up, but surely we won't be doing Hash Joins that early?

I think we could just get rid of the conditional compilation and elog this at DEBUG1 or DEBUG2. Or keep the conditional compilation and elog it at LOG.

Cheers,

Jeff

Re: Out of Memory errors are frustrating as heck!

From

Tom Lane

Date:

22 April 2019, 17:57:34

Jeff Janes <jeff.janes@gmail.com> writes:
> Is there a reason to not just elog the HJDEBUG stuff?

Yes --- it'd be expensive (a "no op" elog is far from free) and
useless to ~ 99.999% of users.

Almost all the conditionally-compiled debug support in the PG executor
is legacy leftovers from Berkeley days.  If it were useful more often
than once in a blue moon, we probably would have made it more easily
reachable long ago.  I'm a bit surprised we haven't just ripped it
out, TBH.  When I find myself needing extra debug output, it's almost
never the case that any of that old code does what I need.

There might be a case for changing it all to print to stderr not
stdout, so that it plays more nicely with elog/ereport output
when you do have it turned on, though.

            regards, tom lane

Re: Out of Memory errors are frustrating as heck!

From

Gunther

Date:

23 April 2019, 20:37:50

On 4/21/2019 23:09, Tomas Vondra wrote:

What I think might work better is the attached v2 of the patch, with a
single top-level condition, comparing the combined memory usage
(spaceUsed + BufFile) against spaceAllowed. But it also tweaks
spaceAllowed once the size needed for BufFile gets over work_mem/3.

Thanks for this, and I am trying this now.

So far it is promising.

I see the memory footprint contained under 1 GB. I see it go up, but also down again. CPU, IO, all being live.

foo=# set enable_nestloop to off;
SET
foo=# explain analyze select * from reports.v_BusinessOperation;
WARNING:  ExecHashIncreaseNumBatches: nbatch=32 spaceAllowed=4194304
WARNING:  ExecHashIncreaseNumBatches: nbatch=64 spaceAllowed=4194304
WARNING:  ExecHashIncreaseNumBatches: nbatch=128 spaceAllowed=4194304
WARNING:  ExecHashIncreaseNumBatches: nbatch=256 spaceAllowed=6291456
WARNING:  ExecHashIncreaseNumBatches: nbatch=512 spaceAllowed=12582912
WARNING:  ExecHashIncreaseNumBatches: nbatch=1024 spaceAllowed=25165824
WARNING:  ExecHashIncreaseNumBatches: nbatch=2048 spaceAllowed=50331648
WARNING:  ExecHashIncreaseNumBatches: nbatch=4096 spaceAllowed=100663296
WARNING:  ExecHashIncreaseNumBatches: nbatch=8192 spaceAllowed=201326592
WARNING:  ExecHashIncreaseNumBatches: nbatch=16384 spaceAllowed=402653184
WARNING:  ExecHashIncreaseNumBatches: nbatch=32768 spaceAllowed=805306368
WARNING:  ExecHashIncreaseNumBatches: nbatch=65536 spaceAllowed=1610612736

Aaaaaand, it's a winner!

 Unique  (cost=5551524.36..5554207.33 rows=34619 width=1197) (actual time=6150303.060..6895451.210 rows=435274 loops=1)  ->  Sort  (cost=5551524.36..5551610.91 rows=34619 width=1197) (actual time=6150303.058..6801372.192 rows=113478386 loops=1)        Sort Key: documentinformationsubject.documentinternalid, documentinformationsubject.is_current, documentinformationsubject.documentid, documentinformationsubject.documenttypecode, documentinformationsubject.subjectroleinternalid, documentinformationsubject.subjectentityinternalid, documentinformationsubject.subjectentityid, documentinformationsubject.subjectentityidroot, documentinformationsubject.subjectentityname, documentinformationsubject.subjectentitytel, documentinformationsubject.subjectentityemail, documentinformationsubject.otherentityinternalid, documentinformationsubject.confidentialitycode, documentinformationsubject.actinternalid, documentinformationsubject.code_code, documentinformationsubject.code_displayname, q.code_code, q.code_displayname, an.extension, an.root, documentinformationsubject_2.subjectentitycode, documentinformationsubject_2.subjectentitycodesystem, documentinformationsubject_2.effectivetime_low, documentinformationsubject_2.effectivetime_high, documentinformationsubject_2.statuscode, documentinformationsubject_2.code_code, agencyid.extension, agencyname.trivialname, documentinformationsubject_1.subjectentitycode, documentinformationsubject_1.subjectentityinternalid        Sort Method: external merge  Disk: 40726720kB        ->  Hash Right Join  (cost=4255031.53..5530808.71 rows=34619 width=1197) (actual time=325240.679..1044194.775 rows=113478386 loops=1)              Hash Cond: (((q.documentinternalid)::text = (documentinformationsubject.documentinternalid)::text) AND ((r.targetinternalid)::text = (documentinformationsubject.actinternalid)::text))              ->  Hash Right Join  (cost=1341541.37..2612134.36 rows=13 width=341) (actual time=81093.327..81093.446 rows=236 loops=1)                    Hash Cond: (((documentinformationsubject_2.documentinternalid)::text = (q.documentinternalid)::text) AND ((documentinformationsubject_2.actinternalid)::text = (q.actinternalid)::text))                    ->  Gather  (cost=31291.54..1301884.52 rows=1 width=219) (actual time=41920.563..41929.780 rows=0 loops=1)                          Workers Planned: 2                          Workers Launched: 2                          ->  Parallel Hash Left Join  (cost=30291.54..1300884.42 rows=1 width=219) (actual time=41915.960..41915.960 rows=0 loops=3)                                Hash Cond: ((documentinformationsubject_2.otherentityinternalid)::text = (agencyid.entityinternalid)::text)                                ->  Parallel Hash Left Join  (cost=28606.13..1299199.00 rows=1 width=204) (actual time=41862.767..41862.769 rows=0 loops=3)                                      Hash Cond: ((documentinformationsubject_2.otherentityinternalid)::text = (agencyname.entityinternalid)::text)                                      ->  Parallel Seq Scan on documentinformationsubject documentinformationsubject_2  (cost=0.00..1268800.85 rows=1 width=177) (actual time=40805.337..40805.337 rows=0 loops=3)                                            Filter: ((participationtypecode)::text = 'AUT'::text)                                            Rows Removed by Filter: 2815562                                      ->  Parallel Hash  (cost=24733.28..24733.28 rows=166628 width=64) (actual time=981.000..981.001 rows=133303 loops=3)                                            Buckets: 65536  Batches: 16  Memory Usage: 3136kB                                            ->  Parallel Seq Scan on bestname agencyname  (cost=0.00..24733.28 rows=166628 width=64) (actual time=0.506..916.816 rows=133303 loops=3)                                ->  Parallel Hash  (cost=1434.07..1434.07 rows=20107 width=89) (actual time=52.350..52.350 rows=11393 loops=3)                                      Buckets: 65536  Batches: 1  Memory Usage: 4680kB                                      ->  Parallel Seq Scan on entity_id agencyid  (cost=0.00..1434.07 rows=20107 width=89) (actual time=0.376..46.875 rows=11393 loops=3)                    ->  Hash  (cost=1310249.63..1310249.63 rows=13 width=233) (actual time=39172.682..39172.682 rows=236 loops=1)                          Buckets: 1024  Batches: 1  Memory Usage: 70kB                          ->  Hash Right Join  (cost=829388.20..1310249.63 rows=13 width=233) (actual time=35084.850..39172.545 rows=236 loops=1)                                Hash Cond: ((an.actinternalid)::text = (q.actinternalid)::text)                                ->  Seq Scan on act_id an  (cost=0.00..425941.04 rows=14645404 width=134) (actual time=0.908..7583.123 rows=14676871 loops=1)                                ->  Hash  (cost=829388.19..829388.19 rows=1 width=136) (actual time=29347.614..29347.614 rows=236 loops=1)                                      Buckets: 1024  Batches: 1  Memory Usage: 63kB                                      ->  Gather  (cost=381928.46..829388.19 rows=1 width=136) (actual time=23902.428..29347.481 rows=236 loops=1)                                            Workers Planned: 2                                            Workers Launched: 2                                            ->  Parallel Hash Join  (cost=380928.46..828388.09 rows=1 width=136) (actual time=23915.790..29336.452 rows=79 loops=3)                                                  Hash Cond: ((q.actinternalid)::text = (r.sourceinternalid)::text)                                                  ->  Parallel Seq Scan on documentinformation q  (cost=0.00..447271.93 rows=50050 width=99) (actual time=10055.238..15484.478 rows=87921 loops=3)                                                        Filter: (((classcode)::text = 'CNTRCT'::text) AND ((moodcode)::text = 'EVN'::text) AND ((code_codesystem)::text = '2.16.840.1.113883.3.26.1.1'::text))                                                        Rows Removed by Filter: 1540625                                                  ->  Parallel Hash  (cost=380928.44..380928.44 rows=1 width=74) (actual time=13825.726..13825.726 rows=79 loops=3)                                                        Buckets: 1024  Batches: 1  Memory Usage: 112kB                                                        ->  Parallel Seq Scan on actrelationship r  (cost=0.00..380928.44 rows=1 width=74) (actual time=5289.948..13825.576 rows=79 loops=3)                                                              Filter: ((typecode)::text = 'SUBJ'::text)                                                              Rows Removed by Filter: 3433326              ->  Hash  (cost=2908913.87..2908913.87 rows=34619 width=930) (actual time=244145.322..244145.322 rows=113478127 loops=1)                    Buckets: 8192 (originally 8192)  Batches: 65536 (originally 16)  Memory Usage: 1204250kB                    ->  Gather Merge  (cost=2892141.40..2908913.87 rows=34619 width=930) (actual time=75215.333..145622.427 rows=113478127 loops=1)                          Workers Planned: 2                          Workers Launched: 2                          ->  Merge Left Join  (cost=2891141.37..2903917.96 rows=14425 width=930) (actual time=75132.988..99411.448 rows=37826042 loops=3)                                Merge Cond: (((documentinformationsubject.documentinternalid)::text = (documentinformationsubject_1.documentinternalid)::text) AND ((documentinformationsubject.documentid)::text = (documentinformationsubject_1.documentid)::text) AND ((documentinformationsubject.actinternalid)::text = (documentinformationsubject_1.actinternalid)::text))                                ->  Sort  (cost=1301590.26..1301626.32 rows=14425 width=882) (actual time=39801.337..40975.780 rows=231207 loops=3)                                      Sort Key: documentinformationsubject.documentinternalid, documentinformationsubject.documentid, documentinformationsubject.actinternalidct_1.documentid, documentinformationsubject_1.actinternalid                                            Sort Method: external merge  Disk: 169768kB                                            Worker 0:  Sort Method: external merge  Disk: 169768kB                                            Worker 1:  Sort Method: external merge  Disk: 169768kB                                            ->  Seq Scan on documentinformationsubject documentinformationsubject_1  (cost=0.00..1329868.64 rows=1010585 width=159) (actual time=23401.537..31758.042 rows=1031106 loops=3)                                                  Filter: ((participationtypecode)::text = 'PRD'::text)                                                  Rows Removed by Filter: 7415579Planning Time: 40.559 msExecution Time: 6896581.566 ms
(70 rows)

For the first time this query has succeeded now. Memory was bounded. The time of nearly hours is crazy, but things sometimes take that long. The important thing was not to get an out of memory error.

Thank you. Anything else you want to try, I can do it.

regards,
-Gunther

Re: Out of Memory errors are frustrating as heck!

From

Justin Pryzby

Date:

23 April 2019, 20:43:48

On Tue, Apr 23, 2019 at 04:37:50PM -0400, Gunther wrote:
> On 4/21/2019 23:09, Tomas Vondra wrote:
> >What I think might work better is the attached v2 of the patch, with a
> Thanks for this, and I am trying this now.
...
> Aaaaaand, it's a winner!
> 
> Unique (cost=5551524.36..5554207.33 rows=34619 width=1197) (actual time=6150303.060..6895451.210 rows=435274
loops=1)
> -> Sort (cost=5551524.36..5551610.91 rows=34619 width=1197) (actual time=6150303.058..6801372.192 rows=113478386
loops=1)
 
> Sort Method: external merge Disk: 40726720kB
> 
> For the first time this query has succeeded now. Memory was bounded. The
> time of nearly hours is crazy, but things sometimes take that long

It wrote 40GB tempfiles - perhaps you can increase work_mem now to improve the
query time.

We didn't address it yet, but your issue was partially caused by a misestimate.
It's almost certainly because these conditions are correlated, or maybe
redundant.

> Merge Cond: (((documentinformationsubject.documentinternalid)::text =
(documentinformationsubject_1.documentinternalid)::text)AND ((documentinformationsubject.documentid)::text =
(documentinformationsubject_1.documentid)::text)AND ((documentinformationsubject.actinternalid)::text =
(documentinformationsubject_1.actinternalid)::text))

If they're completely redundant and you can get the same result after dropping
one or two of those conditions, then you should.

Alternately, if they're correlated but not redundant, you can use PG10
"dependency" statistics (CREATE STATISTICS) on the correlated columns (and
ANALYZE).

On Tue, Apr 16, 2019 at 10:24:53PM -0400, Gunther wrote:
>  Hash Right Join  (cost=4203858.53..5475530.71 rows=34619 width=4) (actual time=309603.384..459480.863 rows=113478386
loops=1)
...
>          ->  Hash  (cost=1310249.63..1310249.63 rows=13 width=111) (actual time=51077.049..51077.049 rows=236
loops=1)
...
>    ->  Hash  (cost=2861845.87..2861845.87 rows=34619 width=74) (actual time=199792.446..199792.446 rows=113478127
loops=1)
>          Buckets: 65536 (originally 65536)  Batches: 131072 (originally 2)  Memory Usage: 189207kB
>          ->  Gather Merge  (cost=2845073.40..2861845.87 rows=34619 width=74) (actual time=107620.262..156256.432
rows=113478127loops=1)
 
>                Workers Planned: 2
>                Workers Launched: 2
>                ->  Merge Left Join  (cost=2844073.37..2856849.96 rows=14425 width=74) (actual
time=107570.719..126113.792rows=37826042 loops=3)
 
>                      Merge Cond: (((documentinformationsubject.documentinternalid)::text =
(documentinformationsubject_1.documentinternalid)::text)AND ((documentinformationsubject.documentid)::text =
(documentinformationsubject_1.documentid)::text)AND ((documentinformationsubject.actinternalid)::text =
(documentinformationsubject_1.actinternalid)::text))
>                      ->  Sort  (cost=1295969.26..1296005.32 rows=14425 width=111) (actual time=57700.723..58134.751
rows=231207loops=3)
 
>                            Sort Key: documentinformationsubject.documentinternalid,
documentinformationsubject.documentid,documentinformationsubject.actinternalid
 
>                            Sort Method: external merge  Disk: 26936kB
>                            Worker 0:  Sort Method: external merge  Disk: 27152kB
>                            Worker 1:  Sort Method: external merge  Disk: 28248kB
>                            ->  Parallel Seq Scan on documentinformationsubject  (cost=0.00..1294972.76 rows=14425
width=111)(actual time=24866.656..57424.420 rows=231207 loops=3)
 
>                                  Filter: (((participationtypecode)::text = ANY ('{PPRF,PRF}'::text[])) AND
((classcode)::text= 'ACT'::text) AND ((moodcode)::text = 'DEF'::text) AND ((code_codesystem)::text =
'2.16.840.1.113883.3.26.1.1'::text))
>                                  Rows Removed by Filter: 2584355
>                      ->  Materialize  (cost=1548104.12..1553157.04 rows=1010585 width=111) (actual
time=49869.984..54191.701rows=38060250 loops=3)
 
>                            ->  Sort  (cost=1548104.12..1550630.58 rows=1010585 width=111) (actual
time=49869.980..50832.205rows=1031106 loops=3)
 
>                                  Sort Key: documentinformationsubject_1.documentinternalid,
documentinformationsubject_1.documentid,documentinformationsubject_1.actinternalid
 
>                                  Sort Method: external merge  Disk: 122192kB
>                                  Worker 0:  Sort Method: external merge  Disk: 122192kB
>                                  Worker 1:  Sort Method: external merge  Disk: 122192kB
>                                  ->  Seq Scan on documentinformationsubject documentinformationsubject_1
(cost=0.00..1329868.64rows=1010585 width=111) (actual time=20366.166..47751.267 rows=1031106 loops=3)
 
>                                        Filter: ((participationtypecode)::text = 'PRD'::text)
>                                        Rows Removed by Filter: 7415579

Re: Out of Memory errors are frustrating as heck!

From

Tomas Vondra

Date:

23 April 2019, 21:46:52

On Tue, Apr 23, 2019 at 03:43:48PM -0500, Justin Pryzby wrote:
>On Tue, Apr 23, 2019 at 04:37:50PM -0400, Gunther wrote:
>> On 4/21/2019 23:09, Tomas Vondra wrote:
>> >What I think might work better is the attached v2 of the patch, with a
>> Thanks for this, and I am trying this now.
>...
>> Aaaaaand, it's a winner!
>>
>> Unique (cost=5551524.36..5554207.33 rows=34619 width=1197) (actual time=6150303.060..6895451.210 rows=435274
loops=1)
>> -> Sort (cost=5551524.36..5551610.91 rows=34619 width=1197) (actual time=6150303.058..6801372.192 rows=113478386
loops=1)
>> Sort Method: external merge Disk: 40726720kB
>>
>> For the first time this query has succeeded now. Memory was bounded. The
>> time of nearly hours is crazy, but things sometimes take that long
>
>It wrote 40GB tempfiles - perhaps you can increase work_mem now to improve the
>query time.
>

That's unlikely to reduce the amount of data written to temporary files,
it just means there will be fewer larger files - in total it's still
going to be ~40GB. And it's not guaranteed it'll improve performance,
because work_mem=4MB might fit into CPU caches and larger values almost
certainly won't. I don't think there's much to gain, really.

>We didn't address it yet, but your issue was partially caused by a misestimate.
>It's almost certainly because these conditions are correlated, or maybe
>redundant.
>

Right. Chances are that with a bettwe estimate the optimizer would pick
merge join instead. I wonder if that would be significantly faster.

>> Merge Cond: (((documentinformationsubject.documentinternalid)::text =
>> (documentinformationsubject_1.documentinternalid)::text) AND
>> ((documentinformationsubject.documentid)::text =
>> (documentinformationsubject_1.documentid)::text) AND
>> ((documentinformationsubject.actinternalid)::text =
>> (documentinformationsubject_1.actinternalid)::text))
>
>If they're completely redundant and you can get the same result after
>dropping one or two of those conditions, then you should.
>
>Alternately, if they're correlated but not redundant, you can use PG10
>"dependency" statistics (CREATE STATISTICS) on the correlated columns
>(and ANALYZE).
>

That's not going to help, because we don't use functional dependencies
in join estimation yet.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Out of Memory errors are frustrating as heck!

From

Tomas Vondra

Date:

23 April 2019, 21:59:18

On Tue, Apr 23, 2019 at 04:37:50PM -0400, Gunther wrote:
>   On 4/21/2019 23:09, Tomas Vondra wrote:
>
>     What I think might work better is the attached v2 of the patch, with a
>     single top-level condition, comparing the combined memory usage
>     (spaceUsed + BufFile) against spaceAllowed. But it also tweaks
>     spaceAllowed once the size needed for BufFile gets over work_mem/3.
>
>   Thanks for this, and I am trying this now.
>
>   So far it is promising.
>
>   I see the memory footprint contained under 1 GB. I see it go up, but also
>   down again. CPU, IO, all being live.
>
> foo=# set enable_nestloop to off;
> SET
> foo=# explain analyze select * from reports.v_BusinessOperation;
> WARNING:  ExecHashIncreaseNumBatches: nbatch=32 spaceAllowed=4194304
> WARNING:  ExecHashIncreaseNumBatches: nbatch=64 spaceAllowed=4194304
> WARNING:  ExecHashIncreaseNumBatches: nbatch=128 spaceAllowed=4194304
> WARNING:  ExecHashIncreaseNumBatches: nbatch=256 spaceAllowed=6291456
> WARNING:  ExecHashIncreaseNumBatches: nbatch=512 spaceAllowed=12582912
> WARNING:  ExecHashIncreaseNumBatches: nbatch=1024 spaceAllowed=25165824
> WARNING:  ExecHashIncreaseNumBatches: nbatch=2048 spaceAllowed=50331648
> WARNING:  ExecHashIncreaseNumBatches: nbatch=4096 spaceAllowed=100663296
> WARNING:  ExecHashIncreaseNumBatches: nbatch=8192 spaceAllowed=201326592
> WARNING:  ExecHashIncreaseNumBatches: nbatch=16384 spaceAllowed=402653184
> WARNING:  ExecHashIncreaseNumBatches: nbatch=32768 spaceAllowed=805306368
> WARNING:  ExecHashIncreaseNumBatches: nbatch=65536 spaceAllowed=1610612736
>
>   Aaaaaand, it's a winner!
>

Good ;-)

>  Unique  (cost=5551524.36..5554207.33 rows=34619 width=1197) (actual time=6150303.060..6895451.210 rows=435274
loops=1)
>    ->  Sort  (cost=5551524.36..5551610.91 rows=34619 width=1197) (actual time=6150303.058..6801372.192 rows=113478386
loops=1)
>          Sort Key: ...
>          Sort Method: external merge  Disk: 40726720kB
>          ->  Hash Right Join  (cost=4255031.53..5530808.71 rows=34619 width=1197) (actual
time=325240.679..1044194.775rows=113478386 loops=1)
 
>                Hash Cond: ...
> ...
>  Planning Time: 40.559 ms
>  Execution Time: 6896581.566 ms
> (70 rows)
>
>
>   For the first time this query has succeeded now. Memory was bounded. The
>   time of nearly hours is crazy, but things sometimes take that long. The
>   important thing was not to get an out of memory error.
>

TBH I don't think there's much we can do to improve this further - it's
a rather desperate effort to keep the memory usage as low as possible,
without any real guarantees.

Also, the hash join only takes about 1000 seconds out of the 6900 total.
So even if we got it much faster, the query would still take almost two
hours, give or take.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Out of Memory errors are frustrating as heck!

From

Gunther

Date:

23 April 2019, 23:09:00

On 4/23/2019 16:43, Justin Pryzby wrote:

It wrote 40GB tempfiles - perhaps you can increase work_mem now to improve the
query time.

I now upped my shared_buffers back from 1 to 2GB and work_mem from 4 to 16MB. Need to set vm.overcommit_ratio from 50 to 75 (percent, with vm.overcommit_memory = 2 as it is.)

We didn't address it yet, but your issue was partially caused by a misestimate.
It's almost certainly because these conditions are correlated, or maybe
redundant.

That may be so, but mis-estimates happen. And I can still massively improve this query logically I am sure. In fact it sticks out like a sore thumb, logically it makes no sense to churn over 100 million rows here, but the point is that hopefully PostgreSQL runs stable in such outlier situations, comes back and presents you with 2 hours of work time, 40 GB temp space, or whatever, and then we users can figure out how to make it work better. The frustrating thing it to get out of memory and we not knowing what we can possibly do about it.

From my previous attempt with this tmp_r and tmp_q table, I also know that the Sort/Uniqe step is taking a lot of extra time. I can cut that out too by addressing the causes of the "repeated result" rows. But again, that is all secondary optimizations.

Merge Cond: (((documentinformationsubject.documentinternalid)::text = (documentinformationsubject_1.documentinternalid)::text) AND ((documentinformationsubject.documentid)::text = (documentinformationsubject_1.documentid)::text) AND ((documentinformationsubject.actinternalid)::text = (documentinformationsubject_1.actinternalid)::text))

If they're completely redundant and you can get the same result after dropping
one or two of those conditions, then you should.

I understand. You are saying by reducing the amount of columns in the join condition, somehow you might be able to reduce the size of the hash temporary table?

Alternately, if they're correlated but not redundant, you can use PG10
"dependency" statistics (CREATE STATISTICS) on the correlated columns (and
ANALYZE).

I think documentId and documentInternalId is 1:1 they are both primary / alternate keys. So I could go with only one of them, but since I end up needing both elsewhere inside the query I like to throw them all into the natural join key, so that I don't have to deal with the duplicate result columns.

Now running:

integrator=# set enable_nestloop to off;
SET
integrator=# explain analyze select * from reports.v_BusinessOperation;
WARNING:  ExecHashIncreaseNumBatches: nbatch=8 spaceAllowed=16777216
WARNING:  ExecHashIncreaseNumBatches: nbatch=16 spaceAllowed=16777216
WARNING:  ExecHashIncreaseNumBatches: nbatch=32 spaceAllowed=16777216
WARNING:  ExecHashIncreaseNumBatches: nbatch=64 spaceAllowed=16777216
WARNING:  ExecHashIncreaseNumBatches: nbatch=128 spaceAllowed=16777216
WARNING:  ExecHashIncreaseNumBatches: nbatch=256 spaceAllowed=16777216
WARNING:  ExecHashIncreaseNumBatches: nbatch=512 spaceAllowed=16777216
WARNING:  ExecHashIncreaseNumBatches: nbatch=1024 spaceAllowed=25165824
WARNING:  ExecHashIncreaseNumBatches: nbatch=2048 spaceAllowed=50331648
WARNING:  ExecHashIncreaseNumBatches: nbatch=4096 spaceAllowed=100663296
WARNING:  ExecHashIncreaseNumBatches: nbatch=8192 spaceAllowed=201326592
WARNING:  ExecHashIncreaseNumBatches: nbatch=16384 spaceAllowed=402653184
WARNING:  ExecHashIncreaseNumBatches: nbatch=32768 spaceAllowed=805306368
WARNING:  ExecHashIncreaseNumBatches: nbatch=65536 spaceAllowed=1610612736

I am waiting now, probably for that Sort/Unique to finish I think that the vast majority of the time spent is in this sort

 Unique  (cost=5551524.36..5554207.33 rows=34619 width=1197) (actual time=6150303.060..6895451.210 rows=435274 loops=1)  ->  Sort  (cost=5551524.36..5551610.91 rows=34619 width=1197) (actual time=6150303.058..6801372.192 rows=113478386 loops=1)        Sort Key: documentinformationsubject.documentinternalid, documentinformationsubject.is_current, documentinformationsubject.documentid, documentinformationsubject.documenttypecode, documentinformationsubject.subjectroleinternalid, documentinformationsubject.subjectentityinternalid, documentinformationsubject.subjectentityid, documentinformationsubject.subjectentityidroot, documentinformationsubject.subjectentityname, documentinformationsubject.subjectentitytel, documentinformationsubject.subjectentityemail, documentinformationsubject.otherentityinternalid, documentinformationsubject.confidentialitycode, documentinformationsubject.actinternalid, documentinformationsubject.code_code, documentinformationsubject.code_displayname, q.code_code, q.code_displayname, an.extension, an.root, documentinformationsubject_2.subjectentitycode, documentinformationsubject_2.subjectentitycodesystem, documentinformationsubject_2.effectivetime_low, documentinformationsubject_2.effectivetime_high, documentinformationsubject_2.statuscode, documentinformationsubject_2.code_code, agencyid.extension, agencyname.trivialname, documentinformationsubject_1.subjectentitycode, documentinformationsubject_1.subjectentityinternalid        Sort Method: external merge  Disk: 40726720kB        ->  Hash Right Join  (cost=4255031.53..5530808.71 rows=34619 width=1197) (actual time=325240.679..1044194.775 rows=113478386 loops=1)

isn't it?

Unique/Sort actual time 6,150,303.060 ms = 6,150 s <~ 2 h.
Hash Right Join actual time 325,240.679 ms.

So really all time is wasted in that sort, no need for you guys to worry about anything else with these 2 hours. Tomas just stated the same thing.

Right. Chances are that with a bettwe estimate the optimizer would pick
merge join instead. I wonder if that would be significantly faster.

The prospect of a merge join is interesting here to consider: with the Sort/Unique step taking so long, it seems the Merge Join might also take a lot of time? I see my disks are churning for the most time in this way:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle          7.50    0.00    2.50   89.50    0.00    0.50

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme1n1           0.00     0.00  253.00  131.00    30.15    32.20   332.50     2.01    8.40    8.41    8.37   2.59  99.60
nvme1n1p24        0.00     0.00  253.00  131.00    30.15    32.20   332.50     2.01    8.40    8.41    8.37   2.59  99.60

I.e. 400 IOPS at 60 MB/s half of it read, half of it write. During the previous steps, the hash join presumably, throughput was a lot higher, like 2000 IOPS with 120 MB/s read or write.

But even if the Merge Join would have taken about the same or a little more time than the Hash Join, I wonder, if one could not use that to collapse the Sort/Unique step into that? Like it seems that after the Sort/Merge has completed, one should be able to read Uniqe records without any further sorting? In that case the Merge would be a great advantage.

What I like about the situation now is that with that 4x bigger work_mem, the overall memory situation remains the same. I.e., we are scraping just below 1GB for this process and we see oscillation, growth and shrinkage occurring. So I consider this case closed for me. That doesn't mean I wouldn't be available if you guys want to try anything else about it.

OK, now here is the result with the 16 MB work_mem:

 Unique  (cost=5462874.86..5465557.83 rows=34619 width=1197) (actual time=6283539.282..7003311.451 rows=435274 loops=1)  ->  Sort  (cost=5462874.86..5462961.41 rows=34619 width=1197) (actual time=6283539.280..6908879.456 rows=113478386 loops=1)        Sort Key: documentinformationsubject.documentinternalid, documentinformationsubject.is_current, documentinformationsubject.documentid, documentinformationsubject.documenttypecode, documentinformationsubject.subjectroleinternalid, documentinformati
onsubject.subjectentityinternalid, documentinformationsubject.subjectentityid, documentinformationsubject.subjectentityidroot, documentinformationsubject.subjectentityname, documentinformationsubject.subjectentitytel, documentinformationsubject.subjectenti
tyemail, documentinformationsubject.otherentityinternalid, documentinformationsubject.confidentialitycode, documentinformationsubject.actinternalid, documentinformationsubject.code_code, documentinformationsubject.code_displayname, q.code_code, q.code_disp
layname, an.extension, an.root, documentinformationsubject_2.subjectentitycode, documentinformationsubject_2.subjectentitycodesystem, documentinformationsubject_2.effectivetime_low, documentinformationsubject_2.effectivetime_high, documentinformationsubjec
t_2.statuscode, documentinformationsubject_2.code_code, agencyid.extension, agencyname.trivialname, documentinformationsubject_1.subjectentitycode, documentinformationsubject_1.subjectentityinternalid        Sort Method: external merge  Disk: 40726872kB        ->  Hash Right Join  (cost=4168174.03..5442159.21 rows=34619 width=1197) (actual time=337057.290..1695675.896 rows=113478386 loops=1)              Hash Cond: (((q.documentinternalid)::text = (documentinformationsubject.documentinternalid)::text) AND ((r.targetinternalid)::text = (documentinformationsubject.actinternalid)::text))              ->  Hash Right Join  (cost=1339751.37..2608552.36 rows=13 width=341) (actual time=84109.143..84109.238 rows=236 loops=1)                    Hash Cond: (((documentinformationsubject_2.documentinternalid)::text = (q.documentinternalid)::text) AND ((documentinformationsubject_2.actinternalid)::text = (q.actinternalid)::text))                    ->  Gather  (cost=29501.54..1298302.52 rows=1 width=219) (actual time=43932.534..43936.888 rows=0 loops=1)                          Workers Planned: 2                          Workers Launched: 2                          ->  Parallel Hash Left Join  (cost=28501.54..1297302.42 rows=1 width=219) (actual time=43925.304..43925.304 rows=0 loops=3)                          ...                    ->  Hash  (cost=1310249.63..1310249.63 rows=13 width=233) (actual time=40176.581..40176.581 rows=236 loops=1)                          Buckets: 1024  Batches: 1  Memory Usage: 70kB                          ->  Hash Right Join  (cost=829388.20..1310249.63 rows=13 width=233) (actual time=35925.031..40176.447 rows=236 loops=1)                                Hash Cond: ((an.actinternalid)::text = (q.actinternalid)::text)                                ->  Seq Scan on act_id an  (cost=0.00..425941.04 rows=14645404 width=134) (actual time=1.609..7687.986 rows=14676871 loops=1)                                ->  Hash  (cost=829388.19..829388.19 rows=1 width=136) (actual time=30106.123..30106.123 rows=236 loops=1)                                      Buckets: 1024  Batches: 1  Memory Usage: 63kB                                      ->  Gather  (cost=381928.46..829388.19 rows=1 width=136) (actual time=24786.510..30105.983 rows=236 loops=1)                                      ...              ->  Hash  (cost=2823846.37..2823846.37 rows=34619 width=930) (actual time=252946.367..252946.367 rows=113478127 loops=1)                    Buckets: 32768 (originally 32768)  Batches: 65536 (originally 4)  Memory Usage: 1204250kB                    ->  Gather Merge  (cost=2807073.90..2823846.37 rows=34619 width=930) (actual time=83891.069..153380.040 rows=113478127 loops=1)                          Workers Planned: 2                          Workers Launched: 2                          ->  Merge Left Join  (cost=2806073.87..2818850.46 rows=14425 width=930) (actual time=83861.921..108022.671 rows=37826042 loops=3)                                Merge Cond: (((documentinformationsubject.documentinternalid)::text = (documentinformationsubject_1.documentinternalid)::text) AND ((documentinformationsubject.documentid)::text = (documentinformationsubject_1.documentid)::
text) AND ((documentinformationsubject.actinternalid)::text = (documentinformationsubject_1.actinternalid)::text))                                ->  Sort  (cost=1295969.26..1296005.32 rows=14425 width=882) (actual time=44814.114..45535.398 rows=231207 loops=3)                                      Sort Key: documentinformationsubject.documentinternalid, documentinformationsubject.docum...                          Workers Planned: 2                          Workers Launched: 2                          ->  Merge Left Join  (cost=2806073.87..2818850.46 rows=14425 width=930) (actual time=83861.921..108022.671 rows=37826042 loops=3)                                Merge Cond: (((documentinformationsubject.documentinternalid)::text = (documentinformationsubject_1.documentinternalid)::text) AND ((documentinformationsubject.documentid)::text = (documentinformationsubject_1.documentid)::
text) AND ((documentinformationsubject.actinternalid)::text = (documentinformationsubject_1.actinternalid)::text))                                ->  Sort  (cost=1295969.26..1296005.32 rows=14425 width=882) (actual time=44814.114..45535.398rows=231207 loops=3)                                ...Planning Time: 2.953 msExecution Time: 7004340.091 ms
(70 rows)

There isn't really any big news here. But what matters is that it works.

thanks & regards,
-Gunther Schadow

Re: Out of Memory errors are frustrating as heck!

From

Tomas Vondra

Date:

24 April 2019, 00:36:33

On Tue, Apr 23, 2019 at 07:09:00PM -0400, Gunther wrote:
>   On 4/23/2019 16:43, Justin Pryzby wrote:
>
> It wrote 40GB tempfiles - perhaps you can increase work_mem now to improve the
> query time.
>
>   I now upped my shared_buffers back from 1 to 2GB and work_mem from 4 to
>   16MB. Need to set vm.overcommit_ratio from 50 to 75 (percent, with
>   vm.overcommit_memory = 2 as it is.)
>
> We didn't address it yet, but your issue was partially caused by a misestimate.
> It's almost certainly because these conditions are correlated, or maybe
> redundant.
>
>   That may be so, but mis-estimates happen. And I can still massively
>   improve this query logically I am sure.  In fact it sticks out like a sore
>   thumb, logically it makes no sense to churn over 100 million rows here,
>   but the point is that hopefully PostgreSQL runs stable in such outlier
>   situations, comes back and presents you with 2 hours of work time, 40 GB
>   temp space, or whatever, and then we users can figure out how to make it
>   work better. The frustrating thing it to get out of memory and we not
>   knowing what we can possibly do about it.
>

Sure. And I think the memory balancing algorithm implemented in the v2
patch is a step in that direction. I think we can do better in terms of
memory consumption (essentially keeping it closer to work_mem) but it's
unlikely to be any faster.

In a way this is similar to underestimates in hash aggregate, except
that in that case we don't have any spill-to-disk fallback at all.

>   From my previous attempt with this tmp_r and tmp_q table, I also know that
>   the Sort/Uniqe step is taking  a lot of extra time. I can cut that out too
>   by addressing the causes of the "repeated result" rows. But again, that is
>   all secondary optimizations.
>
> Merge Cond: (((documentinformationsubject.documentinternalid)::text =
(documentinformationsubject_1.documentinternalid)::text)AND ((documentinformationsubject.documentid)::text =
(documentinformationsubject_1.documentid)::text)AND ((documentinformationsubject.actinternalid)::text =
(documentinformationsubject_1.actinternalid)::text))
>
> If they're completely redundant and you can get the same result after dropping
> one or two of those conditions, then you should.
>
>   I understand. You are saying by reducing the amount of columns in the join
>   condition, somehow you might be able to reduce the size of the hash
>   temporary table?
>

No. When estimating the join result size with multiple join clauses, the
optimizer essentially has to compute 

1:    P((x1 = y1) && (x2 = y2) && (x3 = y3))

so it assumes statistical independence of those conditions and splits
that into

2:    P(x1 = y1) * P(x2 = y2) * P(x3 = y3)

But when those conditions are dependent - for example when (x1=y1) means
that ((x2=y2) && (x3=y3)) - this results into significant underestimate.
E.g. let's assume that each of those conditions matches 1/100 rows, but
that essentially x1=x2=x3 and y1=y2=y3. Then (1) is 1/100 but (2) ends
up being 1/1000000, so 10000x off.

Chances are this is what's happenning with the inner side of the hash
join, which is estimated to return 14425 but ends up returning 37826042.

There's one trick you might try, though - using indexes on composite types:

  create table t1 (a int, b int);
  create table t2 (a int, b int);
  
  
  insert into t1 select mod(i,1000), mod(i,1000)
    from generate_series(1,100000) s(i);
  
  insert into t2 select mod(i,1000), mod(i,1000)
    from generate_series(1,100000) s(i);
  
  analyze t1;
  analyze t2;
  
  explain analyze select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b);
  
                                     QUERY PLAN
  --------------------------------------------------------------------
   Merge Join  (cost=19495.72..21095.56 rows=9999 width=16)
               (actual time=183.043..10360.276 rows=10000000 loops=1)
     Merge Cond: ((t1.a = t2.a) AND (t1.b = t2.b))
     ...
  
  create type composite_id as (a int, b int);
  
  create index on t1 (((a,b)::composite_id));
  create index on t2 (((a,b)::composite_id));
  
  analyze t1;
  analyze t2;
  
  explain analyze select * from t1 join t2
    on ((t1.a,t1.b)::composite_id = (t2.a,t2.b)::composite_id);
                                      QUERY PLAN
  --------------------------------------------------------------------------
   Merge Join  (cost=0.83..161674.40 rows=9999460 width=16)
               (actual time=0.020..12726.767 rows=10000000 loops=1)
     Merge Cond: (ROW(t1.a, t1.b)::composite_id = ROW(t2.a, t2.b)::composite_id)

Obviously, that's not exactly free - you have to pay price for the index
creation, maintenance and storage.

> ...
>
>   Unique/Sort actual time   6,150,303.060 ms = 6,150 s <~ 2 h.
>   Hash Right Join actual time 325,240.679 ms.
>
>   So really all time is wasted in that sort, no need for you guys to worry
>   about anything else with these 2 hours.  Tomas just stated the same thing.
>
>     Right. Chances are that with a bettwe estimate the optimizer would pick
>     merge join instead. I wonder if that would be significantly faster.
>
>   The prospect of a merge join is interesting here to consider: with the
>   Sort/Unique step taking so long, it seems the Merge Join might also take a
>   lot of time? I see my disks are churning for the most time in this way:
>
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            7.50    0.00    2.50   89.50    0.00    0.50
>
> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm
%util
> nvme1n1           0.00     0.00  253.00  131.00    30.15    32.20   332.50     2.01    8.40    8.41    8.37   2.59
99.60
> nvme1n1p24        0.00     0.00  253.00  131.00    30.15    32.20   332.50     2.01    8.40    8.41    8.37   2.59
99.60
>
>   I.e. 400 IOPS at 60 MB/s half of it read, half of it write. During the
>   previous steps, the hash join presumably, throughput was a lot higher,
>   like 2000 IOPS with 120 MB/s read or write.
>
>   But even if the Merge Join would have taken about the same or a little
>   more time than the Hash Join, I wonder, if one could not use that to
>   collapse the Sort/Unique step into that? Like it seems that after the
>   Sort/Merge has completed, one should be able to read Uniqe records without
>   any further sorting? In that case the Merge would be a great advantage.
>

Probably not, because there are far more columns in the Unique step. We
might have done something with "incremental sort" but we don't have that
capability yet.

>   What I like about the situation now is that with that 4x bigger work_mem,
>   the overall memory situation remains the same. I.e., we are scraping just
>   below 1GB for this process and we see oscillation, growth and shrinkage
>   occurring. So I consider this case closed for me. That doesn't mean I
>   wouldn't be available if you guys want to try anything else about it.
>
>   OK, now here is the result with the 16 MB work_mem:
>
>  Unique  (cost=5462874.86..5465557.83 rows=34619 width=1197) (actual time=6283539.282..7003311.451 rows=435274
loops=1)
>    ->  Sort  (cost=5462874.86..5462961.41 rows=34619 width=1197) (actual time=6283539.280..6908879.456 rows=113478386
loops=1)
>                                  ...
>  Planning Time: 2.953 ms
>  Execution Time: 7004340.091 ms
> (70 rows)
>
>   There isn't really any big news here. But what matters is that it works.
>

Yeah. Once the hash join outgrows the work_mem, the fallback logick
starts ignoring that in the effort to keep the memory usage minimal.

I still think the idea with an "overflow batch" is worth considering,
because it'd allow us to keep the memory usage within work_mem. And
after getting familiar with the hash join code again (haven't messed
with it since 9.5 or so) I think it should not be all that difficult.
I'll give it a try over the weekend if I get bored for a while.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Out of Memory errors are frustrating as heck!

From

Tomas Vondra

Date:

28 April 2019, 14:19:01

On Wed, Apr 24, 2019 at 02:36:33AM +0200, Tomas Vondra wrote:
>
> ...
>
>I still think the idea with an "overflow batch" is worth considering,
>because it'd allow us to keep the memory usage within work_mem. And
>after getting familiar with the hash join code again (haven't messed
>with it since 9.5 or so) I think it should not be all that difficult.
>I'll give it a try over the weekend if I get bored for a while.
>

OK, so I took a stab at this, and overall it seems to be workable. The
patches I have are nowhere near committable, but I think the approach
works fairly well - the memory is kept in check, and the performance is
comparable to the "ballancing" approach tested before.

To explain it a bit, the idea is that we can compute how many BufFile
structures we can keep in memory - we can't use more than work_mem/2 for
that, because then we'd mostly eliminate space for the actual data. For
example with 4MB, we know we can keep 128 batches - we need 128 for
outer and inner side, so 256 in total, and 256*8kB = 2MB.

And then, we just increase the number of batches but instead of adding
the BufFile entries, we split batches into slices that we can keep in
memory (say, the 128 batches). And we keep BufFiles for the current one
and an "overflow file" for the other slices. After processing a slice,
we simply switch to the next one, and use the overflow file as a temp
file for the first batch - we redistribute it into the other batches in
the slice and another overflow file.

That's what the v3 patch (named 'single overflow file') does. I does
work, but unfortunately it significantly inflates the amount of data
written to temporary files. Assume we need e.g. 1024 batches, but only
128 fit into memory. That means we'll need 8 slices, and during the
first pass we'll handle 1/8 of the data and write 7/8 to the overflow
file.  Then after processing the slice and switching to the next one, we
repeat this dance - 1/8 gets processed, 6/8 written to another overflow
file. So essentially we "forward" about

    7/8 + 6/8 + 5/8 + ... + 1/8 = 28/8 = 3.5

of data between slices, and we need to re-shuffle data in each slice,
which amounts to additional 1x data. That's pretty significant overhead,
as will be clear from the measurements I'll present shortly.

But luckily, there's a simple solution to this - instead of writing the
data into a single overflow file, we can create one overflow file for
each slice. That will leave us with the ~1x of additional writes when
distributing data into batches in the current slice, but it eliminates
the main source of write amplification - awalanche-like forwarding of
data between slices.

This relaxes the memory limit a bit again, because we can't really keep
the number of overflow files constrained by work_mem, but we should only
need few of them (much less than when adding one file per batch right
away). For example with 128 in-memory batches, this reduces the amount
of necessary memory 128x.

And this is what v4 (per-slice overflow file) does, pretty much.

Two more comments, regarding memory accounting in previous patches. It
was a bit broken, because we actually need 2x the number of BufFiles. We
needed nbatch files for outer side and nbatch files for inner side, but
we only considered one of those - both when deciding when to increase
the number of batches / increase spaceAllowed, and when reporting the
memory usage. So with large number of batches the reported amount of
used memory was roughly 1/2 of the actual value :-/

The memory accounting was a bit bogus for another reason - spaceUsed
simply tracks the amount of memory for hash table contents. But at the
end we were simply adding the current space for BufFile stuff, ignoring
the fact that that's likely much larger than when the spacePeak value
got stored. For example we might have kept early spaceUsed when it was
almost work_mem, and then added the final large BufFile allocation.

I've fixed both issues in the patches attached to this message. It does
not make a huge difference in practice, but it makes it easier to
compare values between patches.

Now, some test results - I've repeated the simple test with uniform data
set, which is pretty much ideal for hash joins (no unexlectedly large
batches that can't be split, etc.). I've done this with 1M, 5M, 10M, 25M
and 50M rows in the large table (which gets picked for the "hash" side),
and measured how much memory gets used, how many batches, how long it
takes and how much data gets written to temp files.

See the hashjoin-test.sh script for more details.

So, here are the results with work_mem = 4MB (so the number of in-memory
batches for the last two entries is 128). The columns are:

* nbatch - the final number of batches
* memory - memory usage, as reported by explain analyze
* time - duration of the query (without explain analyze) in seconds
* size - size of the large table
* temp - amount of data written to temp files
* amplif - write amplification (temp / size)

  1M rows
  ===================================================================
                  nbatch  memory   time  size (MB)  temp (MB)  amplif
  -------------------------------------------------------------------
  master             256    7681    3.3        730        899    1.23
  rebalance          256    7711    3.3        730        884    1.21
  single file       1024    4161    7.2        730       3168    4.34
  per-slice file    1024    4161    4.7        730       1653    2.26

  5M rows
  ===================================================================
                  nbatch  memory   time  size (MB)  temp (MB)  amplif
  -------------------------------------------------------------------
  master            2048   36353     22       3652       5276    1.44
  rebalance          512   16515     18       3652       4169    1.14
  single file       4096    4353    156       3652      53897   14.76
  per-slice file    4096    4353     28       3652       8106    2.21

  10M rows
  ===================================================================
                  nbatch  memory   time  size (MB)  temp (MB)  amplif
  -------------------------------------------------------------------
  master            4096   69121     61       7303      10556    1.45
  rebalance          512   24326     46       7303       7405    1.01
  single file       8192    4636    762       7303     211234   28.92
  per-slice file    8192    4636     65       7303      16278    2.23

  25M rows
  ===================================================================
                  nbatch  memory   time  size (MB)  temp (MB)  amplif
  -------------------------------------------------------------------
  master            8192  134657    190       7303      24279    1.33
  rebalance         1024   36611    158       7303      20024    1.10
  single file      16384    6011   4054       7303    1046174   57.32
  per-slice file   16384    6011    207       7303      39073    2.14

  50M rows
  ===================================================================
                  nbatch  memory   time  size (MB)  temp (MB)  amplif
  -------------------------------------------------------------------
  master           16384  265729    531      36500      48519   1.33
  rebalance         2048   53241    447      36500      48077   1.32
  single file          -       -      -      36500          -      -
  per-slice file   32768    8125    451      36500      78662   2.16

From those numbers it's pretty clear that per-slice overflow file does
by far the best job in enforcing work_mem and minimizing the amount of
data spilled to temp files. It does write a bit more data than both
master and the simple rebalancing, but that's the cost for enforcing
work_mem more strictly. It's generally a bit slower than those two
approaches, although on the largest scale it's actually a bit faster
than master. I think that's pretty acceptable, considering this is meant
to address extreme underestimates where we currently just eat memory.

The case with single overflow file performs rather poorly - I haven't
even collected data from the largest scale, but considering it spilled
1TB of temp files with a dataset half the size, that's not an issue.
(Note that this does not mean it needs 1TB of temp space, those writes
are spread over time and the files are created/closed as we go. The
system only has ~100GB of free disk space.)

Gunther, could you try the v2 and v4 patches on your data set? That
would be an interesting data point, I think.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: Out of Memory errors are frustrating as heck!

From

Gunther

Date:

23 August 2019, 13:17:38

Hi all, I am connecting to a discussion back from April this year. My data has grown and now I am running into new out of memory situations. Meanwhile the world turned from 11.2 to 11.5 which I just installed only to find the same out of memory error.

Have any of the things discussed and proposed, especially this last one by Tomas Vondra, been applied to the 11 releases? Should I try these older patches from April?

regards,
-Gunther

For what it is worth, this is what I am getting:

TopMemoryContext: 67424 total in 5 blocks; 7184 free (7 chunks); 60240 used pgstat TabStatusArray lookup hash table: 8192 total in 1 blocks; 416 free (0 chunks); 7776 used TopTransactionContext: 8192 total in 1 blocks; 7720 free (1 chunks); 472 used Operator lookup cache: 24576 total in 2 blocks; 10760 free (3 chunks); 13816 used TableSpace cache: 8192 total in 1 blocks; 2096 free (0 chunks); 6096 used Type information cache: 24352 total in 2 blocks; 2624 free (0 chunks); 21728 used RowDescriptionContext: 8192 total in 1 blocks; 6896 free (0 chunks); 1296 used MessageContext: 8388608 total in 11 blocks; 3094872 free (4 chunks); 5293736 used   JoinRelHashTable: 16384 total in 2 blocks; 5576 free (1 chunks); 10808 used Operator class cache: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used smgr relation table: 32768 total in 3 blocks; 12720 free (8 chunks); 20048 used TransactionAbortContext: 32768 total in 1 blocks; 32512 free (0 chunks); 256 used Portal hash: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used TopPortalContext: 8192 total in 1 blocks; 7664 free (0 chunks); 528 used   PortalContext: 1024 total in 1 blocks; 624 free (0 chunks); 400 used:     ExecutorState: 202528536 total in 19 blocks; 433464 free (12 chunks); 202095072 used       HashTableContext: 8192 total in 1 blocks; 7656 free (0 chunks); 536 used         HashBatchContext: 10615104 total in 261 blocks; 7936 free (0 chunks); 10607168 used       HashTableContext: 8192 total in 1 blocks; 7688 free (1 chunks); 504 used         HashBatchContext: 13079304 total in 336 blocks; 7936 free (0 chunks); 13071368 used       TupleSort main: 49208 total in 3 blocks; 8552 free (7 chunks); 40656 used         Caller tuples: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       Subplan HashTable Temp Context: 1024 total in 1 blocks; 768 free (0 chunks); 256 used       Subplan HashTable Context: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       Subplan HashTable Temp Context: 1024 total in 1 blocks; 768 free (0 chunks); 256 used       Subplan HashTable Context: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       Subplan HashTable Temp Context: 1024 total in 1 blocks; 768 free (0 chunks); 256 used       Subplan HashTable Context: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       Subplan HashTable Temp Context: 1024 total in 1 blocks; 768 free (0 chunks); 256 used       Subplan HashTable Context: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       Subplan HashTable Temp Context: 1024 total in 1 blocks; 768 free (0 chunks); 256 used       Subplan HashTable Context: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7360 free (0 chunks); 832 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 1107296256 total in 142 blocks; 6328 free (101 chunks); 1107289928 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       1 more child contexts containing 8192 total in 1 blocks; 7936 free (0 chunks); 256 used Relcache by OID: 16384 total in 2 blocks; 2472 free (2 chunks); 13912 used CacheMemoryContext: 1113488 total in 14 blocks; 16776 free (0 chunks); 1096712 used   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: docsubjh_sjrcode_ndx   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: docsubjh_sjrclass_ndx   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: docsubjh_scopeiid_ndx   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: docsubjh_dociid_ndx   index info: 4096 total in 3 blocks; 2064 free (2 chunks); 2032 used: role_telecom_idx   index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: role_addr_fkidx   index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: role_id_fkidx   index info: 2048 total in 2 blocks; 696 free (1 chunks); 1352 used: role_id_idx   index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: role_name_fkidx   index info: 4096 total in 3 blocks; 2064 free (2 chunks); 2032 used: entity_telecom_idx   index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: entity_id_fkidx   index info: 2048 total in 2 blocks; 696 free (1 chunks); 1352 used: entity_id_idx   index info: 2048 total in 2 blocks; 624 free (1 chunks); 1424 used: entity_det_code_idx   index info: 4096 total in 3 blocks; 2016 free (2 chunks); 2080 used: entity_code_nodash_idx   index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: entity_pkey   index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: connect_rule_pkey   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: role_context_idx   index info: 2048 total in 2 blocks; 640 free (2 chunks); 1408 used: role_partitions   index info: 2048 total in 2 blocks; 640 free (2 chunks); 1408 used: role_scoper_idx   index info: 2048 total in 2 blocks; 640 free (2 chunks); 1408 used: role_player_idx   index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: role__pkey   index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_toast_2619_index   index info: 2048 total in 2 blocks; 592 free (1 chunks); 1456 used: pg_constraint_conrelid_contypid_conname_index   index info: 2048 total in 2 blocks; 624 free (1 chunks); 1424 used: participation_act_idx   index info: 2048 total in 2 blocks; 624 free (1 chunks); 1424 used: participation_role_idx   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: participation_pkey   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_statistic_ext_relid_index   index info: 2048 total in 2 blocks; 624 free (1 chunks); 1424 used: doc_ndx_internaiddoctype   index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_toast_2618_index   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_index_indrelid_index   relation rules: 827392 total in 104 blocks; 2400 free (1 chunks); 824992 used: v_documentsubjecthistory   index info: 2048 total in 2 blocks; 648 free (2 chunks); 1400 used: pg_db_role_setting_databaseid_rol_index   index info: 2048 total in 2 blocks; 624 free (2 chunks); 1424 used: pg_opclass_am_name_nsp_index   index info: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used: pg_foreign_data_wrapper_name_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_enum_oid_index   index info: 2048 total in 2 blocks; 680 free (2 chunks); 1368 used: pg_class_relname_nsp_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_foreign_server_oid_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_publication_pubname_index   index info: 2048 total in 2 blocks; 592 free (3 chunks); 1456 used: pg_statistic_relid_att_inh_index   index info: 2048 total in 2 blocks; 680 free (2 chunks); 1368 used: pg_cast_source_target_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_language_name_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_transform_oid_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_collation_oid_index   index info: 3072 total in 2 blocks; 1136 free (2 chunks); 1936 used: pg_amop_fam_strat_index   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_index_indexrelid_index   index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_ts_template_tmplname_index   index info: 2048 total in 2 blocks; 704 free (3 chunks); 1344 used: pg_ts_config_map_index   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_opclass_oid_index   index info: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used: pg_foreign_data_wrapper_oid_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_event_trigger_evtname_index   index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_statistic_ext_name_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_publication_oid_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_ts_dict_oid_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_event_trigger_oid_index   index info: 3072 total in 2 blocks; 1216 free (3 chunks); 1856 used: pg_conversion_default_index   index info: 3072 total in 2 blocks; 1136 free (2 chunks); 1936 used: pg_operator_oprname_l_r_n_index   index info: 2048 total in 2 blocks; 680 free (2 chunks); 1368 used: pg_trigger_tgrelid_tgname_index   index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_enum_typid_label_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_ts_config_oid_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_user_mapping_oid_index   index info: 2048 total in 2 blocks; 704 free (3 chunks); 1344 used: pg_opfamily_am_name_nsp_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_foreign_table_relid_index   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_type_oid_index   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_aggregate_fnoid_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_constraint_oid_index   index info: 2048 total in 2 blocks; 680 free (2 chunks); 1368 used: pg_rewrite_rel_rulename_index   index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_ts_parser_prsname_index   index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_ts_config_cfgname_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_ts_parser_oid_index   index info: 2048 total in 2 blocks; 728 free (1 chunks); 1320 used: pg_publication_rel_prrelid_prpubid_index   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_operator_oid_index   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_namespace_nspname_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_ts_template_oid_index   index info: 2048 total in 2 blocks; 624 free (2 chunks); 1424 used: pg_amop_opr_fam_index   index info: 2048 total in 2 blocks; 672 free (3 chunks); 1376 used: pg_default_acl_role_nsp_obj_index   index info: 2048 total in 2 blocks; 704 free (3 chunks); 1344 used: pg_collation_name_enc_nsp_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_publication_rel_oid_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_range_rngtypid_index   index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_ts_dict_dictname_index   index info: 2048 total in 2 blocks; 680 free (2 chunks); 1368 used: pg_type_typname_nsp_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_opfamily_oid_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_statistic_ext_oid_index   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_class_oid_index   index info: 2048 total in 2 blocks; 624 free (2 chunks); 1424 used: pg_proc_proname_args_nsp_index   index info: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used: pg_partitioned_table_partrelid_index   index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_transform_type_lang_index   index info: 2048 total in 2 blocks; 680 free (2 chunks); 1368 used: pg_attribute_relid_attnum_index   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_proc_oid_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_language_oid_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_namespace_oid_index   index info: 3072 total in 2 blocks; 1136 free (2 chunks); 1936 used: pg_amproc_fam_proc_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_foreign_server_name_index   index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_attribute_relid_attnam_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_conversion_oid_index   index info: 2048 total in 2 blocks; 728 free (1 chunks); 1320 used: pg_user_mapping_user_server_index   index info: 2048 total in 2 blocks; 728 free (1 chunks); 1320 used: pg_subscription_rel_srrelid_srsubid_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_sequence_seqrelid_index   index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_conversion_name_nsp_index   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_authid_oid_index   index info: 2048 total in 2 blocks; 728 free (1 chunks); 1320 used: pg_auth_members_member_role_index   10 more child contexts containing 17408 total in 17 blocks; 6080 free (10 chunks); 11328 used WAL record construction: 49768 total in 2 blocks; 6368 free (0 chunks); 43400 used PrivateRefCount: 8192 total in 1 blocks; 2624 free (0 chunks); 5568 used MdSmgr: 8192 total in 1 blocks; 6408 free (0 chunks); 1784 used LOCALLOCK hash: 16384 total in 2 blocks; 4600 free (2 chunks); 11784 used Timezones: 104120 total in 2 blocks; 2624 free (0 chunks); 101496 used ErrorContext: 8192 total in 1 blocks; 7936 free (4 chunks); 256 used
Grand total: 1345345736 bytes in 1209 blocks; 4529600 free (270 chunks); 1340816136 used

On 4/28/2019 10:19, Tomas Vondra wrote:

On Wed, Apr 24, 2019 at 02:36:33AM +0200, Tomas Vondra wrote:

...

I still think the idea with an "overflow batch" is worth considering,
because it'd allow us to keep the memory usage within work_mem. And
after getting familiar with the hash join code again (haven't messed
with it since 9.5 or so) I think it should not be all that difficult.
I'll give it a try over the weekend if I get bored for a while.

OK, so I took a stab at this, and overall it seems to be workable. The
patches I have are nowhere near committable, but I think the approach
works fairly well - the memory is kept in check, and the performance is
comparable to the "ballancing" approach tested before.

To explain it a bit, the idea is that we can compute how many BufFile
structures we can keep in memory - we can't use more than work_mem/2 for
that, because then we'd mostly eliminate space for the actual data. For
example with 4MB, we know we can keep 128 batches - we need 128 for
outer and inner side, so 256 in total, and 256*8kB = 2MB.

And then, we just increase the number of batches but instead of adding
the BufFile entries, we split batches into slices that we can keep in
memory (say, the 128 batches). And we keep BufFiles for the current one
and an "overflow file" for the other slices. After processing a slice,
we simply switch to the next one, and use the overflow file as a temp
file for the first batch - we redistribute it into the other batches in
the slice and another overflow file.

That's what the v3 patch (named 'single overflow file') does. I does
work, but unfortunately it significantly inflates the amount of data
written to temporary files. Assume we need e.g. 1024 batches, but only
128 fit into memory. That means we'll need 8 slices, and during the
first pass we'll handle 1/8 of the data and write 7/8 to the overflow
file. Then after processing the slice and switching to the next one, we
repeat this dance - 1/8 gets processed, 6/8 written to another overflow
file. So essentially we "forward" about

   7/8 + 6/8 + 5/8 + ... + 1/8 = 28/8 = 3.5

of data between slices, and we need to re-shuffle data in each slice,
which amounts to additional 1x data. That's pretty significant overhead,
as will be clear from the measurements I'll present shortly.

But luckily, there's a simple solution to this - instead of writing the
data into a single overflow file, we can create one overflow file for
each slice. That will leave us with the ~1x of additional writes when
distributing data into batches in the current slice, but it eliminates
the main source of write amplification - awalanche-like forwarding of
data between slices.

This relaxes the memory limit a bit again, because we can't really keep
the number of overflow files constrained by work_mem, but we should only
need few of them (much less than when adding one file per batch right
away). For example with 128 in-memory batches, this reduces the amount
of necessary memory 128x.

And this is what v4 (per-slice overflow file) does, pretty much.

Two more comments, regarding memory accounting in previous patches. It
was a bit broken, because we actually need 2x the number of BufFiles. We
needed nbatch files for outer side and nbatch files for inner side, but
we only considered one of those - both when deciding when to increase
the number of batches / increase spaceAllowed, and when reporting the
memory usage. So with large number of batches the reported amount of
used memory was roughly 1/2 of the actual value :-/

The memory accounting was a bit bogus for another reason - spaceUsed
simply tracks the amount of memory for hash table contents. But at the
end we were simply adding the current space for BufFile stuff, ignoring
the fact that that's likely much larger than when the spacePeak value
got stored. For example we might have kept early spaceUsed when it was
almost work_mem, and then added the final large BufFile allocation.

I've fixed both issues in the patches attached to this message. It does
not make a huge difference in practice, but it makes it easier to
compare values between patches.

Now, some test results - I've repeated the simple test with uniform data
set, which is pretty much ideal for hash joins (no unexlectedly large
batches that can't be split, etc.). I've done this with 1M, 5M, 10M, 25M
and 50M rows in the large table (which gets picked for the "hash" side),
and measured how much memory gets used, how many batches, how long it
takes and how much data gets written to temp files.

See the hashjoin-test.sh script for more details.

So, here are the results with work_mem = 4MB (so the number of in-memory
batches for the last two entries is 128). The columns are:

* nbatch - the final number of batches
* memory - memory usage, as reported by explain analyze
* time - duration of the query (without explain analyze) in seconds
* size - size of the large table
* temp - amount of data written to temp files
* amplif - write amplification (temp / size)

1M rows
===================================================================
                 nbatch memory   time size (MB) temp (MB) amplif
-------------------------------------------------------------------
master             256    7681    3.3        730        899    1.23
rebalance          256    7711    3.3        730        884    1.21
single file       1024    4161    7.2        730       3168    4.34
per-slice file    1024    4161    4.7        730       1653    2.26

5M rows
===================================================================
                 nbatch memory   time size (MB) temp (MB) amplif
-------------------------------------------------------------------
master            2048   36353     22       3652       5276    1.44
rebalance          512   16515     18       3652       4169    1.14
single file       4096    4353    156       3652      53897   14.76
per-slice file    4096    4353     28       3652       8106    2.21

10M rows
===================================================================
                 nbatch memory   time size (MB) temp (MB) amplif
-------------------------------------------------------------------
master            4096   69121     61       7303      10556    1.45
rebalance          512   24326     46       7303       7405    1.01
single file       8192    4636    762       7303     211234   28.92
per-slice file    8192    4636     65       7303      16278    2.23

25M rows
===================================================================
                 nbatch memory   time size (MB) temp (MB) amplif
-------------------------------------------------------------------
master            8192 134657    190       7303      24279    1.33
rebalance         1024   36611    158       7303      20024    1.10
single file      16384    6011   4054       7303    1046174   57.32
per-slice file   16384    6011    207       7303      39073    2.14

50M rows
===================================================================
                 nbatch memory   time size (MB) temp (MB) amplif
-------------------------------------------------------------------
master           16384 265729    531      36500      48519   1.33
rebalance         2048   53241    447      36500      48077   1.32
single file          -       -      -      36500          -      -
per-slice file   32768    8125    451      36500      78662   2.16

From those numbers it's pretty clear that per-slice overflow file does
by far the best job in enforcing work_mem and minimizing the amount of
data spilled to temp files. It does write a bit more data than both
master and the simple rebalancing, but that's the cost for enforcing
work_mem more strictly. It's generally a bit slower than those two
approaches, although on the largest scale it's actually a bit faster
than master. I think that's pretty acceptable, considering this is meant
to address extreme underestimates where we currently just eat memory.

The case with single overflow file performs rather poorly - I haven't
even collected data from the largest scale, but considering it spilled
1TB of temp files with a dataset half the size, that's not an issue.
(Note that this does not mean it needs 1TB of temp space, those writes
are spread over time and the files are created/closed as we go. The
system only has ~100GB of free disk space.)

Gunther, could you try the v2 and v4 patches on your data set? That
would be an interesting data point, I think.

regards

Re: Out of Memory errors are frustrating as heck!

From

Gunther

Date:

23 August 2019, 14:19:51

OK, I went back through that old thread, and I noticed an early opinion by a certain Peter <pmc at citylink> who said that I should provision some swap space. Since I had plenty of disk and no other option I tried that. And it did some magic. Here this is a steady state now:

top - 14:07:32 up 103 days,  9:57,  5 users,  load average: 1.33, 1.05, 0.54
Tasks: 329 total,   2 running, 117 sleeping,   0 stopped,   0 zombie
%Cpu(s): 31.0 us, 11.4 sy,  0.0 ni, 35.3 id, 22.3 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  7910376 total,   120524 free,  2174940 used,  5614912 buff/cache
KiB Swap: 16777212 total, 16777212 free,        0 used.  3239724 avail Mem
 PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND5068 postgres  20   0 4352496   4.0g   2.0g R  76.4 52.6   3:01.39 postgres: postgres integrator [local] INSERT 435 root      20   0       0      0      0 S   4.0  0.0  10:52.38 [kswapd0]

and the nice thing is, the backend server process appears to be bounded at 4GB, so there isn't really a "memory leak". And also, the swap space isn't really being used. This may have to do with these vm. sysctl settings, overcommit, etc.

vm.overcommit_memory = 2 -- values are
- 0 -- estimate free memory
- 1 -- always assume there is enough memory
- 2 -- no over-commit allocate only inside the following two parameters
vm.overcommit_kbytes = 0 -- how many kB above swap can be over-committed, EITHER this OR
vm.overcommit_ratio = 50 -- percent of main memory that can be committed over swap,
- with 0 swap, that percent can be committed
- i.e., this of 8 GB, 4 GB are reserved for buffer cache
- not a good idea probably
- at least we should allow 75% committed, i.e., 6 GB of 8 GB, leaving
  - 2 GB of buffer cache
  - 2 GB of shared buffers
  - 4 GB of all other memory

I have vm.overcommit_memory = 2, _kbytes = 0, _ratio = 50. So this means with _ratio = 50 I can commit 50% of memory, 4GB and this is exactly what the server process wants. So with little impact on the available buffer cache I am in a fairly good position now. The swap (that in my case I set at 2 x main memory = 16G) serves as a buffer to smooth out this peak usage without ever actually paging.

I suppose even without swap I could have set vm.overcommit_ratio = 75, and I notice now that I already commented this much (the above bullet points are my own notes.)

Anyway, for now, I am good. Thank you very much.

regards,
-Gunther

On 8/23/2019 9:17, Gunther wrote:

Hi all, I am connecting to a discussion back from April this year. My data has grown and now I am running into new out of memory situations. Meanwhile the world turned from 11.2 to 11.5 which I just installed only to find the same out of memory error.
Have any of the things discussed and proposed, especially this last one by Tomas Vondra, been applied to the 11 releases? Should I try these older patches from April?
regards,
-Gunther
For what it is worth, this is what I am getting:
TopMemoryContext: 67424 total in 5 blocks; 7184 free (7 chunks); 60240 used pgstat TabStatusArray lookup hash table: 8192 total in 1 blocks; 416 free (0 chunks); 7776 used TopTransactionContext: 8192 total in 1 blocks; 7720 free (1 chunks); 472 used Operator lookup cache: 24576 total in 2 blocks; 10760 free (3 chunks); 13816 used TableSpace cache: 8192 total in 1 blocks; 2096 free (0 chunks); 6096 used Type information cache: 24352 total in 2 blocks; 2624 free (0 chunks); 21728 used RowDescriptionContext: 8192 total in 1 blocks; 6896 free (0 chunks); 1296 used MessageContext: 8388608 total in 11 blocks; 3094872 free (4 chunks); 5293736 used   JoinRelHashTable: 16384 total in 2 blocks; 5576 free (1 chunks); 10808 used Operator class cache: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used smgr relation table: 32768 total in 3 blocks; 12720 free (8 chunks); 20048 used TransactionAbortContext: 32768 total in 1 blocks; 32512 free (0 chunks); 256 used Portal hash: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used TopPortalContext: 8192 total in 1 blocks; 7664 free (0 chunks); 528 used   PortalContext: 1024 total in 1 blocks; 624 free (0 chunks); 400 used:     ExecutorState: 202528536 total in 19 blocks; 433464 free (12 chunks); 202095072 used       HashTableContext: 8192 total in 1 blocks; 7656 free (0 chunks); 536 used         HashBatchContext: 10615104 total in 261 blocks; 7936 free (0 chunks); 10607168 used       HashTableContext: 8192 total in 1 blocks; 7688 free (1 chunks); 504 used         HashBatchContext: 13079304 total in 336 blocks; 7936 free (0 chunks); 13071368 used       TupleSort main: 49208 total in 3 blocks; 8552 free (7 chunks); 40656 used         Caller tuples: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       Subplan HashTable Temp Context: 1024 total in 1 blocks; 768 free (0 chunks); 256 used       Subplan HashTable Context: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       Subplan HashTable Temp Context: 1024 total in 1 blocks; 768 free (0 chunks); 256 used       Subplan HashTable Context: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       Subplan HashTable Temp Context: 1024 total in 1 blocks; 768 free (0 chunks); 256 used       Subplan HashTable Context: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       Subplan HashTable Temp Context: 1024 total in 1 blocks; 768 free (0 chunks); 256 used       Subplan HashTable Context: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       Subplan HashTable Temp Context: 1024 total in 1 blocks; 768 free (0 chunks); 256 used       Subplan HashTable Context: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7360 free (0 chunks); 832 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       ExprContext: 1107296256 total in 142 blocks; 6328 free (101 chunks); 1107289928 used       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used       1 more child contexts containing 8192 total in 1 blocks; 7936 free (0 chunks); 256 used Relcache by OID: 16384 total in 2 blocks; 2472 free (2 chunks); 13912 used CacheMemoryContext: 1113488 total in 14 blocks; 16776 free (0 chunks); 1096712 used   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: docsubjh_sjrcode_ndx   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: docsubjh_sjrclass_ndx   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: docsubjh_scopeiid_ndx   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: docsubjh_dociid_ndx   index info: 4096 total in 3 blocks; 2064 free (2 chunks); 2032 used: role_telecom_idx   index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: role_addr_fkidx   index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: role_id_fkidx   index info: 2048 total in 2 blocks; 696 free (1 chunks); 1352 used: role_id_idx   index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: role_name_fkidx   index info: 4096 total in 3 blocks; 2064 free (2 chunks); 2032 used: entity_telecom_idx   index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: entity_id_fkidx   index info: 2048 total in 2 blocks; 696 free (1 chunks); 1352 used: entity_id_idx   index info: 2048 total in 2 blocks; 624 free (1 chunks); 1424 used: entity_det_code_idx   index info: 4096 total in 3 blocks; 2016 free (2 chunks); 2080 used: entity_code_nodash_idx   index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: entity_pkey   index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: connect_rule_pkey   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: role_context_idx   index info: 2048 total in 2 blocks; 640 free (2 chunks); 1408 used: role_partitions   index info: 2048 total in 2 blocks; 640 free (2 chunks); 1408 used: role_scoper_idx   index info: 2048 total in 2 blocks; 640 free (2 chunks); 1408 used: role_player_idx   index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: role__pkey   index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_toast_2619_index   index info: 2048 total in 2 blocks; 592 free (1 chunks); 1456 used: pg_constraint_conrelid_contypid_conname_index   index info: 2048 total in 2 blocks; 624 free (1 chunks); 1424 used: participation_act_idx   index info: 2048 total in 2 blocks; 624 free (1 chunks); 1424 used: participation_role_idx   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: participation_pkey   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_statistic_ext_relid_index   index info: 2048 total in 2 blocks; 624 free (1 chunks); 1424 used: doc_ndx_internaiddoctype   index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_toast_2618_index   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_index_indrelid_index   relation rules: 827392 total in 104 blocks; 2400 free (1 chunks); 824992 used: v_documentsubjecthistory   index info: 2048 total in 2 blocks; 648 free (2 chunks); 1400 used: pg_db_role_setting_databaseid_rol_index   index info: 2048 total in 2 blocks; 624 free (2 chunks); 1424 used: pg_opclass_am_name_nsp_index   index info: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used: pg_foreign_data_wrapper_name_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_enum_oid_index   index info: 2048 total in 2 blocks; 680 free (2 chunks); 1368 used: pg_class_relname_nsp_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_foreign_server_oid_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_publication_pubname_index   index info: 2048 total in 2 blocks; 592 free (3 chunks); 1456 used: pg_statistic_relid_att_inh_index   index info: 2048 total in 2 blocks; 680 free (2 chunks); 1368 used: pg_cast_source_target_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_language_name_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_transform_oid_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_collation_oid_index   index info: 3072 total in 2 blocks; 1136 free (2 chunks); 1936 used: pg_amop_fam_strat_index   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_index_indexrelid_index   index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_ts_template_tmplname_index   index info: 2048 total in 2 blocks; 704 free (3 chunks); 1344 used: pg_ts_config_map_index   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_opclass_oid_index   index info: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used: pg_foreign_data_wrapper_oid_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_event_trigger_evtname_index   index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_statistic_ext_name_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_publication_oid_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_ts_dict_oid_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_event_trigger_oid_index   index info: 3072 total in 2 blocks; 1216 free (3 chunks); 1856 used: pg_conversion_default_index   index info: 3072 total in 2 blocks; 1136 free (2 chunks); 1936 used: pg_operator_oprname_l_r_n_index   index info: 2048 total in 2 blocks; 680 free (2 chunks); 1368 used: pg_trigger_tgrelid_tgname_index   index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_enum_typid_label_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_ts_config_oid_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_user_mapping_oid_index   index info: 2048 total in 2 blocks; 704 free (3 chunks); 1344 used: pg_opfamily_am_name_nsp_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_foreign_table_relid_index   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_type_oid_index   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_aggregate_fnoid_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_constraint_oid_index   index info: 2048 total in 2 blocks; 680 free (2 chunks); 1368 used: pg_rewrite_rel_rulename_index   index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_ts_parser_prsname_index   index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_ts_config_cfgname_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_ts_parser_oid_index   index info: 2048 total in 2 blocks; 728 free (1 chunks); 1320 used: pg_publication_rel_prrelid_prpubid_index   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_operator_oid_index   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_namespace_nspname_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_ts_template_oid_index   index info: 2048 total in 2 blocks; 624 free (2 chunks); 1424 used: pg_amop_opr_fam_index   index info: 2048 total in 2 blocks; 672 free (3 chunks); 1376 used: pg_default_acl_role_nsp_obj_index   index info: 2048 total in 2 blocks; 704 free (3 chunks); 1344 used: pg_collation_name_enc_nsp_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_publication_rel_oid_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_range_rngtypid_index   index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_ts_dict_dictname_index   index info: 2048 total in 2 blocks; 680 free (2 chunks); 1368 used: pg_type_typname_nsp_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_opfamily_oid_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_statistic_ext_oid_index   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_class_oid_index   index info: 2048 total in 2 blocks; 624 free (2 chunks); 1424 used: pg_proc_proname_args_nsp_index   index info: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used: pg_partitioned_table_partrelid_index   index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_transform_type_lang_index   index info: 2048 total in 2 blocks; 680 free (2 chunks); 1368 used: pg_attribute_relid_attnum_index   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_proc_oid_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_language_oid_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_namespace_oid_index   index info: 3072 total in 2 blocks; 1136 free (2 chunks); 1936 used: pg_amproc_fam_proc_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_foreign_server_name_index   index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_attribute_relid_attnam_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_conversion_oid_index   index info: 2048 total in 2 blocks; 728 free (1 chunks); 1320 used: pg_user_mapping_user_server_index   index info: 2048 total in 2 blocks; 728 free (1 chunks); 1320 used: pg_subscription_rel_srrelid_srsubid_index   index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_sequence_seqrelid_index   index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_conversion_name_nsp_index   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_authid_oid_index   index info: 2048 total in 2 blocks; 728 free (1 chunks); 1320 used: pg_auth_members_member_role_index   10 more child contexts containing 17408 total in 17 blocks; 6080 free (10 chunks); 11328 used WAL record construction: 49768 total in 2 blocks; 6368 free (0 chunks); 43400 used PrivateRefCount: 8192 total in 1 blocks; 2624 free (0 chunks); 5568 used MdSmgr: 8192 total in 1 blocks; 6408 free (0 chunks); 1784 used LOCALLOCK hash: 16384 total in 2 blocks; 4600 free (2 chunks); 11784 used Timezones: 104120 total in 2 blocks; 2624 free (0 chunks); 101496 used ErrorContext: 8192 total in 1 blocks; 7936 free (4 chunks); 256 used
Grand total: 1345345736 bytes in 1209 blocks; 4529600 free (270 chunks); 1340816136 used
On 4/28/2019 10:19, Tomas Vondra wrote:
On Wed, Apr 24, 2019 at 02:36:33AM +0200, Tomas Vondra wrote:

...

I still think the idea with an "overflow batch" is worth considering,
because it'd allow us to keep the memory usage within work_mem. And
after getting familiar with the hash join code again (haven't messed
with it since 9.5 or so) I think it should not be all that difficult.
I'll give it a try over the weekend if I get bored for a while.

OK, so I took a stab at this, and overall it seems to be workable. The
patches I have are nowhere near committable, but I think the approach
works fairly well - the memory is kept in check, and the performance is
comparable to the "ballancing" approach tested before.

To explain it a bit, the idea is that we can compute how many BufFile
structures we can keep in memory - we can't use more than work_mem/2 for
that, because then we'd mostly eliminate space for the actual data. For
example with 4MB, we know we can keep 128 batches - we need 128 for
outer and inner side, so 256 in total, and 256*8kB = 2MB.

And then, we just increase the number of batches but instead of adding
the BufFile entries, we split batches into slices that we can keep in
memory (say, the 128 batches). And we keep BufFiles for the current one
and an "overflow file" for the other slices. After processing a slice,
we simply switch to the next one, and use the overflow file as a temp
file for the first batch - we redistribute it into the other batches in
the slice and another overflow file.

That's what the v3 patch (named 'single overflow file') does. I does
work, but unfortunately it significantly inflates the amount of data
written to temporary files. Assume we need e.g. 1024 batches, but only
128 fit into memory. That means we'll need 8 slices, and during the
first pass we'll handle 1/8 of the data and write 7/8 to the overflow
file. Then after processing the slice and switching to the next one, we
repeat this dance - 1/8 gets processed, 6/8 written to another overflow
file. So essentially we "forward" about

   7/8 + 6/8 + 5/8 + ... + 1/8 = 28/8 = 3.5

of data between slices, and we need to re-shuffle data in each slice,
which amounts to additional 1x data. That's pretty significant overhead,
as will be clear from the measurements I'll present shortly.

But luckily, there's a simple solution to this - instead of writing the
data into a single overflow file, we can create one overflow file for
each slice. That will leave us with the ~1x of additional writes when
distributing data into batches in the current slice, but it eliminates
the main source of write amplification - awalanche-like forwarding of
data between slices.

This relaxes the memory limit a bit again, because we can't really keep
the number of overflow files constrained by work_mem, but we should only
need few of them (much less than when adding one file per batch right
away). For example with 128 in-memory batches, this reduces the amount
of necessary memory 128x.

And this is what v4 (per-slice overflow file) does, pretty much.

Two more comments, regarding memory accounting in previous patches. It
was a bit broken, because we actually need 2x the number of BufFiles. We
needed nbatch files for outer side and nbatch files for inner side, but
we only considered one of those - both when deciding when to increase
the number of batches / increase spaceAllowed, and when reporting the
memory usage. So with large number of batches the reported amount of
used memory was roughly 1/2 of the actual value :-/

The memory accounting was a bit bogus for another reason - spaceUsed
simply tracks the amount of memory for hash table contents. But at the
end we were simply adding the current space for BufFile stuff, ignoring
the fact that that's likely much larger than when the spacePeak value
got stored. For example we might have kept early spaceUsed when it was
almost work_mem, and then added the final large BufFile allocation.

I've fixed both issues in the patches attached to this message. It does
not make a huge difference in practice, but it makes it easier to
compare values between patches.

Now, some test results - I've repeated the simple test with uniform data
set, which is pretty much ideal for hash joins (no unexlectedly large
batches that can't be split, etc.). I've done this with 1M, 5M, 10M, 25M
and 50M rows in the large table (which gets picked for the "hash" side),
and measured how much memory gets used, how many batches, how long it
takes and how much data gets written to temp files.

See the hashjoin-test.sh script for more details.

So, here are the results with work_mem = 4MB (so the number of in-memory
batches for the last two entries is 128). The columns are:

* nbatch - the final number of batches
* memory - memory usage, as reported by explain analyze
* time - duration of the query (without explain analyze) in seconds
* size - size of the large table
* temp - amount of data written to temp files
* amplif - write amplification (temp / size)

1M rows
===================================================================
                 nbatch memory   time size (MB) temp (MB) amplif
-------------------------------------------------------------------
master             256    7681    3.3        730        899    1.23
rebalance          256    7711    3.3        730        884    1.21
single file       1024    4161    7.2        730       3168    4.34
per-slice file    1024    4161    4.7        730       1653    2.26

5M rows
===================================================================
                 nbatch memory   time size (MB) temp (MB) amplif
-------------------------------------------------------------------
master            2048   36353     22       3652       5276    1.44
rebalance          512   16515     18       3652       4169    1.14
single file       4096    4353    156       3652      53897   14.76
per-slice file    4096    4353     28       3652       8106    2.21

10M rows
===================================================================
                 nbatch memory   time size (MB) temp (MB) amplif
-------------------------------------------------------------------
master            4096   69121     61       7303      10556    1.45
rebalance          512   24326     46       7303       7405    1.01
single file       8192    4636    762       7303     211234   28.92
per-slice file    8192    4636     65       7303      16278    2.23

25M rows
===================================================================
                 nbatch memory   time size (MB) temp (MB) amplif
-------------------------------------------------------------------
master            8192 134657    190       7303      24279    1.33
rebalance         1024   36611    158       7303      20024    1.10
single file      16384    6011   4054       7303    1046174   57.32
per-slice file   16384    6011    207       7303      39073    2.14

50M rows
===================================================================
                 nbatch memory   time size (MB) temp (MB) amplif
-------------------------------------------------------------------
master           16384 265729    531      36500      48519   1.33
rebalance         2048   53241    447      36500      48077   1.32
single file          -       -      -      36500          -      -
per-slice file   32768    8125    451      36500      78662   2.16

From those numbers it's pretty clear that per-slice overflow file does
by far the best job in enforcing work_mem and minimizing the amount of
data spilled to temp files. It does write a bit more data than both
master and the simple rebalancing, but that's the cost for enforcing
work_mem more strictly. It's generally a bit slower than those two
approaches, although on the largest scale it's actually a bit faster
than master. I think that's pretty acceptable, considering this is meant
to address extreme underestimates where we currently just eat memory.

The case with single overflow file performs rather poorly - I haven't
even collected data from the largest scale, but considering it spilled
1TB of temp files with a dataset half the size, that's not an issue.
(Note that this does not mean it needs 1TB of temp space, those writes
are spread over time and the files are created/closed as we go. The
system only has ~100GB of free disk space.)

Gunther, could you try the v2 and v4 patches on your data set? That
would be an interesting data point, I think.

regards

Re: Out of Memory errors are frustrating as heck!

From

Tom Lane

Date:

23 August 2019, 14:20:31

Gunther <raj@gusw.net> writes:
> Hi all, I am connecting to a discussion back from April this year. My 
> data has grown and now I am running into new out of memory situations. 

It doesn't look like this has much of anything to do with the hash-table
discussion.  The big hog is an ExprContext:

> ExprContext: 1107296256 total in 142 blocks; 6328 free (101 chunks); 
> 1107289928 used

So there's something leaking in there, but this isn't enough info
to guess what.

            regards, tom lane

Re: Out of Memory errors are frustrating as heck!

From

Gunther

Date:

24 August 2019, 15:40:09

Thanks Tom, yes I'd say it's using a lot of memory, but wouldn't call it 
"leak" as it doesn't grow during the 30 min or so that this query runs. 
It explodes to 4GB and then stays flat until done.

Yes, and this time the query is super complicated with many joins and 
tables involved. The query plan has 100 lines. Not easy to share for 
reproduce and I have my issue under control by adding some swap just in 
case. The swap space was never actually used.

thanks,
-Gunther

On 8/23/2019 10:20, Tom Lane wrote:
> Gunther <raj@gusw.net> writes:
>> Hi all, I am connecting to a discussion back from April this year. My
>> data has grown and now I am running into new out of memory situations.
> It doesn't look like this has much of anything to do with the hash-table
> discussion.  The big hog is an ExprContext:
>
>> ExprContext: 1107296256 total in 142 blocks; 6328 free (101 chunks);
>> 1107289928 used
> So there's something leaking in there, but this isn't enough info
> to guess what.
>
>             regards, tom lane

Re: Out of Memory errors are frustrating as heck!

From

Tomas Vondra

Date:

06 October 2019, 21:06:18

On Sat, Aug 24, 2019 at 11:40:09AM -0400, Gunther wrote:
>Thanks Tom, yes I'd say it's using a lot of memory, but wouldn't call 
>it "leak" as it doesn't grow during the 30 min or so that this query 
>runs. It explodes to 4GB and then stays flat until done.
>

Well, the memory context stats you've shared however show this:

total: 1345345736 bytes in 1209 blocks; 4529600 free (270 chunks); 1340816136 used

That's only ~1.3GB, and ~1.1GB of that is the expression context. So
when you say 4GB, when does that happen and can you share stats showing
state at that point?

>Yes, and this time the query is super complicated with many joins and 
>tables involved. The query plan has 100 lines. Not easy to share for 
>reproduce and I have my issue under control by adding some swap just 
>in case. The swap space was never actually used.
>

Still, without the query plan we can hardly do any guesses about what
might be the issue.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Out of Memory errors are frustrating as heck!

From

Tomas Vondra

Date:

06 October 2019, 21:11:31

On Fri, Aug 23, 2019 at 09:17:38AM -0400, Gunther wrote:
>Hi all, I am connecting to a discussion back from April this year. My 
>data has grown and now I am running into new out of memory situations. 
>Meanwhile the world turned from 11.2 to 11.5 which I just installed 
>only to find the same out of memory error.
>

As Tom already said, this seems like a quite independent issue. Next
time it'd be better to share it in a new thread, not to mix it up with
the old discussion.

>Have any of the things discussed and proposed, especially this last 
>one by Tomas Vondra, been applied to the 11 releases? Should I try 
>these older patches from April?
>

Unfortunately, no. We're still discussing what would be the right fix
(it's rather tricky and the patches I shared were way too experimental
for that). But I'm pretty sure whatever we end up doing it's going to be
way too invasive for backpatch. I.e. the older branches will likely have
this issue until EOL.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Out of Memory errors are frustrating as heck!

From

Merlin Moncure

Date:

08 October 2019, 17:44:39

On Sun, Apr 14, 2019 at 3:51 PM Gunther <raj@gusw.net> wrote:
>
> For weeks now, I am banging my head at an "out of memory" situation. There is only one query I am running on an 8 GB
system,whatever I try, I get knocked out on this out of memory. It is extremely impenetrable to understand and fix this
error.I guess I could add a swap file, and then I would have to take the penalty of swapping. But how can I actually
addressan out of memory condition if the system doesn't tell me where it is happening? 
> We can't really see anything too worrisome. There is always lots of memory used by cache, which could have been
mobilized.The only possible explanation I can think of is that in that moment of the crash the memory utilization
suddenlyskyrocketed in less than a second, so that the 2 second vmstat interval wouldn't show it??? Nah. 
>
> I have already much reduced work_mem, which has helped in some other cases before. Now I am going to reduce the
shared_buffersnow, but that seems counter-intuitive because we are sitting on all that cache memory unused! 
>
> Might this be a bug? It feels like a bug. It feels like those out of memory issues should be handled more gracefully
(garbagecollection attempt?) and that somehow there should be more information so the person can do anything about it. 

I kind of agree that nothing according to vmstat suggests you have a
problem.  One thing you left out is the precise mechanics of the
failure; is the database getting nuked by the oom killer?  Do you have
the logs?

*) what are values of shared_buffers and work_mem and maintenance_work_mem?

*) Is this a 32 bit build? (I'm guessing no, but worth asking)

*) I see that you've disabled swap.  Maybe it should be enabled?

*) Can you get the query to run through?  an 'explain analyze' might
point to gross misses in plan; say, sort memory overuse

*) If you're still getting failures, maybe we need to look at sampling
frequency of memory usage.

*) iowait is super high.

*) I see optimization potential in this query; explain analyze would
help here too.

merlin

Re: Out of Memory errors are frustrating as heck!

From

Merlin Moncure

Date:

08 October 2019, 17:50:32

On Tue, Oct 8, 2019 at 12:44 PM Merlin Moncure <mmoncure@gmail.com> wrote:
> On Sun, Apr 14, 2019 at 3:51 PM Gunther <raj@gusw.net> wrote:
> >
> > For weeks now, I am banging my head at an "out of memory" situation. There is only one query I am running on an 8
GBsystem, whatever I try, I get knocked out on this out of memory. It is extremely impenetrable to understand and fix
thiserror. I guess I could add a swap file, and then I would have to take the penalty of swapping. But how can I
actuallyaddress an out of memory condition if the system doesn't tell me where it is happening? 
> > We can't really see anything too worrisome. There is always lots of memory used by cache, which could have been
mobilized.The only possible explanation I can think of is that in that moment of the crash the memory utilization
suddenlyskyrocketed in less than a second, so that the 2 second vmstat interval wouldn't show it??? Nah. 
> >
> > I have already much reduced work_mem, which has helped in some other cases before. Now I am going to reduce the
shared_buffersnow, but that seems counter-intuitive because we are sitting on all that cache memory unused! 
> >
> > Might this be a bug? It feels like a bug. It feels like those out of memory issues should be handled more
gracefully(garbage collection attempt?) and that somehow there should be more information so the person can do anything
aboutit. 
>
> I kind of agree that nothing according to vmstat suggests you have a
> problem.  One thing you left out is the precise mechanics of the
> failure; is the database getting nuked by the oom killer?  Do you have
> the logs?

oops, I missed quite a bit of context upthread.  sorry for repeat noise.

merlin