Out of Memory errors are frustrating as heck! - Mailing list pgsql-performance

For weeks now, I am banging my head at an "out of memory" situation. There is only one query I am running on an 8 GB system, whatever I try, I get knocked out on this out of memory. It is extremely impenetrable to understand and fix this error. I guess I could add a swap file, and then I would have to take the penalty of swapping. But how can I actually address an out of memory condition if the system doesn't tell me where it is happening?

You might want to see the query, but it is a huge plan, and I can't really break this down. It shouldn't matter though. But just so you can get a glimpse here is the plan:

Insert on businessoperation  (cost=5358849.28..5361878.44 rows=34619 width=1197) ->  Unique  (cost=5358849.28..5361532.25 rows=34619 width=1197)       ->  Sort  (cost=5358849.28..5358935.83 rows=34619 width=1197)             Sort Key: documentinformationsubject.documentinternalid, documentinformationsubject.is_current, documentinformationsubject.documentid, documentinformationsubject.documenttypecode, documentinformationsubject.subjectroleinternalid, documentinformationsubject.subjectentityinternalid, documentinformationsubject.subjectentityid, documentinformationsubject.subjectentityidroot, documentinformationsubject.subjectentityname, documentinformationsubject.subjectentitytel, documentinformationsubject.subjectentityemail, documentinformationsubject.otherentityinternalid, documentinformationsubject.confidentialitycode, documentinformationsubject.actinternalid, documentinformationsubject.code_code, documentinformationsubject.code_displayname, q.code_code, q.code_displayname, an.extension, an.root, documentinformationsubject_2.subjectentitycode, documentinformationsubject_2.subjectentitycodesystem, documentinformationsubject_2.effectivetime_low, documentinformationsubject_2.effectivetime_high, documentinformationsubject_2.statuscode, documentinformationsubject_2.code_code, agencyid.extension, agencyname.trivialname, documentinformationsubject_1.subjectentitycode, documentinformationsubject_1.subjectentityinternalid             ->  Nested Loop Left Join  (cost=2998335.54..5338133.63 rows=34619 width=1197)                   Join Filter: (((documentinformationsubject.documentinternalid)::text = (q.documentinternalid)::text) AND ((documentinformationsubject.actinternalid)::text = (r.targetinternalid)::text))                   ->  Merge Left Join  (cost=2998334.98..3011313.54 rows=34619 width=930)                         Merge Cond: (((documentinformationsubject.documentinternalid)::text = (documentinformationsubject_1.documentinternalid)::text) AND ((documentinformationsubject.documentid)::text = (documentinformationsubject_1.documentid)::text) AND ((documentinformationsubject.actinternalid)::text = (documentinformationsubject_1.actinternalid)::text))                         ->  Sort  (cost=1408783.87..1408870.41 rows=34619 width=882)                               Sort Key: documentinformationsubject.documentinternalid, documentinformationsubject.documentid, documentinformationsubject.actinternalid                               ->  Seq Scan on documentinformationsubject  (cost=0.00..1392681.22 rows=34619 width=882)                                     Filter: (((participationtypecode)::text = ANY ('{PPRF,PRF}'::text[])) AND ((classcode)::text = 'ACT'::text) AND ((moodcode)::text = 'DEF'::text) AND ((code_codesystem)::text = '2.16.840.1.113883.3.26.1.1'::text))                         ->  Materialize  (cost=1589551.12..1594604.04 rows=1010585 width=159)                               ->  Sort  (cost=1589551.12..1592077.58 rows=1010585 width=159)                                     Sort Key: documentinformationsubject_1.documentinternalid, documentinformationsubject_1.documentid, documentinformationsubject_1.actinternalid                                     ->  Seq Scan on documentinformationsubject documentinformationsubject_1  (cost=0.00..1329868.64 rows=1010585 width=159)                                           Filter: ((participationtypecode)::text = 'PRD'::text)                   ->  Materialize  (cost=0.56..2318944.31 rows=13 width=341)                         ->  Nested Loop Left Join  (cost=0.56..2318944.24 rows=13 width=341)                               ->  Nested Loop Left Join  (cost=0.00..2318893.27 rows=1 width=281)                                     Join Filter: ((agencyname.entityinternalid)::text = (documentinformationsubject_2.otherentityinternalid)::text)                                     ->  Nested Loop Left Join  (cost=0.00..2286828.33 rows=1 width=291)                                           Join Filter: ((agencyid.entityinternalid)::text = (documentinformationsubject_2.otherentityinternalid)::text)                                           ->  Nested Loop Left Join  (cost=0.00..2284826.24 rows=1 width=239)                                                 Join Filter: (((q.documentinternalid)::text = (documentinformationsubject_2.documentinternalid)::text) AND ((q.actinternalid)::text = (documentinformationsubject_2.actinternalid)::text))                                                 ->  Nested Loop  (cost=0.00..954957.59 rows=1 width=136)                                                       Join Filter: ((q.actinternalid)::text = (r.sourceinternalid)::text)                                                       ->  Seq Scan on actrelationship r  (cost=0.00..456015.26 rows=1 width=74)                                                             Filter: ((typecode)::text = 'SUBJ'::text)                                                       ->  Seq Scan on documentinformation q  (cost=0.00..497440.84 rows=120119 width=99)                                                             Filter: (((classcode)::text = 'CNTRCT'::text) AND ((moodcode)::text = 'EVN'::text) AND ((code_codesystem)::text = '2.16.840.1.113883.3.26.1.1'::text))                                                 ->  Seq Scan on documentinformationsubject documentinformationsubject_2  (cost=0.00..1329868.64 rows=1 width=177)                                                       Filter: ((participationtypecode)::text = 'AUT'::text)                                           ->  Seq Scan on entity_id agencyid  (cost=0.00..1574.82 rows=34182 width=89)                                     ->  Seq Scan on bestname agencyname  (cost=0.00..27066.08 rows=399908 width=64)                               ->  Index Scan using act_id_fkidx on act_id an  (cost=0.56..50.85 rows=13 width=134)                                     Index Cond: ((q.actinternalid)::text = (actinternalid)::text)

I have monitored the activity with vmstat and iostat, and it looks like the memory grabbing happens rapidly after a Sort Merge step. I see in the iostat a heavy read and write activity, which I attribute to a sort-merge step, then that is followed by a sudden spike in write activity, and then the out of memory crash.

procs -----------------------memory---------------------- ---swap-- -----io---- -system-- --------cpu-------- -----timestamp-----r  b         swpd         free         buff        cache   si   so    bi    bo   in   cs  us  sy  id  wa  st                 UTC0  2            0       119344            0      7616672    0    0 11681  3107    9    0   6   1  72  21   0 2019-04-14 16:19:520  2            0       128884            0      7607288    0    0  2712 55386  500  509   3   2  15  80   0 2019-04-14 16:19:540  2            0       116984            0      7619916    0    0   880 59241  548  525   2   2   9  87   0 2019-04-14 16:19:560  2            0       131492            0      7604816    0    0   128 56512  518  401   1   1  12  86   0 2019-04-14 16:19:58...0  2            0       134508            0      7601480    0    0     0 58562  428  353   0   1   4  95   0 2019-04-14 16:21:460  2            0       125360            0      7611320    0    0     0 59392  481  369   0   1  11  89   0 2019-04-14 16:21:480  2            0       122896            0      7612872    0    0     0 58564  431  342   0   1  17  82   0 2019-04-14 16:21:501  1            0       121456            0      7614248    0    0    54 57347  487  399   0   1  13  85   0 2019-04-14 16:21:520  2            0       122820            0      7613324    0    0    12 59964  460  346   0   1  20  79   0 2019-04-14 16:21:540  2            0       120344            0      7616528    0    0  1844 55691  645  676   5   3   6  85   0 2019-04-14 16:21:560  2            0       124900            0      7611404    0    0   936 58261  795 1215   2   3  13  83   0 2019-04-14 16:21:580  2            0       124572            0      7612192    0    0  1096 55340  518  487   1   2   0  97   0 2019-04-14 16:22:000  2            0       123040            0      7612740    0    0   888 57574  573  620   1   2   5  92   0 2019-04-14 16:22:020  2            0       125112            0      7610592    0    0   124 59164  498  480   1   1  13  85   0 2019-04-14 16:22:041  1            0       129440            0      7607592    0    0   568 60196  563  612   2   2   8  88   0 2019-04-14 16:22:060  2            0       124020            0      7612364    0    0     0 58260  629  725   0   1   8  91   0 2019-04-14 16:22:082  1            0       124480            0      7611848    0    0     0 58852  447  331   0   1   1  98   0 2019-04-14 16:22:100  3            0       137636            0      7598484    0    0 11908 44995  619  714   1   1  11  87   0 2019-04-14 16:22:120  2            0       123128            0      7613392    0    0 29888 28901  532  972   1   1  29  68   0 2019-04-14 16:22:140  2            0       126260            0      7609984    0    0 39872 18836  706 1435   1   2  28  70   0 2019-04-14 16:22:160  2            0       130748            0      7605536    0    0 36096 22488  658 1272   2   1   8  89   0 2019-04-14 16:22:18
...
 0  2            0       127216            0      7609192    0    0 29192 29696  472  949   1   1  23  75   0 2019-04-14 16:22:400  2            0       147428            0      7588556    0    0 29120 29696  523  974   1   1  19  79   0 2019-04-14 16:22:420  1            0       120644            0      7615388    0    0 32320 25276  566  998   1   2  49  47   0 2019-04-14 16:22:440  1            0       128456            0      7607904    0    0 58624     0  621 1103   3   2  49  46   0 2019-04-14 16:22:460  1            0       127836            0      7608260    0    0 58624     0  631 1119   3   2  50  46   0 2019-04-14 16:22:480  1            0       126712            0      7609616    0    0 58624     0  616 1110   2   2  50  47   0 2019-04-14 16:22:50
...0  1            0       157408            0      7578060    0    0 58628     0  736 1206   3   3  50  44   0 2019-04-14 16:27:220  1            0       142420            0      7593400    0    0 58688     0  623 1099   1   4  50  45   0 2019-04-14 16:27:240  1            0       247016            0      7488184    0    0 58568     0  649 1113   1   4  50  45   0 2019-04-14 16:27:260  1            0       123232            0      7612088    0    0 58412   215  675 1141   2   3  50  46   0 2019-04-14 16:27:280  2            0       144920            0      7586576    0    0 48376 11046  788 1455   1   5  34  60   0 2019-04-14 16:27:301  1            0       125636            0      7595704    0    0 36736 21381  702 1386   1   4  21  74   0 2019-04-14 16:27:320  3            0       156700            0      7559328    0    0 35556 23364  709 1367   1   3  22  74   0 2019-04-14 16:27:340  2            0       315580            0      7382748    0    0 33608 24731  787 1407   1   5  18  76   0 2019-04-14 16:27:36
...0  2            0       684412            0      6152040    0    0 29832 28356  528  994   1   2  32  66   0 2019-04-14 16:38:040  2            0       563512            0      6272264    0    0 29696 29506  546  987   1   2  32  65   0 2019-04-14 16:38:060  2            0       595488            0      6241068    0    0 27292 30858  549  971   1   2  26  71   0 2019-04-14 16:38:080  2            0       550120            0      6285352    0    0 28844 29696  567  995   1   2  29  68   0 2019-04-14 16:38:101  1            0       432380            0      6402964    0    0 28992 29696  557  979   1   2  37  61   0 2019-04-14 16:38:120  2            0       445796            0      6384412    0    0 26768 32134  628 1029   1   4  27  69   0 2019-04-14 16:38:140  2            0       374972            0      6453592    0    0 28172 30839  529  962   1   2  43  54   0 2019-04-14 16:38:160  2            0       317824            0      6507992    0    0 29172 29386  560 1001   1   3  27  68   0 2019-04-14 16:38:180  3            0       215092            0      6609132    0    0 33116 25210  621 1148   1   3  19  77   0 2019-04-14 16:38:200  2            0       194836            0      6621524    0    0 27786 30959  704 1152   0   5  18  77   0 2019-04-14 16:38:220  3            0       315648            0      6500196    0    0 31434 27226  581 1073   0   3  31  65   0 2019-04-14 16:38:24
 0  2            0       256180            0      6554676    0    0 29828 29017  668 1174   0   4  20  76   0 2019-04-14 16:38:26 <<< CRASH0  1            0       378220            0      6552496    0    0  4348 53686 2210 3816   1   5  46  49   0 2019-04-14 16:38:280  1            0       389888            0      6536296    0    0  2704 56529 2454 4178   0   5  42  52   0 2019-04-14 16:38:300  2            0       923572            0      5998992    0    0  1612 56863 2384 3928   0   6  16  78   0 2019-04-14 16:38:320  0            0       908336            0      6006696    0    0  3584 49280 8961 17334   0  19  39  42   0 2019-04-14 16:38:340  1            0      1306480            0      5607088    0    0   264 63632 18605 37933   3  58  35   4   0 2019-04-14 16:38:362  1            0      1355448            0      5558576    0    0     8 59222 14817 30296   2  46  24  27   0 2019-04-14 16:38:382  2            0      1358224            0      5555884    0    0     0 58544 14226 28331   2  44   3  50   0 2019-04-14 16:38:402  1            0      1446348            0      5468748    0    0     0 58846 14376 29185   2  44  11  42   0 2019-04-14 16:38:420  0            0      2639648            0      4357608    0    0     0 28486 12909 26770   2  44  49   5   0 2019-04-14 16:38:440  0            0      2639524            0      4357800    0    0     0     0  158  154   0   0 100   0   0 2019-04-14 16:38:460  0            0      2687316            0      4309976    0    0     0     0  181  188   0   2  98   0   0 2019-04-14 16:38:480  0            0      2706920            0      4300116    0    0     0   105  137  263   0   0 100   0   0 2019-04-14 16:38:500  0            0      2706672            0      4300232    0    0     0     0  142  204   0   0 100   0   0 2019-04-14 16:38:520  0            0      2815116            0      4191928    0    0     0     0  116  242   0   0 100   0   0 2019-04-14 16:38:540  0            0      2815364            0      4192008    0    0     0     0  116  239   0   0 100   0   0 2019-04-14 16:38:560  0            0      2815116            0      4192164    0    0     0     0  159  236   0   0 100   0   0 2019-04-14 16:38:58

ending after the out of memory crash, that occurred exactly at the marked point 16:38:26.355 UTC.

We can't really see anything too worrisome. There is always lots of memory used by cache, which could have been mobilized. The only possible explanation I can think of is that in that moment of the crash the memory utilization suddenly skyrocketed in less than a second, so that the 2 second vmstat interval wouldn't show it??? Nah.

I have already much reduced work_mem, which has helped in some other cases before. Now I am going to reduce the shared_buffers now, but that seems counter-intuitive because we are sitting on all that cache memory unused!

Might this be a bug? It feels like a bug. It feels like those out of memory issues should be handled more gracefully (garbage collection attempt?) and that somehow there should be more information so the person can do anything about it.

Any ideas?

-Gunther

pgsql-performance by date:

Previous
From: Andreas Kretschmer
Date:
Subject: Re: PostgreSQL upgrade.
Next
From: Tom Lane
Date:
Subject: Re: Out of Memory errors are frustrating as heck!