Thread: Out of Memory errors are frustrating as heck!
For weeks now, I have been banging my head against an "out of memory" situation. There is only one query I am running on an 8 GB system, and whatever I try, I get knocked out by this out-of-memory error. It is extremely hard to understand and fix this error. I guess I could add a swap file, but then I would have to take the penalty of swapping. But how can I actually address an out-of-memory condition if the system doesn't tell me where it is happening?
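I do understand that part of the answer lives in the kernel: Linux decides how much memory it promises before anything fails. The PostgreSQL docs recommend strict overcommit accounting on a dedicated server; this is what I would check, per those docs (a sketch, not yet tried on this box):

    cat /proc/sys/vm/overcommit_memory    # 0 = heuristic overcommit (the usual default)
    # strict accounting, as the PostgreSQL docs suggest for dedicated servers:
    sysctl -w vm.overcommit_memory=2
    sysctl -w vm.overcommit_ratio=90      # percent of RAM available for commit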
You might want to see the query, but it is a huge plan, and I can't really break it down. It shouldn't matter, though. But just so you can get a glimpse, here is the plan:
Insert on businessoperation  (cost=5358849.28..5361878.44 rows=34619 width=1197)
  ->  Unique  (cost=5358849.28..5361532.25 rows=34619 width=1197)
        ->  Sort  (cost=5358849.28..5358935.83 rows=34619 width=1197)
              Sort Key: documentinformationsubject.documentinternalid, documentinformationsubject.is_current, documentinformationsubject.documentid, documentinformationsubject.documenttypecode, documentinformationsubject.subjectroleinternalid, documentinformationsubject.subjectentityinternalid, documentinformationsubject.subjectentityid, documentinformationsubject.subjectentityidroot, documentinformationsubject.subjectentityname, documentinformationsubject.subjectentitytel, documentinformationsubject.subjectentityemail, documentinformationsubject.otherentityinternalid, documentinformationsubject.confidentialitycode, documentinformationsubject.actinternalid, documentinformationsubject.code_code, documentinformationsubject.code_displayname, q.code_code, q.code_displayname, an.extension, an.root, documentinformationsubject_2.subjectentitycode, documentinformationsubject_2.subjectentitycodesystem, documentinformationsubject_2.effectivetime_low, documentinformationsubject_2.effectivetime_high, documentinformationsubject_2.statuscode, documentinformationsubject_2.code_code, agencyid.extension, agencyname.trivialname, documentinformationsubject_1.subjectentitycode, documentinformationsubject_1.subjectentityinternalid
              ->  Nested Loop Left Join  (cost=2998335.54..5338133.63 rows=34619 width=1197)
                    Join Filter: (((documentinformationsubject.documentinternalid)::text = (q.documentinternalid)::text) AND ((documentinformationsubject.actinternalid)::text = (r.targetinternalid)::text))
                    ->  Merge Left Join  (cost=2998334.98..3011313.54 rows=34619 width=930)
                          Merge Cond: (((documentinformationsubject.documentinternalid)::text = (documentinformationsubject_1.documentinternalid)::text) AND ((documentinformationsubject.documentid)::text = (documentinformationsubject_1.documentid)::text) AND ((documentinformationsubject.actinternalid)::text = (documentinformationsubject_1.actinternalid)::text))
                          ->  Sort  (cost=1408783.87..1408870.41 rows=34619 width=882)
                                Sort Key: documentinformationsubject.documentinternalid, documentinformationsubject.documentid, documentinformationsubject.actinternalid
                                ->  Seq Scan on documentinformationsubject  (cost=0.00..1392681.22 rows=34619 width=882)
                                      Filter: (((participationtypecode)::text = ANY ('{PPRF,PRF}'::text[])) AND ((classcode)::text = 'ACT'::text) AND ((moodcode)::text = 'DEF'::text) AND ((code_codesystem)::text = '2.16.840.1.113883.3.26.1.1'::text))
                          ->  Materialize  (cost=1589551.12..1594604.04 rows=1010585 width=159)
                                ->  Sort  (cost=1589551.12..1592077.58 rows=1010585 width=159)
                                      Sort Key: documentinformationsubject_1.documentinternalid, documentinformationsubject_1.documentid, documentinformationsubject_1.actinternalid
                                      ->  Seq Scan on documentinformationsubject documentinformationsubject_1  (cost=0.00..1329868.64 rows=1010585 width=159)
                                            Filter: ((participationtypecode)::text = 'PRD'::text)
                    ->  Materialize  (cost=0.56..2318944.31 rows=13 width=341)
                          ->  Nested Loop Left Join  (cost=0.56..2318944.24 rows=13 width=341)
                                ->  Nested Loop Left Join  (cost=0.00..2318893.27 rows=1 width=281)
                                      Join Filter: ((agencyname.entityinternalid)::text = (documentinformationsubject_2.otherentityinternalid)::text)
                                      ->  Nested Loop Left Join  (cost=0.00..2286828.33 rows=1 width=291)
                                            Join Filter: ((agencyid.entityinternalid)::text = (documentinformationsubject_2.otherentityinternalid)::text)
                                            ->  Nested Loop Left Join  (cost=0.00..2284826.24 rows=1 width=239)
                                                  Join Filter: (((q.documentinternalid)::text = (documentinformationsubject_2.documentinternalid)::text) AND ((q.actinternalid)::text = (documentinformationsubject_2.actinternalid)::text))
                                                  ->  Nested Loop  (cost=0.00..954957.59 rows=1 width=136)
                                                        Join Filter: ((q.actinternalid)::text = (r.sourceinternalid)::text)
                                                        ->  Seq Scan on actrelationship r  (cost=0.00..456015.26 rows=1 width=74)
                                                              Filter: ((typecode)::text = 'SUBJ'::text)
                                                        ->  Seq Scan on documentinformation q  (cost=0.00..497440.84 rows=120119 width=99)
                                                              Filter: (((classcode)::text = 'CNTRCT'::text) AND ((moodcode)::text = 'EVN'::text) AND ((code_codesystem)::text = '2.16.840.1.113883.3.26.1.1'::text))
                                                  ->  Seq Scan on documentinformationsubject documentinformationsubject_2  (cost=0.00..1329868.64 rows=1 width=177)
                                                        Filter: ((participationtypecode)::text = 'AUT'::text)
                                            ->  Seq Scan on entity_id agencyid  (cost=0.00..1574.82 rows=34182 width=89)
                                      ->  Seq Scan on bestname agencyname  (cost=0.00..27066.08 rows=399908 width=64)
                                ->  Index Scan using act_id_fkidx on act_id an  (cost=0.56..50.85 rows=13 width=134)
                                      Index Cond: ((q.actinternalid)::text = (actinternalid)::text)
I have monitored the activity with vmstat and iostat, and it looks like the memory consumption ramps up rapidly after a sort-merge step. iostat shows heavy read and write activity, which I attribute to the sort-merge step; that is followed by a sudden spike in write activity, and then the out-of-memory crash.
procs -----------------------memory---------------------- ---swap-- -----io---- -system-- --------cpu-------- -----timestamp-----
 r  b   swpd    free   buff   cache   si   so    bi    bo    in    cs us sy id wa st                 UTC
 0  2      0  119344      0 7616672    0    0 11681  3107     9     0  6  1 72 21  0 2019-04-14 16:19:52
 0  2      0  128884      0 7607288    0    0  2712 55386   500   509  3  2 15 80  0 2019-04-14 16:19:54
 0  2      0  116984      0 7619916    0    0   880 59241   548   525  2  2  9 87  0 2019-04-14 16:19:56
 0  2      0  131492      0 7604816    0    0   128 56512   518   401  1  1 12 86  0 2019-04-14 16:19:58
...
 0  2      0  134508      0 7601480    0    0     0 58562   428   353  0  1  4 95  0 2019-04-14 16:21:46
 0  2      0  125360      0 7611320    0    0     0 59392   481   369  0  1 11 89  0 2019-04-14 16:21:48
 0  2      0  122896      0 7612872    0    0     0 58564   431   342  0  1 17 82  0 2019-04-14 16:21:50
 1  1      0  121456      0 7614248    0    0    54 57347   487   399  0  1 13 85  0 2019-04-14 16:21:52
 0  2      0  122820      0 7613324    0    0    12 59964   460   346  0  1 20 79  0 2019-04-14 16:21:54
 0  2      0  120344      0 7616528    0    0  1844 55691   645   676  5  3  6 85  0 2019-04-14 16:21:56
 0  2      0  124900      0 7611404    0    0   936 58261   795  1215  2  3 13 83  0 2019-04-14 16:21:58
 0  2      0  124572      0 7612192    0    0  1096 55340   518   487  1  2  0 97  0 2019-04-14 16:22:00
 0  2      0  123040      0 7612740    0    0   888 57574   573   620  1  2  5 92  0 2019-04-14 16:22:02
 0  2      0  125112      0 7610592    0    0   124 59164   498   480  1  1 13 85  0 2019-04-14 16:22:04
 1  1      0  129440      0 7607592    0    0   568 60196   563   612  2  2  8 88  0 2019-04-14 16:22:06
 0  2      0  124020      0 7612364    0    0     0 58260   629   725  0  1  8 91  0 2019-04-14 16:22:08
 2  1      0  124480      0 7611848    0    0     0 58852   447   331  0  1  1 98  0 2019-04-14 16:22:10
 0  3      0  137636      0 7598484    0    0 11908 44995   619   714  1  1 11 87  0 2019-04-14 16:22:12
 0  2      0  123128      0 7613392    0    0 29888 28901   532   972  1  1 29 68  0 2019-04-14 16:22:14
 0  2      0  126260      0 7609984    0    0 39872 18836   706  1435  1  2 28 70  0 2019-04-14 16:22:16
 0  2      0  130748      0 7605536    0    0 36096 22488   658  1272  2  1  8 89  0 2019-04-14 16:22:18
...
 0  2      0  127216      0 7609192    0    0 29192 29696   472   949  1  1 23 75  0 2019-04-14 16:22:40
 0  2      0  147428      0 7588556    0    0 29120 29696   523   974  1  1 19 79  0 2019-04-14 16:22:42
 0  1      0  120644      0 7615388    0    0 32320 25276   566   998  1  2 49 47  0 2019-04-14 16:22:44
 0  1      0  128456      0 7607904    0    0 58624     0   621  1103  3  2 49 46  0 2019-04-14 16:22:46
 0  1      0  127836      0 7608260    0    0 58624     0   631  1119  3  2 50 46  0 2019-04-14 16:22:48
 0  1      0  126712      0 7609616    0    0 58624     0   616  1110  2  2 50 47  0 2019-04-14 16:22:50
...
 0  1      0  157408      0 7578060    0    0 58628     0   736  1206  3  3 50 44  0 2019-04-14 16:27:22
 0  1      0  142420      0 7593400    0    0 58688     0   623  1099  1  4 50 45  0 2019-04-14 16:27:24
 0  1      0  247016      0 7488184    0    0 58568     0   649  1113  1  4 50 45  0 2019-04-14 16:27:26
 0  1      0  123232      0 7612088    0    0 58412   215   675  1141  2  3 50 46  0 2019-04-14 16:27:28
 0  2      0  144920      0 7586576    0    0 48376 11046   788  1455  1  5 34 60  0 2019-04-14 16:27:30
 1  1      0  125636      0 7595704    0    0 36736 21381   702  1386  1  4 21 74  0 2019-04-14 16:27:32
 0  3      0  156700      0 7559328    0    0 35556 23364   709  1367  1  3 22 74  0 2019-04-14 16:27:34
 0  2      0  315580      0 7382748    0    0 33608 24731   787  1407  1  5 18 76  0 2019-04-14 16:27:36
...
 0  2      0  684412      0 6152040    0    0 29832 28356   528   994  1  2 32 66  0 2019-04-14 16:38:04
 0  2      0  563512      0 6272264    0    0 29696 29506   546   987  1  2 32 65  0 2019-04-14 16:38:06
 0  2      0  595488      0 6241068    0    0 27292 30858   549   971  1  2 26 71  0 2019-04-14 16:38:08
 0  2      0  550120      0 6285352    0    0 28844 29696   567   995  1  2 29 68  0 2019-04-14 16:38:10
 1  1      0  432380      0 6402964    0    0 28992 29696   557   979  1  2 37 61  0 2019-04-14 16:38:12
 0  2      0  445796      0 6384412    0    0 26768 32134   628  1029  1  4 27 69  0 2019-04-14 16:38:14
 0  2      0  374972      0 6453592    0    0 28172 30839   529   962  1  2 43 54  0 2019-04-14 16:38:16
 0  2      0  317824      0 6507992    0    0 29172 29386   560  1001  1  3 27 68  0 2019-04-14 16:38:18
 0  3      0  215092      0 6609132    0    0 33116 25210   621  1148  1  3 19 77  0 2019-04-14 16:38:20
 0  2      0  194836      0 6621524    0    0 27786 30959   704  1152  0  5 18 77  0 2019-04-14 16:38:22
 0  3      0  315648      0 6500196    0    0 31434 27226   581  1073  0  3 31 65  0 2019-04-14 16:38:24
 0  2      0  256180      0 6554676    0    0 29828 29017   668  1174  0  4 20 76  0 2019-04-14 16:38:26  <<< CRASH
 0  1      0  378220      0 6552496    0    0  4348 53686  2210  3816  1  5 46 49  0 2019-04-14 16:38:28
 0  1      0  389888      0 6536296    0    0  2704 56529  2454  4178  0  5 42 52  0 2019-04-14 16:38:30
 0  2      0  923572      0 5998992    0    0  1612 56863  2384  3928  0  6 16 78  0 2019-04-14 16:38:32
 0  0      0  908336      0 6006696    0    0  3584 49280  8961 17334  0 19 39 42  0 2019-04-14 16:38:34
 0  1      0 1306480      0 5607088    0    0   264 63632 18605 37933  3 58 35  4  0 2019-04-14 16:38:36
 2  1      0 1355448      0 5558576    0    0     8 59222 14817 30296  2 46 24 27  0 2019-04-14 16:38:38
 2  2      0 1358224      0 5555884    0    0     0 58544 14226 28331  2 44  3 50  0 2019-04-14 16:38:40
 2  1      0 1446348      0 5468748    0    0     0 58846 14376 29185  2 44 11 42  0 2019-04-14 16:38:42
 0  0      0 2639648      0 4357608    0    0     0 28486 12909 26770  2 44 49  5  0 2019-04-14 16:38:44
 0  0      0 2639524      0 4357800    0    0     0     0   158   154  0  0 100 0  0 2019-04-14 16:38:46
 0  0      0 2687316      0 4309976    0    0     0     0   181   188  0  2 98  0  0 2019-04-14 16:38:48
 0  0      0 2706920      0 4300116    0    0     0   105   137   263  0  0 100 0  0 2019-04-14 16:38:50
 0  0      0 2706672      0 4300232    0    0     0     0   142   204  0  0 100 0  0 2019-04-14 16:38:52
 0  0      0 2815116      0 4191928    0    0     0     0   116   242  0  0 100 0  0 2019-04-14 16:38:54
 0  0      0 2815364      0 4192008    0    0     0     0   116   239  0  0 100 0  0 2019-04-14 16:38:56
 0  0      0 2815116      0 4192164    0    0     0     0   159   236  0  0 100 0  0 2019-04-14 16:38:58
The log ends after the out-of-memory crash, which occurred exactly at the marked point, 16:38:26.355 UTC.
We can't really see anything too worrisome there. There is always lots of memory used by cache, which could have been reclaimed. The only possible explanation I can think of is that at the moment of the crash the memory utilization suddenly skyrocketed in less than a second, so that the 2-second vmstat interval wouldn't show it??? Nah.
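One way to rule that theory out would be to sample faster: vmstat can take 1-second samples with a timestamp column (this is the stock procps vmstat; I believe the -t flag is what produced the timestamps above):

    vmstat -t 1 | tee vmstat-1s.log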
I have already reduced work_mem quite a lot, which has helped in some other cases before. Now I am going to reduce shared_buffers too, but that seems counter-intuitive, because we are sitting on all that cache memory unused!
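For reference, this is roughly how I have been checking and changing those settings (the values are the ones mentioned here):

    SHOW work_mem;
    SHOW shared_buffers;
    SET work_mem = '4MB';                      -- per-session override
    ALTER SYSTEM SET shared_buffers = '1GB';   -- takes effect after a restart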
Might this be a bug? It feels like a bug. It feels like these out-of-memory issues should be handled more gracefully (a garbage-collection attempt?), and that somehow there should be more information so that one can actually do something about it.
Any ideas?
-Gunther
Gunther <raj@gusw.net> writes:
> For weeks now, I have been banging my head against an "out of memory"
> situation. There is only one query I am running on an 8 GB system, and
> whatever I try, I get knocked out by this out-of-memory error. It is
> extremely hard to understand and fix this error. I guess I could add a
> swap file, but then I would have to take the penalty of swapping. But
> how can I actually address an out-of-memory condition if the system
> doesn't tell me where it is happening?

> You might want to see the query, but it is a huge plan, and I can't
> really break it down. It shouldn't matter, though. But just so you can
> get a glimpse, here is the plan:

Is that the whole plan?  With just three sorts and two materializes,
it really shouldn't use more than more-or-less 5X work_mem.  What do
you have work_mem set to, anyway?  Is this a 64-bit build of PG?

Also, are the estimated rowcounts shown here anywhere close to what
you expect in reality?  If there are any AFTER INSERT triggers on the
insertion target table, you'd also be accumulating per-row trigger
queue entries ... but if there's only circa 35K rows to be inserted,
it's hard to credit that eating more than a couple hundred KB, either.

> Might this be a bug?

It's conceivable that you've hit some memory-leakage bug, but if so
you haven't provided nearly enough info for anyone else to reproduce
it.  You haven't even shown us the actual error message :-(

https://wiki.postgresql.org/wiki/Guide_to_reporting_problems

			regards, tom lane
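Those checks are quick to run from psql; a sketch (the target table name is taken from the plan above, and the catalog query is the standard way to list non-internal triggers):

    SHOW work_mem;
    SELECT version();    -- the build string shows whether it is a 64-bit build
    -- any user-level triggers on the insertion target table?
    SELECT tgname FROM pg_trigger
     WHERE tgrelid = 'businessoperation'::regclass
       AND NOT tgisinternal;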
> For weeks now, I have been banging my head against an "out of memory" situation. There is only one query I am running on an 8 GB system, and whatever I try, I get knocked out by this out-of-memory error.
>
> You might want to see the query, but it is a huge plan, and I can't really break it down. It shouldn't matter, though. But just so you can get a glimpse, here is the plan:
>
> Insert on businessoperation  (cost=5358849.28..5361878.44 rows=34619 width=1197)
>   ->  Unique  (cost=5358849.28..5361532.25 rows=34619 width=1197)
Thanks for looking at my problem, Tom Lane and Jeff Janes. Sorry for not having given enough detail.

The version is 10.2, the latest. The database was originally built with 10.1 and then just started with 10.2. No dump and reload, no pg_upgrade. The underlying system is 64-bit Amazon Linux (CentOS-like), currently running on an AMD64 VM (m5a).

I said "crash", and that is wrong: no signal, no core dump. It is the ERROR: out of memory. Only the query fails, not the server. Although, I don't know: might the backend process have left a core dump somewhere? Where would that be? Would it help anyone if I started the server with the -c option to get a core dump? I guess I could re-compile with gcc -g debugging symbols all on, then run with that -c option, and then use gdb to find out which line it was failing at and inspect the query plan data structure? Would that be helpful? Does anyone want the core dump to inspect?
The short version is:
Grand total: 1437014672 bytes in 168424 blocks; 11879744 free (3423 chunks); 1425134928 used
2019-04-14 16:38:26.355 UTC [11061] ERROR:  out of memory
2019-04-14 16:38:26.355 UTC [11061] DETAIL:  Failed on request of size 8272 in memory context "ExecutorState".
Here is the out of memory error dump in its full glory.
TopMemoryContext: 2197400 total in 7 blocks; 42952 free (15 chunks); 2154448 used
  TableSpace cache: 8192 total in 1 blocks; 2096 free (0 chunks); 6096 used
  Type information cache: 24352 total in 2 blocks; 2624 free (0 chunks); 21728 used
  pgstat TabStatusArray lookup hash table: 8192 total in 1 blocks; 416 free (0 chunks); 7776 used
  TopTransactionContext: 8192 total in 1 blocks; 7720 free (2 chunks); 472 used
  RowDescriptionContext: 8192 total in 1 blocks; 6896 free (0 chunks); 1296 used
  MessageContext: 2097152 total in 9 blocks; 396480 free (10 chunks); 1700672 used
  Operator class cache: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used
  smgr relation table: 32768 total in 3 blocks; 16832 free (8 chunks); 15936 used
  TransactionAbortContext: 32768 total in 1 blocks; 32512 free (0 chunks); 256 used
  Portal hash: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used
  TopPortalContext: 8192 total in 1 blocks; 7664 free (0 chunks); 528 used
    PortalContext: 1024 total in 1 blocks; 624 free (0 chunks); 400 used:
      ExecutorState: 1416621920 total in 168098 blocks; 8494152 free (3102 chunks); 1408127768 used
        HashTableContext: 8192 total in 1 blocks; 7752 free (0 chunks); 440 used
          HashBatchContext: 57432 total in 3 blocks; 16072 free (6 chunks); 41360 used
        HashTableContext: 8192 total in 1 blocks; 7752 free (0 chunks); 440 used
          HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used
        HashTableContext: 8192 total in 1 blocks; 7624 free (0 chunks); 568 used
          HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used
        TupleSort main: 286912 total in 8 blocks; 246792 free (39 chunks); 40120 used
        TupleSort main: 286912 total in 8 blocks; 246792 free (39 chunks); 40120 used
        HashTableContext: 8454256 total in 6 blocks; 64848 free (32 chunks); 8389408 used
          HashBatchContext: 106640 total in 3 blocks; 7936 free (0 chunks); 98704 used
        TupleSort main: 452880 total in 8 blocks; 126248 free (27 chunks); 326632 used
          Caller tuples: 4194304 total in 10 blocks; 1434888 free (20 chunks); 2759416 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
  Relcache by OID: 16384 total in 2 blocks; 3512 free (2 chunks); 12872 used
  CacheMemoryContext: 1101328 total in 14 blocks; 386840 free (1 chunks); 714488 used
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: businessop_docid_ndx
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: businessop_sbjentityidroot_ndx
    index info: 2048 total in 2 blocks; 704 free (1 chunks); 1344 used: businessop_sbjroleiid_ndx
    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_toast_2619_index
    index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: entity_id_fkidx
    index info: 2048 total in 2 blocks; 696 free (1 chunks); 1352 used: entity_id_idx
    index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: act_id_fkidx
    index info: 2048 total in 2 blocks; 696 free (1 chunks); 1352 used: act_id_idx
    index info: 2048 total in 2 blocks; 592 free (1 chunks); 1456 used: pg_constraint_conrelid_contypid_conname_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: actrelationship_pkey
    index info: 2048 total in 2 blocks; 624 free (1 chunks); 1424 used: actrelationship_target_idx
    index info: 2048 total in 2 blocks; 624 free (1 chunks); 1424 used: actrelationship_source_idx
    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: documentinformation_pk
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_statistic_ext_relid_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: docinfsubj_ndx_seii
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: docinfsubj_ndx_sbjentcodeonly
    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_toast_2618_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_index_indrelid_index
    relation rules: 229376 total in 31 blocks; 5136 free (0 chunks); 224240 used: v_businessoperation
    index info: 2048 total in 2 blocks; 648 free (2 chunks); 1400 used: pg_db_role_setting_databaseid_rol_index
    index info: 2048 total in 2 blocks; 624 free (2 chunks); 1424 used: pg_opclass_am_name_nsp_index
    index info: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used: pg_foreign_data_wrapper_name_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_enum_oid_index
    index info: 2048 total in 2 blocks; 680 free (2 chunks); 1368 used: pg_class_relname_nsp_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_foreign_server_oid_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_publication_pubname_index
    index info: 2048 total in 2 blocks; 592 free (3 chunks); 1456 used: pg_statistic_relid_att_inh_index
    index info: 2048 total in 2 blocks; 680 free (2 chunks); 1368 used: pg_cast_source_target_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_language_name_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_transform_oid_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_collation_oid_index
    index info: 3072 total in 2 blocks; 1136 free (2 chunks); 1936 used: pg_amop_fam_strat_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_index_indexrelid_index
    index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_ts_template_tmplname_index
    index info: 2048 total in 2 blocks; 704 free (3 chunks); 1344 used: pg_ts_config_map_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_opclass_oid_index
    index info: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used: pg_foreign_data_wrapper_oid_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_event_trigger_evtname_index
    index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_statistic_ext_name_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_publication_oid_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_ts_dict_oid_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_event_trigger_oid_index
    index info: 3072 total in 2 blocks; 1216 free (3 chunks); 1856 used: pg_conversion_default_index
    index info: 3072 total in 2 blocks; 1216 free (3 chunks); 1856 used: pg_operator_oprname_l_r_n_index
    index info: 2048 total in 2 blocks; 680 free (2 chunks); 1368 used: pg_trigger_tgrelid_tgname_index
    index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_enum_typid_label_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_ts_config_oid_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_user_mapping_oid_index
    index info: 2048 total in 2 blocks; 704 free (3 chunks); 1344 used: pg_opfamily_am_name_nsp_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_foreign_table_relid_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_type_oid_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_aggregate_fnoid_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_constraint_oid_index
    index info: 2048 total in 2 blocks; 680 free (2 chunks); 1368 used: pg_rewrite_rel_rulename_index
    index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_ts_parser_prsname_index
    index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_ts_config_cfgname_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_ts_parser_oid_index
    index info: 2048 total in 2 blocks; 728 free (1 chunks); 1320 used: pg_publication_rel_prrelid_prpubid_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_operator_oid_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_namespace_nspname_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_ts_template_oid_index
    index info: 2048 total in 2 blocks; 624 free (2 chunks); 1424 used: pg_amop_opr_fam_index
    index info: 2048 total in 2 blocks; 672 free (3 chunks); 1376 used: pg_default_acl_role_nsp_obj_index
    index info: 2048 total in 2 blocks; 704 free (3 chunks); 1344 used: pg_collation_name_enc_nsp_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_publication_rel_oid_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_range_rngtypid_index
    index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_ts_dict_dictname_index
    index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_type_typname_nsp_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_opfamily_oid_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_statistic_ext_oid_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_class_oid_index
    index info: 2048 total in 2 blocks; 704 free (3 chunks); 1344 used: pg_proc_proname_args_nsp_index
    index info: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used: pg_partitioned_table_partrelid_index
    index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_transform_type_lang_index
    index info: 2048 total in 2 blocks; 680 free (2 chunks); 1368 used: pg_attribute_relid_attnum_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_proc_oid_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_language_oid_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_namespace_oid_index
    index info: 3072 total in 2 blocks; 1136 free (2 chunks); 1936 used: pg_amproc_fam_proc_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_foreign_server_name_index
    index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_attribute_relid_attnam_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_conversion_oid_index
    index info: 2048 total in 2 blocks; 728 free (1 chunks); 1320 used: pg_user_mapping_user_server_index
    index info: 2048 total in 2 blocks; 728 free (1 chunks); 1320 used: pg_subscription_rel_srrelid_srsubid_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_sequence_seqrelid_index
    index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_conversion_name_nsp_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_authid_oid_index
    index info: 2048 total in 2 blocks; 728 free (1 chunks); 1320 used: pg_auth_members_member_role_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_subscription_oid_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_tablespace_oid_index
    index info: 2048 total in 2 blocks; 704 free (3 chunks); 1344 used: pg_shseclabel_object_index
    index info: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used: pg_replication_origin_roname_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_database_datname_index
    index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_subscription_subname_index
    index info: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used: pg_replication_origin_roiident_index
    index info: 2048 total in 2 blocks; 728 free (1 chunks); 1320 used: pg_auth_members_role_member_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_database_oid_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_authid_rolname_index
  WAL record construction: 49768 total in 2 blocks; 6368 free (0 chunks); 43400 used
  PrivateRefCount: 8192 total in 1 blocks; 2624 free (0 chunks); 5568 used
  MdSmgr: 8192 total in 1 blocks; 7256 free (1 chunks); 936 used
  LOCALLOCK hash: 16384 total in 2 blocks; 4600 free (2 chunks); 11784 used
  Timezones: 104120 total in 2 blocks; 2624 free (0 chunks); 101496 used
  ErrorContext: 8192 total in 1 blocks; 7936 free (4 chunks); 256 used
Grand total: 1437014672 bytes in 168424 blocks; 11879744 free (3423 chunks); 1425134928 used
2019-04-14 16:38:26.355 UTC [11061] ERROR:  out of memory
2019-04-14 16:38:26.355 UTC [11061] DETAIL:  Failed on request of size 8272 in memory context "ExecutorState".
-Gunther
On Sun, Apr 14, 2019 at 05:19:11PM -0400, Tom Lane wrote:
! Gunther <raj@gusw.net> writes:
! > For weeks now, I have been banging my head against an "out of memory"
! > situation. There is only one query I am running on an 8 GB system, and
! > whatever I try, I get knocked out by this out-of-memory error. It is
! > extremely hard to understand and fix this error. I guess I could add a
! > swap file, but then I would have to take the penalty of swapping. But
! > how can I actually address an out-of-memory condition if the system
! > doesn't tell me where it is happening?

Well, exactly: with a swap space. No offense intended, but if You don't have a swap space, You should not complain about unintelligible out-of-memory situations.

Swap space is not usually used to run applications from (that would indeed give horrible performance); it is used to avoid out-of-memory errors. With a swap space, the out-of-memory situation will persist, and so one has time to take measurements and analyze system behaviour, and from that one can better understand what is causing the problem and decide what actions should be taken, on an informed basis (e.g. correct flaws in the system tuning, fix the bad query, buy more memory, or whatever may be applicable).

If I remember correctly, I did even see a DTRACE flag in my build, so what more is to wish? :))

P.
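Setting one up takes a minute (the usual Linux recipe; the 4 GB size is arbitrary):

    dd if=/dev/zero of=/swapfile bs=1M count=4096   # or: fallocate -l 4G /swapfile
    chmod 600 /swapfile
    mkswap /swapfile
    swapon /swapfile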
On Sun, Apr 14, 2019 at 09:05:48PM -0400, Gunther wrote:
> Thanks for looking at my problem, Tom Lane and Jeff Janes. Sorry for
> not having given enough detail.
>
> The version is 10.2, the latest.

v10.7 is available; could you upgrade ?

What are these set to ?  shared_buffers ?  work_mem ?

Was postgres locally compiled, packaged by distribution, or PGDG RPM/DEB ?

Can you show \d businessoperation ?

> The short version is:
>
> Grand total: 1437014672 bytes in 168424 blocks; 11879744 free (3423 chunks); 1425134928 used
> 2019-04-14 16:38:26.355 UTC [11061] ERROR:  out of memory
> 2019-04-14 16:38:26.355 UTC [11061] DETAIL:  Failed on request of size 8272 in memory context "ExecutorState".

Could you rerun the query with \set VERBOSITY verbose to show the file/line that's failing ?

If you wanted to show a stack trace, you could attach gdb to the PID from SELECT pg_backend_pid(), "b"reak on errdetail, run the query, and then "bt" when it fails.

Justin
> Thanks for looking at my problem, Tom Lane and Jeff Janes. Sorry for not having given enough detail.
>
> The version is 10.2, the latest. The database was originally built with 10.1 and then just started with 10.2.
>
> I said "crash", and that is wrong: no signal, no core dump. It is the ERROR: out of memory. Only the query fails, not the server. Although, I don't know: might the backend process have left a core dump somewhere?
>
> The short version is:
>
> Grand total: 1437014672 bytes in 168424 blocks; 11879744 free (3423 chunks); 1425134928 used
> 2019-04-14 16:38:26.355 UTC [11061] ERROR:  out of memory
> 2019-04-14 16:38:26.355 UTC [11061] DETAIL:  Failed on request of size 8272 in memory context "ExecutorState".
>
> ExecutorState: 1416621920 total in 168098 blocks; 8494152 free (3102 chunks); 1408127768 used
>   HashTableContext: 8192 total in 1 blocks; 7752 free (0 chunks); 440 used
>     HashBatchContext: 57432 total in 3 blocks; 16072 free (6 chunks); 41360 used
>> The version is 10.2, the latest.
> v10.7 is available; could you upgrade ?

Sorry, I meant 11.2, actually the latest.
> What are these set to ?  shared_buffers ?  work_mem ?
shared_buffers=2GB (of 8 GB total), then 1GB; didn't help.
work_mem=4MB by now. (I had once been successful in avoiding an out-of-memory error by reducing work_mem from 64MB to 8MB.) But as Tom Lane says, it shouldn't be using more than 5 x work_mem in this query plan.
Jeff Janes said:
> I don't know why a 8GB system with a lot of cache that could be evicted
> would get an OOM when something using 1.5GB asks for 8272 bytes more.
> But that is a question of how the kernel works, rather than how
> PostgreSQL works. But I also think the log you quote above belongs to a
> different event than the vmstat trace in your first email.

and I agree, except that the vmstat log and the error really belong together: same timestamp. Nothing else was running on that machine this Sunday. Yes, I ran this several times with different parameters, so some mix-up is possible, but it always ends in the same crash anyway. So here it is again, without the vmstat log, which really wouldn't be any different from what I showed you. (See below for the ENABLE_NESTLOOP=off setting; not having those settings the same between explain and actual execution might account for the discrepancy that you saw.)
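On the kernel side, the difference between the OOM killer striking and malloc simply failing can be checked after the fact with standard Linux tooling (nothing here is specific to my box):

    dmesg | grep -i -E 'out of memory|oom'    # did the OOM killer fire at all?
    free -m                                   # swap and cache at a glance
    grep -E '^Commit' /proc/meminfo           # CommitLimit vs Committed_AS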
integrator=# SET ENABLE_NESTLOOP TO OFF;
SET
integrator=# \set VERBOSITY verbose
integrator=# explain INSERT INTO reports.BusinessOperation SELECT * FROM reports.v_BusinessOperation;
integrator=# \pset pager off
Pager usage is off.
integrator=# \pset format unaligned
Output format is unaligned.
integrator=# explain INSERT INTO reports.BusinessOperation SELECT * FROM reports.v_BusinessOperation;
QUERY PLAN
Insert on businessoperation  (cost=5850091.58..5853120.74 rows=34619 width=1197)
  ->  Unique  (cost=5850091.58..5852774.55 rows=34619 width=1197)
        ->  Sort  (cost=5850091.58..5850178.13 rows=34619 width=1197)
              Sort Key: documentinformationsubject.documentinternalid, documentinformationsubject.is_current, documentinformationsubject.documentid, documentinformationsubject.documenttypecode, documentinformationsubject.subjectroleinternalid, documentinformationsubject.subjectentityinternalid, documentinformationsubject.subjectentityid, documentinformationsubject.subjectentityidroot, documentinformationsubject.subjectentityname, documentinformationsubject.subjectentitytel, documentinformationsubject.subjectentityemail, documentinformationsubject.otherentityinternalid, documentinformationsubject.confidentialitycode, documentinformationsubject.actinternalid, documentinformationsubject.code_code, documentinformationsubject.code_displayname, q.code_code, q.code_displayname, an.extension, an.root, documentinformationsubject_2.subjectentitycode, documentinformationsubject_2.subjectentitycodesystem, documentinformationsubject_2.effectivetime_low, documentinformationsubject_2.effectivetime_high, documentinformationsubject_2.statuscode, documentinformationsubject_2.code_code, agencyid.extension, agencyname.trivialname, documentinformationsubject_1.subjectentitycode, documentinformationsubject_1.subjectentityinternalid
              ->  Hash Right Join  (cost=4489522.06..5829375.93 rows=34619 width=1197)
                    Hash Cond: (((q.documentinternalid)::text = (documentinformationsubject.documentinternalid)::text) AND ((r.targetinternalid)::text = (documentinformationsubject.actinternalid)::text))
                    ->  Hash Right Join  (cost=1473632.24..2808301.92 rows=13 width=341)
                          Hash Cond: (((documentinformationsubject_2.documentinternalid)::text = (q.documentinternalid)::text) AND ((documentinformationsubject_2.actinternalid)::text = (q.actinternalid)::text))
                          ->  Hash Left Join  (cost=38864.03..1373533.69 rows=1 width=219)
                                Hash Cond: ((documentinformationsubject_2.otherentityinternalid)::text = (agencyname.entityinternalid)::text)
                                ->  Hash Left Join  (cost=2503.10..1332874.75 rows=1 width=229)
                                      Hash Cond: ((documentinformationsubject_2.otherentityinternalid)::text = (agencyid.entityinternalid)::text)
                                      ->  Seq Scan on documentinformationsubject documentinformationsubject_2  (cost=0.00..1329868.64 rows=1 width=177)
                                            Filter: ((participationtypecode)::text = 'AUT'::text)
                                      ->  Hash  (cost=1574.82..1574.82 rows=34182 width=89)
                                            ->  Seq Scan on entity_id agencyid  (cost=0.00..1574.82 rows=34182 width=89)
                                ->  Hash  (cost=27066.08..27066.08 rows=399908 width=64)
                                      ->  Seq Scan on bestname agencyname  (cost=0.00..27066.08 rows=399908 width=64)
                          ->  Hash  (cost=1434768.02..1434768.02 rows=13 width=233)
                                ->  Hash Right Join  (cost=953906.58..1434768.02 rows=13 width=233)
                                      Hash Cond: ((an.actinternalid)::text = (q.actinternalid)::text)
                                      ->  Seq Scan on act_id an  (cost=0.00..425941.04 rows=14645404 width=134)
                                      ->  Hash  (cost=953906.57..953906.57 rows=1 width=136)
                                            ->  Hash Join  (cost=456015.28..953906.57 rows=1 width=136)
                                                  Hash Cond: ((q.actinternalid)::text = (r.sourceinternalid)::text)
                                                  ->  Seq Scan on documentinformation q  (cost=0.00..497440.84 rows=120119 width=99)
                                                        Filter: (((classcode)::text = 'CNTRCT'::text) AND ((moodcode)::text = 'EVN'::text) AND ((code_codesystem)::text = '2.16.840.1.113883.3.26.1.1'::text))
                                                  ->  Hash  (cost=456015.26..456015.26 rows=1 width=74)
                                                        ->  Seq Scan on actrelationship r  (cost=0.00..456015.26 rows=1 width=74)
                                                              Filter: ((typecode)::text = 'SUBJ'::text)
                    ->  Hash  (cost=3011313.54..3011313.54 rows=34619 width=930)
                          ->  Merge Left Join  (cost=2998334.98..3011313.54 rows=34619 width=930)
                                Merge Cond: (((documentinformationsubject.documentinternalid)::text = (documentinformationsubject_1.documentinternalid)::text) AND ((documentinformationsubject.documentid)::text = (documentinformationsubject_1.documentid)::text) AND ((documentinformationsubject.actinternalid)::text = (documentinformationsubject_1.actinternalid)::text))
                                ->  Sort  (cost=1408783.87..1408870.41 rows=34619 width=882)
                                      Sort Key: documentinformationsubject.documentinternalid, documentinformationsubject.documentid, documentinformationsubject.actinternalid
                                      ->  Seq Scan on documentinformationsubject  (cost=0.00..1392681.22 rows=34619 width=882)
                                            Filter: (((participationtypecode)::text = ANY ('{PPRF,PRF}'::text[])) AND ((classcode)::text = 'ACT'::text) AND ((moodcode)::text = 'DEF'::text) AND ((code_codesystem)::text = '2.16.840.1.113883.3.26.1.1'::text))
                                ->  Materialize  (cost=1589551.12..1594604.04 rows=1010585 width=159)
                                      ->  Sort  (cost=1589551.12..1592077.58 rows=1010585 width=159)
                                            Sort Key: documentinformationsubject_1.documentinternalid, documentinformationsubject_1.documentid, documentinformationsubject_1.actinternalid
                                            ->  Seq Scan on documentinformationsubject documentinformationsubject_1  (cost=0.00..1329868.64 rows=1010585 width=159)
                                                  Filter: ((participationtypecode)::text = 'PRD'::text)
and here is the error memory status dump (I hope the formatting helps a bit to lighten this massive amount of data):
TopMemoryContext: 4294552 total in 7 blocks; 42952 free (15 chunks); 4251600 used
  TableSpace cache: 8192 total in 1 blocks; 2096 free (0 chunks); 6096 used
  Type information cache: 24352 total in 2 blocks; 2624 free (0 chunks); 21728 used
  pgstat TabStatusArray lookup hash table: 8192 total in 1 blocks; 416 free (0 chunks); 7776 used
  TopTransactionContext: 8192 total in 1 blocks; 7720 free (2 chunks); 472 used
  RowDescriptionContext: 8192 total in 1 blocks; 6896 free (0 chunks); 1296 used
  MessageContext: 2097152 total in 9 blocks; 396480 free (10 chunks); 1700672 used
  Operator class cache: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used
  smgr relation table: 32768 total in 3 blocks; 16832 free (8 chunks); 15936 used
  TransactionAbortContext: 32768 total in 1 blocks; 32512 free (0 chunks); 256 used
  Portal hash: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used
  TopPortalContext: 8192 total in 1 blocks; 7664 free (0 chunks); 528 used
    PortalContext: 1024 total in 1 blocks; 624 free (0 chunks); 400 used:
      ExecutorState: 2234123384 total in 266261 blocks; 3782328 free (17244 chunks); 2230341056 used
        HashTableContext: 8192 total in 1 blocks; 7752 free (0 chunks); 440 used
          HashBatchContext: 57432 total in 3 blocks; 16072 free (6 chunks); 41360 used
        HashTableContext: 8192 total in 1 blocks; 7752 free (0 chunks); 440 used
          HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used
        HashTableContext: 8192 total in 1 blocks; 7624 free (0 chunks); 568 used
          HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used
        TupleSort main: 286912 total in 8 blocks; 246792 free (39 chunks); 40120 used
        TupleSort main: 286912 total in 8 blocks; 246792 free (39 chunks); 40120 used
        HashTableContext: 8454256 total in 6 blocks; 64848 free (32 chunks); 8389408 used
          HashBatchContext: 100711712 total in 3065 blocks; 7936 free (0 chunks); 100703776 used
        TupleSort main: 452880 total in 8 blocks; 126248 free (27 chunks); 326632 used
          Caller tuples: 1048576 total in 8 blocks; 21608 free (14 chunks); 1026968 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
  Relcache by OID: 16384 total in 2 blocks; 3512 free (2 chunks); 12872 used
  CacheMemoryContext: 1101328 total in 14 blocks; 386840 free (1 chunks); 714488 used
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: businessop_docid_ndx
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: businessop_sbjentityidroot_ndx
    index info: 2048 total in 2 blocks; 704 free (1 chunks); 1344 used: businessop_sbjroleiid_ndx
    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_toast_2619_index
    index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: entity_id_fkidx
    index info: 2048 total in 2 blocks; 696 free (1 chunks); 1352 used: entity_id_idx
    index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: act_id_fkidx
    index info: 2048 total in 2 blocks; 696 free (1 chunks); 1352 used: act_id_idx
    index info: 2048 total in 2 blocks; 592 free (1 chunks); 1456 used: pg_constraint_conrelid_contypid_conname_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: actrelationship_pkey
    index info: 2048 total in 2 blocks; 624 free (1 chunks); 1424 used: actrelationship_target_idx
    index info: 2048 total in 2 blocks; 624 free (1 chunks); 1424 used: actrelationship_source_idx
    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: documentinformation_pk
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_statistic_ext_relid_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: docinfsubj_ndx_seii
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: docinfsubj_ndx_sbjentcodeonly
    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_toast_2618_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_index_indrelid_index
    relation rules: 229376 total in 31 blocks; 5136 free (0 chunks); 224240 used: v_businessoperation
    index info: 2048 total in 2 blocks; 648 free (2 chunks); 1400 used: pg_db_role_setting_databaseid_rol_index
    index info: 2048 total in 2 blocks; 624 free (2 chunks); 1424 used: pg_opclass_am_name_nsp_index
    index info: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used: pg_foreign_data_wrapper_name_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_enum_oid_index
    index info: 2048 total in 2 blocks; 680 free (2 chunks); 1368 used: pg_class_relname_nsp_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_foreign_server_oid_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_publication_pubname_index
    index info: 2048 total in 2 blocks; 592 free (3 chunks); 1456 used: pg_statistic_relid_att_inh_index
    index info: 2048 total in 2 blocks; 680 free (2 chunks); 1368 used: pg_cast_source_target_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_language_name_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_transform_oid_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_collation_oid_index
    index info: 3072 total in 2 blocks; 1136 free (2 chunks); 1936 used: pg_amop_fam_strat_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_index_indexrelid_index
    index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_ts_template_tmplname_index
    index info: 2048 total in 2 blocks; 704 free (3 chunks); 1344 used: pg_ts_config_map_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_opclass_oid_index
    index info: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used: pg_foreign_data_wrapper_oid_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_event_trigger_evtname_index
    index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_statistic_ext_name_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_publication_oid_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_ts_dict_oid_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_event_trigger_oid_index
    index info: 3072 total in 2 blocks; 1216 free (3 chunks); 1856 used: pg_conversion_default_index
    index info: 3072 total in 2 blocks; 1216 free (3 chunks); 1856 used: pg_operator_oprname_l_r_n_index
    index info: 2048 total in 2 blocks; 680 free (2 chunks); 1368 used: pg_trigger_tgrelid_tgname_index
    index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_enum_typid_label_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_ts_config_oid_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_user_mapping_oid_index
    index info: 2048 total in 2 blocks; 704 free (3 chunks); 1344 used: pg_opfamily_am_name_nsp_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_foreign_table_relid_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_type_oid_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_aggregate_fnoid_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_constraint_oid_index
    index info: 2048 total in 2 blocks; 680 free (2 chunks); 1368 used: pg_rewrite_rel_rulename_index
    index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_ts_parser_prsname_index
    index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_ts_config_cfgname_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_ts_parser_oid_index
    index info: 2048 total in 2 blocks; 728 free (1 chunks); 1320 used: pg_publication_rel_prrelid_prpubid_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_operator_oid_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_namespace_nspname_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_ts_template_oid_index
    index info: 2048 total in 2 blocks; 624 free (2 chunks); 1424 used: pg_amop_opr_fam_index
    index info: 2048 total in 2 blocks; 672 free (3 chunks); 1376 used: pg_default_acl_role_nsp_obj_index
    index info: 2048 total in 2 blocks; 704 free (3 chunks); 1344 used: pg_collation_name_enc_nsp_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_publication_rel_oid_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_range_rngtypid_index
    index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_ts_dict_dictname_index
    index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_type_typname_nsp_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_opfamily_oid_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_statistic_ext_oid_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_class_oid_index
    index info: 2048 total in 2 blocks; 704 free (3 chunks); 1344 used: pg_proc_proname_args_nsp_index
    index info: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used: pg_partitioned_table_partrelid_index
    index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_transform_type_lang_index
    index info: 2048 total in 2 blocks; 680 free (2 chunks); 1368 used: pg_attribute_relid_attnum_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_proc_oid_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_language_oid_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_namespace_oid_index
    index info: 3072 total in 2 blocks; 1136 free (2 chunks); 1936 used: pg_amproc_fam_proc_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_foreign_server_name_index
    index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_attribute_relid_attnam_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_conversion_oid_index
    index info: 2048 total in 2 blocks; 728 free (1 chunks); 1320 used: pg_user_mapping_user_server_index
    index info: 2048 total in 2 blocks; 728 free (1 chunks); 1320 used: pg_subscription_rel_srrelid_srsubid_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_sequence_seqrelid_index
    index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_conversion_name_nsp_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_authid_oid_index
    index info: 2048 total in 2 blocks; 728 free (1 chunks); 1320 used: pg_auth_members_member_role_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_subscription_oid_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_tablespace_oid_index
    index info: 2048 total in 2 blocks; 704 free (3 chunks); 1344 used: pg_shseclabel_object_index
    index info: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used: pg_replication_origin_roname_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_database_datname_index
    index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_subscription_subname_index
    index info: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used: pg_replication_origin_roiident_index
    index info: 2048 total in 2 blocks; 728 free (1 chunks); 1320 used: pg_auth_members_role_member_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_database_oid_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_authid_rolname_index
  WAL record construction: 49768 total in 2 blocks; 6368 free (0 chunks); 43400 used
  PrivateRefCount: 8192 total in 1 blocks; 2624 free (0 chunks); 5568 used
  MdSmgr: 8192 total in 1 blocks; 7256 free (1 chunks); 936 used
  LOCALLOCK hash: 16384 total in 2 blocks; 4600 free (2 chunks); 11784 used
  Timezones: 104120 total in 2 blocks; 2624 free (0 chunks); 101496 used
  ErrorContext: 8192 total in 1 blocks; 7936 free (4 chunks); 256 used
Grand total: 2354072632 bytes in 269647 blocks; 5754640 free (17559 chunks); 2348317992 used
> Was postgres locally compiled, packaged by distribution, or PGDG RPM/DEB ?

Locally compiled. I just recompiled with --enable-debug, ready to deploy that to create a core dump and check it out.
> Can you show \d businessoperation ?
Table "reports.businessoperation" Column | Type | Modifiers ---------------------------+------------------------+-----------documentinternalid | character varying(255) |is_current | character(1) |documentid | character varying(555) |documenttypecode | character varying(512) |subjectroleinternalid | character varying(255) |subjectentityinternalid | character varying(255) |subjectentityid | character varying(555) |subjectentityidroot | character varying(555) |subjectentityname | character varying |subjectentitytel | text |subjectentityemail | text |otherentityinternalid | character varying(255) |confidentialitycode | character varying(512) |actinternalid | character varying(255) |operationcode | character varying(512) |operationname | text |operationqualifiercode | character varying(512) |operationqualifiername | character varying(512) |approvalnumber | character varying(555) |approvalnumbersystem | character varying(555) |approvalstatecode | character varying(512) |approvalstatecodesystem | character varying(512) |approvaleffectivetimelow | character varying(512) |approvaleffectivetimehigh | character varying(512) |approvalstatuscode | character varying(32) |licensecode | character varying(512) |agencyid | character varying(555) |agencyname | text |productitemcode | character varying(512) |productinternalid | character varying(255) |
> Could you rerun the query with \set VERBOSITY verbose to show the file/line that's failing ?
Here goes:
integrator=# \set VERBOSITY verbose
integrator=# SET ENABLE_NESTLOOP TO OFF;
SET
integrator=# INSERT INTO reports.BusinessOperation SELECT * FROM reports.v_BusinessOperation;
ERROR:  53200: out of memory
DETAIL:  Failed on request of size 32800 in memory context "HashBatchContext".
LOCATION:  MemoryContextAlloc, mcxt.c:798
You will notice that I set ENABLE_NESTLOOP to off; that is because the planner goes off thinking the nested-loop plan is marginally more efficient, but in fact it will take 5 hours to get to the same out-of-memory crash, while the plan without nested loops gets there in half an hour. That verbose setting didn't help much, I guess.
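By the way, to avoid leaving enable_nestloop off for the whole session, the setting can be scoped to a single transaction with the standard SET LOCAL:

    BEGIN;
    SET LOCAL enable_nestloop = off;   -- reverts automatically at COMMIT/ROLLBACK
    INSERT INTO reports.BusinessOperation SELECT * FROM reports.v_BusinessOperation;
    COMMIT;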
> If you wanted to show a stack trace, you could attach gdb to the PID from SELECT pg_backend_pid(), "b"reak on errdetail, run the query, and then "bt" when it fails.
gdb -p 27930
GNU gdb (GDB) Red Hat Enterprise Linux 8.0.1-30.amzn2.0.3
...
Attaching to process 27930
Reading symbols from /usr/local/pgsql/bin/postgres...done.
...
(gdb) b errdetail
Breakpoint 1 at 0x82b210: file elog.c, line 872.
(gdb) cont
Continuing.

Breakpoint 1, errdetail (fmt=fmt@entry=0x9d9958 "Failed on request of size %zu in memory context \"%s\".") at elog.c:872
872     {
(gdb) bt
#0  errdetail (fmt=fmt@entry=0x9d9958 "Failed on request of size %zu in memory context \"%s\".") at elog.c:872
#1  0x000000000084e320 in MemoryContextAlloc (context=0x1111600, size=size@entry=32800) at mcxt.c:794
#2  0x000000000060ce7a in dense_alloc (size=384, size@entry=381, hashtable=<optimized out>, hashtable=<optimized out>) at nodeHash.c:2696
#3  0x000000000060d788 in ExecHashTableInsert (hashtable=hashtable@entry=0x10ead08, slot=<optimized out>, hashvalue=194758122) at nodeHash.c:1614
#4  0x0000000000610c6f in ExecHashJoinNewBatch (hjstate=0x10806b0) at nodeHashjoin.c:1051
#5  ExecHashJoinImpl (parallel=false, pstate=0x10806b0) at nodeHashjoin.c:539
#6  ExecHashJoin (pstate=0x10806b0) at nodeHashjoin.c:565
#7  0x000000000061ce4e in ExecProcNode (node=0x10806b0) at ../../../src/include/executor/executor.h:247
#8  ExecSort (pstate=0x1080490) at nodeSort.c:107
#9  0x000000000061d2c4 in ExecProcNode (node=0x1080490) at ../../../src/include/executor/executor.h:247
#10 ExecUnique (pstate=0x107ff60) at nodeUnique.c:73
#11 0x0000000000619732 in ExecProcNode (node=0x107ff60) at ../../../src/include/executor/executor.h:247
#12 ExecModifyTable (pstate=0x107fd20) at nodeModifyTable.c:2025
#13 0x00000000005f75ba in ExecProcNode (node=0x107fd20) at ../../../src/include/executor/executor.h:247
#14 ExecutePlan (execute_once=<optimized out>, dest=0x7f0442721998, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_INSERT, use_parallel_mode=<optimized out>, planstate=0x107fd20, estate=0x107f830) at execMain.c:1723
#15 standard_ExecutorRun (queryDesc=0x1086880, direction=<optimized out>, count=0, execute_once=<optimized out>) at execMain.c:364
#16 0x000000000072a972 in ProcessQuery (plan=<optimized out>, sourceText=0xf4a710 "INSERT INTO reports.BusinessOperation SELECT * FROM reports.v_BusinessOperation;", params=0x0, queryEnv=0x0, dest=0x7f0442721998, completionTag=0x7fff2e4cad30 "") at pquery.c:161
#17 0x000000000072abb0 in PortalRunMulti (portal=portal@entry=0xfb06b0, isTopLevel=isTopLevel@entry=true, setHoldSnapshot=setHoldSnapshot@entry=false, dest=dest@entry=0x7f0442721998, altdest=altdest@entry=0x7f0442721998, completionTag=completionTag@entry=0x7fff2e4cad30 "") at pquery.c:1286
#18 0x000000000072b661 in PortalRun (portal=portal@entry=0xfb06b0, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0x7f0442721998, altdest=altdest@entry=0x7f0442721998, completionTag=0x7fff2e4cad30 "") at pquery.c:799
#19 0x00000000007276e8 in exec_simple_query (query_string=0xf4a710 "INSERT INTO reports.BusinessOperation SELECT * FROM reports.v_BusinessOperation;") at postgres.c:1145
#20 0x0000000000729534 in PostgresMain (argc=<optimized out>, argv=argv@entry=0xf76ce8, dbname=<optimized out>, username=<optimized out>) at postgres.c:4182
#21 0x00000000006be215 in BackendRun (port=0xf6dfe0) at postmaster.c:4361
#22 BackendStartup (port=0xf6dfe0) at postmaster.c:4033
#23 ServerLoop () at postmaster.c:1706
#24 0x00000000006bf122 in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0xf45320) at postmaster.c:1379
#25 0x00000000004822dc in main (argc=3, argv=0xf45320) at main.c:228
That's it.
Thank you all very much for your interest in this case.
-Gunther
Gunther <raj@gusw.net> writes:
> ExecutorState: 2234123384 total in 266261 blocks; 3782328 free (17244 chunks); 2230341056 used

Oooh, that looks like a memory leak right enough. The ExecutorState should not get that big for any reasonable query.

Your error and stack trace show a failure in HashBatchContext, which is probably the last of these four:

> HashBatchContext: 57432 total in 3 blocks; 16072 free (6 chunks); 41360 used
> HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used
> HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used
> HashBatchContext: 100711712 total in 3065 blocks; 7936 free (0 chunks); 100703776 used

Perhaps that's more than it should be, but it's silly to obsess over 100M when there's a 2.2G problem elsewhere. I think it's likely that it was just coincidence that the failure happened right there. Unfortunately, that leaves us with no info about where the actual leak is coming from.

The memory map shows that there were three sorts and four hashes going on, so I'm not sure I believe that this corresponds to the query plan you showed us before.

Any chance of extracting a self-contained test case that reproduces this?

regards, tom lane
On 4/14/2019 23:24, Tom Lane wrote:
> > ExecutorState: 2234123384 total in 266261 blocks; 3782328 free (17244 chunks); 2230341056 used
> Oooh, that looks like a memory leak right enough. The ExecutorState
> should not get that big for any reasonable query.

2.2 GB is massive, yes.

> Your error and stack trace show a failure in HashBatchContext,
> which is probably the last of these four:
>
> > HashBatchContext: 57432 total in 3 blocks; 16072 free (6 chunks); 41360 used
> > HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used
> > HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used
> > HashBatchContext: 100711712 total in 3065 blocks; 7936 free (0 chunks); 100703776 used
>
> Perhaps that's more than it should be, but it's silly to obsess over 100M
> when there's a 2.2G problem elsewhere.

Yes.

> I think it's likely that it was
> just coincidence that the failure happened right there. Unfortunately,
> that leaves us with no info about where the actual leak is coming from.

Strange though, that the vmstat tracking never showed that the cache allocated memory goes much below 6 GB. Even if this 2.2 GB memory leak is there, and even if I had 2 GB of shared_buffers, I would still have enough for the OS to give me.

Is there any doubt that this might be a problem with Linux? Because if you want, I can whip out a FreeBSD machine, compile pgsql, and attach the same disk, and try it there. I am longing to have a reason to move back to FreeBSD anyway. But I have tons of stuff to do, so if you do not have reason to suspect Linux to do wrong here, I prefer skipping that futile attempt.

> The memory map shows that there were three sorts and four hashes going
> on, so I'm not sure I believe that this corresponds to the query plan
> you showed us before.

Like I said, the first explain was not using the same constraints (no NL). What I sent last should all be consistent now: memory dump, explain plan, and gdb backtrace.

> Any chance of extracting a self-contained test case that reproduces this?

With 18 million rows involved in the base tables, hardly. But I am ready to try some other things with the debugger that you want me to try. If we have a memory leak issue, we might just as well try to plug it! I could even give one of you access to the system that runs this.

thanks,
-Gunther
On Sun, Apr 14, 2019 at 11:59:45PM -0400, Gunther wrote:
> On 4/14/2019 23:24, Tom Lane wrote:
> > Any chance of extracting a self-contained test case that reproduces this?
> With 18 million rows involved in the base tables, hardly.

Were you able to reproduce the problem with SELECT (without INSERT)? How many rows does it output? Show EXPLAIN ANALYZE if possible. If that still errors, can you make it work with a small enough LIMIT?

We haven't seen the view -- maybe it's very complicated, but can you reproduce with a simpler one? Fewer joins? Or fewer join conditions?

Justin
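For instance, the kind of reduction being suggested here could start with something like this (the LIMIT value is illustrative; the view name is the one from the failing query):

EXPLAIN ANALYZE SELECT * FROM reports.v_BusinessOperation LIMIT 1000;

If a small LIMIT completes, raising it until the error reappears narrows down roughly how far into the result set the memory growth happens.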
> Could you rerun the query with \set VERBOSITY verbose to show the file/line
> that's failing?

Here goes:
integrator=# \set VERBOSITY verbose
integrator=# SET ENABLE_NESTLOOP TO OFF;
SET
integrator=# INSERT INTO reports.BusinessOperation SELECT * FROM reports.v_BusinessOperation;
ERROR:  53200: out of memory
DETAIL:  Failed on request of size 32800 in memory context "HashBatchContext".
LOCATION:  MemoryContextAlloc, mcxt.c:798

You notice that I set ENABLE_NESTLOOP to off; that is because the planner goes off thinking the NL plan is marginally more efficient, but in fact it will take 5 hours to get to the same out-of-memory crash, while the no-NL plan gets there in half an hour. That verbose setting didn't help much, I guess.
On Sun, Apr 14, 2019 at 11:59:45PM -0400, Gunther wrote:
> On 4/14/2019 23:24, Tom Lane wrote:
> > > ExecutorState: 2234123384 total in 266261 blocks; 3782328 free (17244 chunks); 2230341056 used
> > Oooh, that looks like a memory leak right enough. The ExecutorState
> > should not get that big for any reasonable query.
> 2.2 GB is massive yes.
>
> > Your error and stack trace show a failure in HashBatchContext,
> > which is probably the last of these four:
> >
> > > HashBatchContext: 57432 total in 3 blocks; 16072 free (6 chunks); 41360 used
> > > HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used
> > > HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used
> > > HashBatchContext: 100711712 total in 3065 blocks; 7936 free (0 chunks); 100703776 used
> > Perhaps that's more than it should be, but it's silly to obsess over 100M
> > when there's a 2.2G problem elsewhere.
> Yes.
>
> > I think it's likely that it was
> > just coincidence that the failure happened right there. Unfortunately,
> > that leaves us with no info about where the actual leak is coming from.
>
> Strange though, that the vmstat tracking never showed that the cache
> allocated memory goes much below 6 GB. Even if this 2.2 GB memory leak
> is there, and even if I had 2 GB of shared_buffers, I would still have
> enough for the OS to give me.

Depends on how the kernel is configured. What are vm.overcommit_memory and vm.overcommit_ratio set to, for example?

It may easily be the case that the kernel is only allowing 50% of RAM to be committed to user space, and then refusing to allocate more despite having free memory. That's a fairly common issue on swapless systems.

Try running the query again, watch

cat /proc/meminfo | grep Commit

and see if it crashes when Committed_AS hits the CommitLimit.

That doesn't explain where the memory leak is, though :-(

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
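For reference, a minimal way to watch that live while the query runs (assuming a standard Linux /proc layout; watch is part of procps):

watch -n 1 'grep -i commit /proc/meminfo'
# prints, once a second:
#   CommitLimit:    3955192 kB   <- swap + overcommit_ratio% of RAM when vm.overcommit_memory=2
#   Committed_AS:   2937352 kB   <- total committed address space; allocations fail when this hits the limit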
Jeff Janes <jeff.janes@gmail.com> writes:
> To get it to happen faster, maybe you could run the server with a small
> setting of "ulimit -v"?  Or, you could try to capture it live in gdb.
> Unfortunately I don't know how to set a breakpoint for allocations into a
> specific context, and setting a breakpoint for any memory allocation is
> probably going to fire too often to be useful.

If you can use gdb at all, it's not that hard to break on allocations into a specific context; I've done it many times. The strategy is basically

1. Let query run long enough for memory usage to start increasing, then attach to backend with gdb.

2. Set breakpoint at, probably, AllocSetAlloc. (In some cases, reallocs could be the problem, but I doubt it here.) Then "c".

3. When it stops, "p *context" and see if this is the context you're looking for. In this case, since we want to know about allocations into ExecutorState and we know there's only one active one, you just have to look at the context name. In general you might have to look at the backtrace. Anyway, if it isn't the one you want, just "c" until you get to an allocation into the one you do want.

4. Once you have found out the address of the context you care about, make the breakpoint conditional on the context argument being that one. It might look like this:

Breakpoint 1, AllocSetAlloc (context=0x1483be0, size=480) at aset.c:715
715     {
(gdb) p *context
$1 = {type = T_AllocSetContext, isReset = false, allowInCritSection = false,
  methods = 0xa33f40, parent = 0x0, firstchild = 0x1537f30, prevchild = 0x0,
  nextchild = 0x0, name = 0xa3483f "TopMemoryContext", ident = 0x0,
  reset_cbs = 0x0}
(gdb) cond 1 context == 0x1483be0

5. Now repeatedly "c", and check the stack trace each time, for a dozen or two times to get a feeling for where the allocations are being requested.

In some cases you might be able to find the context address in a more efficient way than what I suggest in #3 --- for instance, you could instead set a breakpoint where the context is created and snag its address immediately, or you could dig around in backend data structures to find it. But these ways generally require more familiarity with the code than just watching the requests go by.

> Are you not showing the view definition for proprietary reasons, or just
> because you don't think it will be useful?

As far as that goes, I think the most likely theory right now is that some particular function being used in the view is leaking memory. So yes, we need to see the view ... or you can try removing bits of it to see if the leak goes away.

regards, tom lane
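For quick reference, the recipe condensed into one gdb session (the PID and the context address are illustrative, taken from the examples above):

$ gdb -p 27930                      # attach to the running backend
(gdb) break AllocSetAlloc           # fires on every allocation, in every context
(gdb) c
(gdb) p context->name               # repeat "c" until this prints "ExecutorState"
(gdb) cond 1 context == 0x1483be0   # pin breakpoint 1 to that context's address
(gdb) c                             # now each stop is an ExecutorState allocation
(gdb) bt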
On Sun, Apr 14, 2019 at 05:19:50PM -0400, Jeff Janes wrote:
> On Sun, Apr 14, 2019 at 4:51 PM Gunther <raj@gusw.net> wrote:
> > For weeks now, I am banging my head at an "out of memory" situation.
> > There is only one query I am running on an 8 GB system, whatever I try,
> > I get knocked out on this out of memory.
>
> Is PostgreSQL throwing an error with OOM, or is it getting killed -9 by the
> OOM killer?  Do you get a core file you can inspect with gdb?
>
> > You might want to see the query, but it is a huge plan, and I can't
> > really break this down. It shouldn't matter though. But just so you can
> > get a glimpse here is the plan:
> >
> > Insert on businessoperation  (cost=5358849.28..5361878.44 rows=34619 width=1197)
> >   ->  Unique  (cost=5358849.28..5361532.25 rows=34619 width=1197)
>
> Maybe it is memory for trigger or constraint checking, although I don't
> know why that would appear instantly. What triggers or constraints do you
> have on businessoperation?

Yeah, that would be my guess too. If I had to guess, something likely gets confused and allocates memory in es_query_ctx instead of the per-tuple context (es_per_tuple_exprcontext). Triggers, constraints and expr evaluation all seem like plausible candidates. It's going to be hard to nail the exact place, though :-(

> What if you just run the SELECT without the INSERT? Or insert into a temp
> table rather than into businessoperation? And if that doesn't crash, what
> if you then insert to businessoperation from the temp table?

Yeah. What's the schema of "businessoperation"? Anything special about it? Triggers, expression indexes, check constraints, ...

Gunther, you mentioned you build postgres from sources. Would it be possible to add some sort of extra debugging to see where the memory is allocated from? It's a bit heavy-handed, though.

Or maybe splitting es_query_ctx into smaller contexts. That might be easier to evaluate than sifting through god-knows-how-many GBs of log.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
I wrote:
> If you can use gdb at all, it's not that hard to break on allocations
> into a specific context; I've done it many times.  The strategy is
> basically
> 1. Let query run long enough for memory usage to start increasing,
> then attach to backend with gdb.

BTW, just to clarify that strategy a bit: the usage pattern we expect for ExecutorState is that there are a bunch of individual allocations during executor startup, but then none while the query is running (or at least, if there are any in that phase, they get freed before moving on to the next row). The form of the leak, almost certainly, is that some allocation is happening per-row and *not* getting freed. So there's no point in groveling through the startup behavior. What we want to know about is what happens after we reach the ought-to-be steady state behavior.

regards, tom lane
OK Guys, you are very kind to continue taking an interest in this matter.
I will try what I can to help squish the bug.
Tomas Vondra just added a good idea that explains why I get the out of memory error while still having so much cache available:
# sysctl vm.overcommit_memory
vm.overcommit_memory = 2
# sysctl vm.overcommit_ratio
vm.overcommit_ratio = 50
as he predicted.
# cat /proc/meminfo | grep Commit
CommitLimit:     3955192 kB
Committed_AS:    2937352 kB
So I think that explains why it turns into an out of memory error. We don't need to worry or wonder about that any more. I will change that parameter in the future to allow for some spikes. But it's not going to resolve the underlying memory leak issue.
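A sketch of that adjustment (the value is illustrative; with vm.overcommit_memory=2, CommitLimit = swap + overcommit_ratio% of RAM):

# raise the ratio on the running system
sysctl -w vm.overcommit_ratio=80

# persist it across reboots
echo 'vm.overcommit_ratio = 80' >> /etc/sysctl.d/99-overcommit.conf
sysctl --system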
Now I run explain analyze SELECT ... without the INSERT.
integrator=# \set VERBOSITY verbose
integrator=#
integrator=# \pset pager off
Pager usage is off.
integrator=# \pset format unaligned
Output format is unaligned.
integrator=# \set VERBOSITY verbose
integrator=#
integrator=# SET ENABLE_NESTLOOP TO OFF;
SET
integrator=# explain analyze SELECT * FROM reports.v_BusinessOperation;
ERROR:  53200: out of memory
DETAIL:  Failed on request of size 32800 in memory context "HashBatchContext".
LOCATION:  MemoryContextAlloc, mcxt.c:798
And since that failed already, I guess we don't need to worry about the temporary table insert.
About adding LIMIT, I don't think it makes sense in the outer query, since the error is probably happening earlier. I did put a LIMIT 100 on one of the tables we join to, and it helped. But that doesn't really tell us anything I think.
Then yes, I can try the backtrace with the NLs enabled. It will just take a long, long time, and unfortunately it is extremely likely that I lose the console and then will be unable to get back to it. OK, screen(1) resolves that problem too. Will do, after I have reported the above.
But now you have already produced more ideas ...
> Maybe it is memory for trigger or constraint checking, although I don't
> know why that would appear instantly. What triggers or constraints do you
> have on businessoperation?

> Yeah, that would be my guess too. If I had to guess, something likely gets
> confused and allocates memory in es_query_ctx instead of the per-tuple
> context (es_per_tuple_exprcontext).
>
> Triggers, constraints and expr evaluation all seem like plausible
> candidates. It's going to be hard to nail the exact place, though :-(
I think triggers and constraints are ruled out, because the problem happens without the INSERT.
That leaves us with expression evaluation. And OK, now you really wanna see the query, although it should be in the plan too. But for what it is worth:
SELECT DISTINCT
       documentInternalId, is_current, documentId, documentTypeCode,
       subjectRoleInternalId, subjectEntityInternalId, subjectEntityId,
       subjectEntityIdRoot, subjectEntityName, subjectEntityTel,
       subjectEntityEmail, otherEntityInternalId, confidentialityCode,
       actInternalId,
       code_code AS operationCode,
       code_displayName AS operationName,
       operationQualifierCode, operationQualifierName,
       approvalNumber, approvalNumberSystem, approvalStateCode,
       approvalStateCodeSystem, approvalEffectiveTimeLow,
       approvalEffectiveTimeHigh, approvalStatusCode, licenseCode,
       agencyId, agencyName, productItemCode, productInternalId
  FROM reports.DocumentInformationSubject
  LEFT OUTER JOIN (SELECT documentInternalId, documentId, actInternalId,
                          subjectEntityCode AS productItemCode,
                          subjectEntityInternalId AS productInternalId
                     FROM reports.DocumentInformationSubject
                    WHERE participationTypeCode = 'PRD') prd
    USING(documentInternalId, documentId, actInternalId)
  LEFT OUTER JOIN (SELECT documentInternalId,
                          q.code_code AS operationQualifierCode,
                          q.code_displayName AS operationQualifierName,
                          r.targetInternalId AS actInternalId,
                          actInternalId AS approvalInternalId,
                          an.extension AS approvalNumber,
                          an.root AS approvalNumberSystem,
                          qs.subjectEntityCode AS approvalStateCode,
                          qs.subjectEntityCodeSystem AS approvalStateCodeSystem,
                          qs.effectivetime_low AS approvalEffectiveTimeLow,
                          qs.effectivetime_high AS approvalEffectiveTimeHigh,
                          qs.statusCode AS approvalStatusCode,
                          qs.code_code AS licenseCode,
                          agencyId.extension AS agencyId,
                          agencyName.trivialName AS agencyName
                     FROM reports.DocumentInformation q
                     LEFT OUTER JOIN (SELECT * FROM reports.DocumentInformationSubject
                                       WHERE participationTypeCode = 'AUT') qs
                       USING(documentInternalId, actInternalId)
                     INNER JOIN integrator.ActRelationship r
                        ON (r.sourceInternalId = actInternalId AND r.typeCode = 'SUBJ')
                     LEFT OUTER JOIN integrator.Act_id an USING(actInternalId)
                     LEFT OUTER JOIN integrator.Entity_id agencyId
                        ON (agencyId.entityInternalId = otherEntityInternalId)
                     LEFT OUTER JOIN reports.BestName agencyName
                        ON (agencyName.entityInternalId = otherEntityInternalId)
                    WHERE q.classCode = 'CNTRCT'
                      AND q.moodCode = 'EVN'
                      AND q.code_codeSystem = '2.16.840.1.113883.3.26.1.1') q
    USING(documentInternalId, actInternalId)
 WHERE classCode = 'ACT'
   AND moodCode = 'DEF'
   AND code_codeSystem = '2.16.840.1.113883.3.26.1.1'
   AND participationTypeCode IN ('PPRF','PRF');
You see that the expressions are all just equality operations, some IN, nothing outlandish.
Now I will try what Tom Lane suggested. Here you go; I have it stopped at this state, so if you want me to inspect anything else, I can do it.
With screen(1) I can be sure I won't lose my stuff when my internet goes down. Nice.
I have one screen session with 3 windows:
- psql
- gdb
- misc (vmstat, etc.)
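For what it's worth, a minimal sketch of that screen workflow (the session name is illustrative):

screen -S pgdebug        # start a named session; Ctrl-a c opens more windows
# Ctrl-a n / Ctrl-a p switch between the windows (psql, gdb, vmstat)
# Ctrl-a d detaches; after the connection drops, log back in and reattach:
screen -r pgdebug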
Now I have let this run for a good long time while setting up my screen stuff. And then:
ps -x
look for the postgres job with the EXPLAIN ... that's $PID, then:
gdb -p $PID
Then first I do
cont
but then it stops at SIGUSR1, because of the parallel workers signalling each other.
handle SIGUSR1 nostop
suppresses that stopping. Then I break with CTRL-C and set the breakpoint where Tom Lane said:
b AllocSetAlloc
once it stops there I do
Breakpoint 1, AllocSetAlloc (context=0x1168230, size=8) at aset.c:715
715     {
(gdb) p context->name
$4 = 0x96ce5b "ExecutorState"
So I should even be able to set a conditional breakpoint.
(gdb) delete
Delete all breakpoints? (y or n) y
(gdb) b AllocSetAlloc if strcmp(context->name, "ExecutorState") == 0
Breakpoint 2 at 0x848ed0: file aset.c, line 715.
(gdb) cont
(gdb) cont
Continuing.

Breakpoint 2, AllocSetAlloc (context=0x1168230, size=10) at aset.c:715
715     {
(gdb) cont
Continuing.

Program received signal SIGUSR1, User defined signal 1.

Breakpoint 2, AllocSetAlloc (context=0x1168230, size=152) at aset.c:715
715     {
(gdb) cont
Continuing.

Program received signal SIGUSR1, User defined signal 1.

Breakpoint 2, AllocSetAlloc (context=0x1168230, size=201) at aset.c:715
715     {
(gdb) cont
Continuing.

Breakpoint 2, AllocSetAlloc (context=0x1168230, size=8272) at aset.c:715
715     {
(gdb) p context->name
$8 = 0x96ce5b "ExecutorState"
Nice. Now the question is, am I at the place where memory gets squeezed? And I think yes. With top:
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
31752 postgres  20   0 2772964   1.2g 329640 t   0.0 16.5   8:46.59 postgres: postgres integrator [local] EXPLAIN
I guess I should run this for a little longer. So I disable my breakpoints
(gdb) info breakpoints
Num     Type           Disp Enb Address            What
2       breakpoint     keep y   0x0000000000848ed0 in AllocSetAlloc at aset.c:715
        stop only if strcmp(context->name, "ExecutorState") == 0
        breakpoint already hit 6 times
(gdb) disable 2
(gdb) cont
Continuing.
while watching top:
31752 postgres  20   0 2777060   1.3g 329920 D  33.2 17.9   8:52.07 postgres: postgres integrator [local] EXPLAIN
31752 postgres  20   0 2777060   1.4g 329920 D  33.2 17.9   8:52.07 postgres: postgres integrator [local] EXPLAIN
31752 postgres  20   0 2777060   1.5g 329920 D  33.2 17.9   8:52.07 postgres: postgres integrator [local] EXPLAIN
It went up pretty quickly from 1.2 GB to 1.5 GB, but then it stopped growing as fast, so now back to gdb and break:
^C
Program received signal SIGINT, Interrupt.
0x00007f048f336d71 in read () from /lib64/libpthread.so.0
(gdb) enable 2
(gdb) cont
Continuing.

Breakpoint 2, AllocSetAlloc (context=0x1168230, size=385) at aset.c:715
715     {
Now I give you a bt so we have something to look at:
#0  AllocSetAlloc (context=0x1168230, size=385) at aset.c:715
#1  0x000000000084e6cd in palloc (size=385) at mcxt.c:938
#2  0x000000000061019c in ExecHashJoinGetSavedTuple (file=file@entry=0x8bbc528, hashvalue=hashvalue@entry=0x7fff2e4ca76c, tupleSlot=0x10856b8, hjstate=0x11688e0) at nodeHashjoin.c:1277
#3  0x0000000000610c83 in ExecHashJoinNewBatch (hjstate=0x11688e0) at nodeHashjoin.c:1042
#4  ExecHashJoinImpl (parallel=false, pstate=0x11688e0) at nodeHashjoin.c:539
#5  ExecHashJoin (pstate=0x11688e0) at nodeHashjoin.c:565
#6  0x00000000005fde68 in ExecProcNodeInstr (node=0x11688e0) at execProcnode.c:461
#7  0x000000000061ce4e in ExecProcNode (node=0x11688e0) at ../../../src/include/executor/executor.h:247
#8  ExecSort (pstate=0x11687d0) at nodeSort.c:107
#9  0x00000000005fde68 in ExecProcNodeInstr (node=0x11687d0) at execProcnode.c:461
#10 0x000000000061d2c4 in ExecProcNode (node=0x11687d0) at ../../../src/include/executor/executor.h:247
#11 ExecUnique (pstate=0x11685e0) at nodeUnique.c:73
#12 0x00000000005fde68 in ExecProcNodeInstr (node=0x11685e0) at execProcnode.c:461
#13 0x00000000005f75ba in ExecProcNode (node=0x11685e0) at ../../../src/include/executor/executor.h:247
#14 ExecutePlan (execute_once=<optimized out>, dest=0xcc60e0 <donothingDR>, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x11685e0, estate=0x1168340) at execMain.c:1723
#15 standard_ExecutorRun (queryDesc=0x119b6d8, direction=<optimized out>, count=0, execute_once=<optimized out>) at execMain.c:364
#16 0x000000000059c6f8 in ExplainOnePlan (plannedstmt=plannedstmt@entry=0x1199a68, into=into@entry=0x0, es=es@entry=0x1141d48, queryString=<optimized out>, params=0x0, queryEnv=queryEnv@entry=0x0, planduration=0x7fff2e4ca990) at explain.c:535
#17 0x000000000059c9ef in ExplainOneQuery (query=<optimized out>, cursorOptions=<optimized out>, into=0x0, es=0x1141d48, queryString=0xf4af30 "explain analyze\nSELECT DISTINCT\n documentInternalId, is_current,\ndocumentId,\ndocumentTypeCode,\nsubjectRoleInternalId,\nsubjectEntityInternalId,\nsubjectEntityId,\nsubjectEntityIdRoot,\nsubjectEntit"..., params=0x0, queryEnv=0x0) at explain.c:371
#18 0x000000000059ce37 in ExplainQuery (pstate=pstate@entry=0xf74608, stmt=stmt@entry=0x11ef240, queryString=queryString@entry=0xf4af30 "explain analyze\nSELECT DISTINCT\n documentInternalId, is_current,\ndocumentId,\ndocumentTypeCode,\nsubjectRoleInternalId,\nsubjectEntityInternalId,\nsubjectEntityId,\nsubjectEntityIdRoot,\nsubjectEntit"..., params=params@entry=0x0, queryEnv=queryEnv@entry=0x0, dest=dest@entry=0xf74578) at explain.c:254
#19 0x000000000072ca5d in standard_ProcessUtility (pstmt=0x11ef390, queryString=0xf4af30 "explain analyze\nSELECT DISTINCT\n documentInternalId, is_current,\ndocumentId,\ndocumentTypeCode,\nsubjectRoleInternalId,\nsubjectEntityInternalId,\nsubjectEntityId,\nsubjectEntityIdRoot,\nsubjectEntit"..., context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0xf74578, completionTag=0x7fff2e4cab20 "") at utility.c:675
#20 0x000000000072a052 in PortalRunUtility (portal=0xfb06b0, pstmt=0x11ef390, isTopLevel=<optimized out>, setHoldSnapshot=<optimized out>, dest=<optimized out>, completionTag=0x7fff2e4cab20 "") at pquery.c:1178
#21 0x000000000072add2 in FillPortalStore (portal=portal@entry=0xfb06b0, isTopLevel=isTopLevel@entry=true) at pquery.c:1038
#22 0x000000000072b855 in PortalRun (portal=portal@entry=0xfb06b0, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0xf4c570, altdest=altdest@entry=0xf4c570, completionTag=0x7fff2e4cad30 "") at pquery.c:768
#23 0x00000000007276e8 in exec_simple_query (query_string=0xf4af30 "explain analyze\nSELECT DISTINCT\n documentInternalId, is_current,\ndocumentId,\ndocumentTypeCode,\nsubjectRoleInternalId,\nsubjectEntityInternalId,\nsubjectEntityId,\nsubjectEntityIdRoot,\nsubjectEntit"...) at postgres.c:1145
#24 0x0000000000729534 in PostgresMain (argc=<optimized out>, argv=argv@entry=0xf76ce8, dbname=<optimized out>, username=<optimized out>) at postgres.c:4182
#25 0x00000000006be215 in BackendRun (port=0xf6efe0) at postmaster.c:4361
#26 BackendStartup (port=0xf6efe0) at postmaster.c:4033
#27 ServerLoop () at postmaster.c:1706
#28 0x00000000006bf122 in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0xf45320) at postmaster.c:1379
#29 0x00000000004822dc in main (argc=3, argv=0xf45320) at main.c:228
But who knows if that's it. I continue and watch top again...
31752 postgres 20 0 3112352 1.8g 329920 D 32.2 23.7 9:43.75 postgres: postgres integrator [local] EXPLAIN
it went quickly to 1.6, then after some time to 1.7, then 1.8, and I stop again:
^C
Program received signal SIGINT, Interrupt.
0x00007f048f336d71 in read () from /lib64/libpthread.so.0
(gdb) enable 2
(gdb) cont
Continuing.

Breakpoint 2, AllocSetAlloc (context=0x1168230, size=375) at aset.c:715
715     {
(gdb) bt
#0  AllocSetAlloc (context=0x1168230, size=375) at aset.c:715
#1  0x000000000084e6cd in palloc (size=375) at mcxt.c:938
#2  0x000000000061019c in ExecHashJoinGetSavedTuple (file=file@entry=0x21df688, hashvalue=hashvalue@entry=0x7fff2e4ca76c, tupleSlot=0x10856b8, hjstate=0x11688e0) at nodeHashjoin.c:1277
#3  0x0000000000610c83 in ExecHashJoinNewBatch (hjstate=0x11688e0) at nodeHashjoin.c:1042
#4  ExecHashJoinImpl (parallel=false, pstate=0x11688e0) at nodeHashjoin.c:539
#5  ExecHashJoin (pstate=0x11688e0) at nodeHashjoin.c:565
#6  0x00000000005fde68 in ExecProcNodeInstr (node=0x11688e0) at execProcnode.c:461
#7  0x000000000061ce4e in ExecProcNode (node=0x11688e0) at ../../../src/include/executor/executor.h:247
#8  ExecSort (pstate=0x11687d0) at nodeSort.c:107
#9  0x00000000005fde68 in ExecProcNodeInstr (node=0x11687d0) at execProcnode.c:461
#10 0x000000000061d2c4 in ExecProcNode (node=0x11687d0) at ../../../src/include/executor/executor.h:247
#11 ExecUnique (pstate=0x11685e0) at nodeUnique.c:73
#12 0x00000000005fde68 in ExecProcNodeInstr (node=0x11685e0) at execProcnode.c:461
#13 0x00000000005f75ba in ExecProcNode (node=0x11685e0) at ../../../src/include/executor/executor.h:247
#14 ExecutePlan (execute_once=<optimized out>, dest=0xcc60e0 <donothingDR>, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x11685e0, estate=0x1168340) at execMain.c:1723
#15 standard_ExecutorRun (queryDesc=0x119b6d8, direction=<optimized out>, count=0, execute_once=<optimized out>) at execMain.c:364
#16 0x000000000059c6f8 in ExplainOnePlan (plannedstmt=plannedstmt@entry=0x1199a68, into=into@entry=0x0, es=es@entry=0x1141d48, queryString=<optimized out>, params=0x0, queryEnv=queryEnv@entry=0x0, planduration=0x7fff2e4ca990) at explain.c:535
#17 0x000000000059c9ef in ExplainOneQuery (query=<optimized out>, cursorOptions=<optimized out>, into=0x0, es=0x1141d48, queryString=0xf4af30 "explain analyze\nSELECT DISTINCT\n documentInternalId, is_current,\ndocumentId,\ndocumentTypeCode,\nsubjectRoleInternalId,\nsubjectEntityInternalId,\nsubjectEntityId,\nsubjectEntityIdRoot,\nsubjectEntit"..., params=0x0, queryEnv=0x0) at explain.c:371
#18 0x000000000059ce37 in ExplainQuery (pstate=pstate@entry=0xf74608, stmt=stmt@entry=0x11ef240, queryString=queryString@entry=0xf4af30 "explain analyze\nSELECT DISTINCT\n documentInternalId, is_current,\ndocumentId,\ndocumentTypeCode,\nsubjectRoleInternalId,\nsubjectEntityInternalId,\nsubjectEntityId,\nsubjectEntityIdRoot,\nsubjectEntit"..., params=params@entry=0x0, queryEnv=queryEnv@entry=0x0, dest=dest@entry=0xf74578) at explain.c:254
#19 0x000000000072ca5d in standard_ProcessUtility (pstmt=0x11ef390, queryString=0xf4af30 "explain analyze\nSELECT DISTINCT\n documentInternalId, is_current,\ndocumentId,\ndocumentTypeCode,\nsubjectRoleInternalId,\nsubjectEntityInternalId,\nsubjectEntityId,\nsubjectEntityIdRoot,\nsubjectEntit"..., context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0xf74578, completionTag=0x7fff2e4cab20 "") at utility.c:675
#20 0x000000000072a052 in PortalRunUtility (portal=0xfb06b0, pstmt=0x11ef390, isTopLevel=<optimized out>, setHoldSnapshot=<optimized out>, dest=<optimized out>, completionTag=0x7fff2e4cab20 "") at pquery.c:1178
#21 0x000000000072add2 in FillPortalStore (portal=portal@entry=0xfb06b0, isTopLevel=isTopLevel@entry=true) at pquery.c:1038
#22 0x000000000072b855 in PortalRun (portal=portal@entry=0xfb06b0, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0xf4c570, altdest=altdest@entry=0xf4c570, completionTag=0x7fff2e4cad30 "") at pquery.c:768
#23 0x00000000007276e8 in exec_simple_query (query_string=0xf4af30 "explain analyze\nSELECT DISTINCT\n documentInternalId, is_current,\ndocumentId,\ndocumentTypeCode,\nsubjectRoleInternalId,\nsubjectEntityInternalId,\nsubjectEntityId,\nsubjectEntityIdRoot,\nsubjectEntit"...) at postgres.c:1145
#24 0x0000000000729534 in PostgresMain (argc=<optimized out>, argv=argv@entry=0xf76ce8, dbname=<optimized out>, username=<optimized out>) at postgres.c:4182
#25 0x00000000006be215 in BackendRun (port=0xf6efe0) at postmaster.c:4361
#26 BackendStartup (port=0xf6efe0) at postmaster.c:4033
#27 ServerLoop () at postmaster.c:1706
#28 0x00000000006bf122 in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0xf45320) at postmaster.c:1379
#29 0x00000000004822dc in main (argc=3, argv=0xf45320) at main.c:228
Good, now I leave this all sitting like that for you to ask me what else you might want to see.
We are now close to the edge of the cliff.
-Gunther
Tom Lane wrote:
> If you can use gdb at all, it's not that hard to break on allocations
> into a specific context; I've done it many times.  The strategy is
> basically
> [...]
Thanks for the recipe. I can use gdb at all, just not very skillfully :)
Gunther <raj@gusw.net> writes:
> Now I give you a bt so we have something to look at:

> #0  AllocSetAlloc (context=0x1168230, size=385) at aset.c:715
> #1  0x000000000084e6cd in palloc (size=385) at mcxt.c:938
> #2  0x000000000061019c in ExecHashJoinGetSavedTuple (file=file@entry=0x8bbc528, hashvalue=hashvalue@entry=0x7fff2e4ca76c,
>     tupleSlot=0x10856b8, hjstate=0x11688e0) at nodeHashjoin.c:1277

I'm pretty sure that's not the droid we're looking for. ExecHashJoinGetSavedTuple does palloc a new tuple, but it immediately sticks it into a TupleTableSlot that will be responsible for freeing it (when the next tuple is stuck into the same slot). I'd suggest continuing a few times and looking for other code paths leading to AllocSetAlloc in this context.

My first thought on noticing the SELECT DISTINCT was that you might be hitting the grouping-function-related leak that Andres fixed in 9cf37a527; but that fix did make it into 11.2 (by just a couple of days...). Still, maybe there's another issue in the same area.

regards, tom lane
On 2019-Apr-15, Gunther wrote:

> #0  AllocSetAlloc (context=0x1168230, size=385) at aset.c:715
> #1  0x000000000084e6cd in palloc (size=385) at mcxt.c:938
> #2  0x000000000061019c in ExecHashJoinGetSavedTuple (file=file@entry=0x8bbc528, hashvalue=hashvalue@entry=0x7fff2e4ca76c,
>     tupleSlot=0x10856b8, hjstate=0x11688e0) at nodeHashjoin.c:1277
> #3  0x0000000000610c83 in ExecHashJoinNewBatch (hjstate=0x11688e0) at nodeHashjoin.c:1042

Seems that ExecHashJoinGetSavedTuple stores a minimalTuple and sets the shouldFree flag to "true", and then in ExecHashJoinNewBatch, callee ExecFetchSlotMinimalTuple sets shouldFree to false unconditionally when the slot uses minimal tuple ops. Maybe that's correct, but it does sound like a memory leak is not entirely impossible. I wonder if this fixes it, without causing crashes elsewhere.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Seems that ExecHashJoinGetSavedTuple stores a minimalTuple and sets the
> shouldFree flag to "true", and then in ExecHashJoinNewBatch, callee
> ExecFetchSlotMinimalTuple sets shouldFree to false unconditionally when
> the slot uses minimal tuple ops.  Maybe that's correct, but it does
> sound like a memory leak is not entirely impossible.  I wonder if this
> fixes it, without causing crashes elsewhere.

This discussion is about v11, not HEAD. Still, I agree that that coding in HEAD seems a bit fishy.

regards, tom lane
> Breakpoint 2, AllocSetAlloc (context=0x1168230, size=8272) at aset.c:715
> 715     {
> (gdb) p context->name
> $8 = 0x96ce5b "ExecutorState"

> I guess I should run this for a little longer. So I disable my breakpoints

> it went up pretty quick from 1.2 GB to 1.5 GB, but then it stopped growing fast, so now back to gdb and break:
Wow, we are getting somewhere.
Tom (BTW, your mail server rejects my direct mail, but I'm glad you got it through the list), you say:
> I'm pretty sure that's not the droid we're looking for.
> ExecHashJoinGetSavedTuple does palloc a new tuple, but it immediately
> sticks it into a TupleTableSlot that will be responsible for freeing
> it (when the next tuple is stuck into the same slot).  I'd suggest
> continuing a few times and looking for other code paths leading
> to AllocSetAlloc in this context.
I did continue a "few times", but few as in a dozen; it's always the same:
(gdb) bt 6
#0  AllocSetAlloc (context=0x1168230, size=375) at aset.c:715
#1  0x000000000084e6cd in palloc (size=375) at mcxt.c:938
#2  0x000000000061019c in ExecHashJoinGetSavedTuple (file=file@entry=0x21df688, hashvalue=hashvalue@entry=0x7fff2e4ca76c, tupleSlot=0x10856b8, hjstate=0x11688e0) at nodeHashjoin.c:1277
#3  0x0000000000610c83 in ExecHashJoinNewBatch (hjstate=0x11688e0) at nodeHashjoin.c:1042
#4  ExecHashJoinImpl (parallel=false, pstate=0x11688e0) at nodeHashjoin.c:539
#5  ExecHashJoin (pstate=0x11688e0) at nodeHashjoin.c:565
(More stack frames follow...)
So I decided to just let it go until it exits the ExecHashJoin function:
#6  0x00000000005fde68 in ExecProcNodeInstr (node=0x11688e0) at execProcnode.c:461
461         result = node->ExecProcNodeReal(node);
(gdb) list
456     {
457         TupleTableSlot *result;
458
459         InstrStartNode(node->instrument);
460
461         result = node->ExecProcNodeReal(node);
462
463         InstrStopNode(node->instrument, TupIsNull(result) ? 0.0 : 1.0);
464
465         return result;
(gdb) break 463
Breakpoint 3 at 0x5fde68: file execProcnode.c, line 463.
(gdb) disable 2
(gdb) cont
Continuing.

Breakpoint 3, ExecProcNodeInstr (node=0x11688e0) at execProcnode.c:463
463         InstrStopNode(node->instrument, TupIsNull(result) ? 0.0 : 1.0);
oops, that was fast, so up further ...
(gdb) cont
Continuing.

Breakpoint 4, ExecSort (pstate=0x11687d0) at nodeSort.c:109
109         if (TupIsNull(slot))
(gdb) cont
Continuing.

Breakpoint 3, ExecProcNodeInstr (node=0x11688e0) at execProcnode.c:463
463         InstrStopNode(node->instrument, TupIsNull(result) ? 0.0 : 1.0);
(gdb) cont
Continuing.

Breakpoint 4, ExecSort (pstate=0x11687d0) at nodeSort.c:109
109         if (TupIsNull(slot))
(gdb) up
#1  0x00000000005fde68 in ExecProcNodeInstr (node=0x11687d0) at execProcnode.c:461
461         result = node->ExecProcNodeReal(node);
(gdb) up
#2  0x000000000061d2c4 in ExecProcNode (node=0x11687d0) at ../../../src/include/executor/executor.h:247
247         return node->ExecProcNode(node);
(gdb) up
#3  ExecUnique (pstate=0x11685e0) at nodeUnique.c:73
73          slot = ExecProcNode(outerPlan);
(gdb) list
68          for (;;)
69          {
70              /*
71               * fetch a tuple from the outer subplan
72               */
73              slot = ExecProcNode(outerPlan);
74              if (TupIsNull(slot))
75              {
76                  /* end of subplan, so we're done */
77                  ExecClearTuple(resultTupleSlot);
... but whatever I do, ultimately I get to that allocation routine through the same path.
Since that is the bulk of the activity, and memory was still growing while we came through this path, I assume that this is it.
> My first thought on noticing the SELECT DISTINCT was that you might be
> hitting the grouping-function-related leak that Andres fixed in 9cf37a527;
> but that fix did make it into 11.2 (by just a couple of days...).  Still,
> maybe there's another issue in the same area.
I don't know about that one, I only know that I am running 11.2 freshly compiled.
The change suggested by Alvaro Herrera wasn't applicable.
Jeff Janes had more:
> > Breakpoint 2, AllocSetAlloc (context=0x1168230, size=8272) at aset.c:715
> > 715     {
> > (gdb) p context->name
> > $8 = 0x96ce5b "ExecutorState"
>
> I think that the above one might have been the one you wanted.

Not sure how you could tell that? It's the same place as everything else. If we can find out what you're looking for, maybe we can set a breakpoint earlier up the call chain?
> > I guess I should run this for a little longer. So I disable my breakpoints
> >
> > it went up pretty quick from 1.2 GB to 1.5 GB, but then it stopped growing fast, so now back to gdb and break:
>
> Unfortunately, I think this means you missed your opportunity and are now
> getting backtraces of the innocent bystanders.

But why? If I see the memory still go up insanely fast, isn't that a sign of the leak?
> Particularly since you report that the version using nested loops rather
> than hash joins also leaked, so it is probably not the hash-join specific
> code that is doing it.

How about it's in the DISTINCT? I noticed while peeking up the call chain that it was already in the UNIQUE sort thing also. I guess it's streaming the results from the hash join right into the unique sort step.
> What I've done before is compile with the comments removed from
> src/backend/utils/mmgr/aset.c:
>
> /* #define HAVE_ALLOCINFO */

I have just done that, and it creates an insane amount of output from all the processes; I'm afraid there will be no way to keep that stuff separated. If there was a way of turning that on and off for one process only, then we could probably get more info...
Everything is also extremely slow that way. Like in a half hour the memory didn't even reach 100 MB.
> and then look for allocation sizes which are getting allocated but not
> freed, and then you can go back to gdb to look for allocations of those
> specific sizes.

I guess I should look for both, address and size, to match it better.
> This generates a massive amount of output, and it bypasses the logging
> configuration and goes directly to stderr -- so it might not end up where
> you expect.

Yes, massive, like I said. Impossible to use. The file system fills up rapidly. I made it so that it can be turned on and off, with the debugger:
int _alloc_info = 0;

#ifdef HAVE_ALLOCINFO
#define AllocFreeInfo(_cxt, _chunk) \
        if (_alloc_info) \
            fprintf(stderr, "AllocFree: %s: %p, %zu\n", \
                (_cxt)->header.name, (_chunk), (_chunk)->size)
#define AllocAllocInfo(_cxt, _chunk) \
        if (_alloc_info) \
            fprintf(stderr, "AllocAlloc: %s: %p, %zu\n", \
                (_cxt)->header.name, (_chunk), (_chunk)->size)
#else
#define AllocFreeInfo(_cxt, _chunk)
#define AllocAllocInfo(_cxt, _chunk)
#endif
So with this I do:
(gdb) b AllocSetAlloc
(gdb) cont
(gdb) set _alloc_info=1
(gdb) disable
(gdb) cont
then I wait, ... until it crashes again ... no, it's too much. It fills up my filesystem in no time with the logs. It produced 3 GB in just a minute of run time.
And also, I doubt we can find anything specifically by allocation size. It's just going to be 512 or whatever.
Isn't there some other way?
I'm going to try without that DISTINCT step, or perhaps by dismantling this query until it works without this excessive memory growth.
-Gunther
Gunther <raj@gusw.net> writes:
> Tom (BTW, your mail server rejects my direct mail,

[ raised eyebrow ] It's coming through fine AFAICS.

> > I'm pretty sure that's not the droid we're looking for.
> > ExecHashJoinGetSavedTuple does palloc a new tuple, but it immediately
> > sticks it into a TupleTableSlot that will be responsible for freeing
> > it (when the next tuple is stuck into the same slot).

> I did continue a "few times", but few as in a dozen, it's always the same

Well, I still don't believe that ExecHashJoinGetSavedTuple is the issue. It has a mechanism for freeing the allocation at the right time, and if that were broken then all hash joins would be leaking. It's easy to prove that that's not so, both by experiment and by the lack of other reports.

It's barely conceivable that in your particular query, there's something acting to break that which doesn't manifest typically; but I think it's much more likely that you simply haven't found the culprit allocation. It's quite feasible that many many ExecHashJoinGetSavedTuple calls would go by in between problem allocations.

Another line of thought is that maybe the problem is with realloc'ing something larger and larger? You could try trapping AllocSetRealloc to see.

(BTW, it looks like we *do* have a leak with simple hash joins in HEAD. But not v11.)

regards, tom lane
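Trapping the reallocs can work the same way as the conditional AllocSetAlloc breakpoint shown earlier (a sketch; AllocSetRealloc likewise takes the context as its first argument):

(gdb) b AllocSetRealloc if strcmp(context->name, "ExecutorState") == 0
(gdb) cont
(gdb) bt     # at each stop, see who is growing an ExecutorState chunk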
On 2019-Apr-15, Tom Lane wrote:
> It's barely conceivable that in your particular query, there's something
> acting to break that which doesn't manifest typically; but I think it's
> much more likely that you simply haven't found the culprit allocation.
> It's quite feasible that many many ExecHashJoinGetSavedTuple calls would
> go by in between problem allocations.

A possibly useful thing to do is use "commands" in gdb to print out a stack trace for each allocation that touches the problem memory context and collect them into some output, then classify the allocations based on the stack trace on each. No need to do it manually.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
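A sketch of that automation, assuming the conditional breakpoint is the one just set and gdb's logging is used to collect the output (the log file name is illustrative):

(gdb) set pagination off
(gdb) set logging file alloc-traces.log
(gdb) set logging on
(gdb) b AllocSetAlloc if strcmp(context->name, "ExecutorState") == 0
(gdb) commands
> silent
> bt 8
> cont
> end
(gdb) cont

Afterwards, the backtraces collected in alloc-traces.log can be grouped and counted (for example with sort and uniq -c) to see which call path dominates the allocations.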
> I'm going to try without that DISTINCT step, or perhaps by dismantling this
> query until it works without this excessive memory growth.
It also failed. Out of memory. The resident memory size of the backend was 1.5 GB before it crashed.
TopMemoryContext: 4335600 total in 8 blocks; 41208 free (16 chunks); 4294392 used
  HandleParallelMessages: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
  TableSpace cache: 8192 total in 1 blocks; 2096 free (0 chunks); 6096 used
  Type information cache: 24352 total in 2 blocks; 2624 free (0 chunks); 21728 used
  Operator lookup cache: 24576 total in 2 blocks; 10760 free (3 chunks); 13816 used
  pgstat TabStatusArray lookup hash table: 8192 total in 1 blocks; 416 free (0 chunks); 7776 used
  TopTransactionContext: 8192 total in 1 blocks; 5416 free (2 chunks); 2776 used
  RowDescriptionContext: 8192 total in 1 blocks; 6896 free (0 chunks); 1296 used
  MessageContext: 524288 total in 7 blocks; 186848 free (7 chunks); 337440 used
  Operator class cache: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used
  smgr relation table: 32768 total in 3 blocks; 16832 free (8 chunks); 15936 used
  TransactionAbortContext: 32768 total in 1 blocks; 32512 free (0 chunks); 256 used
  Portal hash: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used
  TopPortalContext: 8192 total in 1 blocks; 7664 free (0 chunks); 528 used
    PortalHoldContext: 24632 total in 2 blocks; 7392 free (0 chunks); 17240 used
    PortalContext: 1105920 total in 138 blocks; 10368 free (8 chunks); 1095552 used:
      ExecutorState: 2238648944 total in 266772 blocks; 3726944 free (16276 chunks); 2234922000 used
        HashTableContext: 16384 total in 2 blocks; 4032 free (5 chunks); 12352 used
          HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        HashTableContext: 8192 total in 1 blocks; 7320 free (0 chunks); 872 used
          HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        HashTableContext: 8192 total in 1 blocks; 7320 free (0 chunks); 872 used
          HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        HashTableContext: 8192 total in 1 blocks; 7752 free (0 chunks); 440 used
          HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used
        HashTableContext: 8192 total in 1 blocks; 7624 free (0 chunks); 568 used
          HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used
        TupleSort main: 286912 total in 8 blocks; 246792 free (39 chunks); 40120 used
        TupleSort main: 286912 total in 8 blocks; 246792 free (39 chunks); 40120 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        HashTableContext: 8454256 total in 6 blocks; 64848 free (32 chunks); 8389408 used
          HashBatchContext: 66935744 total in 2037 blocks; 7936 free (0 chunks); 66927808 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
  Relcache by OID: 16384 total in 2 blocks; 3512 free (2 chunks); 12872 used
  CacheMemoryContext: 1101328 total in 14 blocks; 383480 free (0 chunks); 717848 used
    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_toast_2619_index
    index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: entity_id_fkidx
    index info: 2048 total in 2 blocks; 696 free (1 chunks); 1352 used: entity_id_idx
    index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: act_id_fkidx
    index info: 2048 total in 2 blocks; 696 free (1 chunks); 1352 used: act_id_idx
    index info: 2048 total in 2 blocks; 592 free (1 chunks); 1456 used: pg_constraint_conrelid_contypid_conname_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: actrelationship_pkey
    index info: 2048 total in 2 blocks; 624 free (1 chunks); 1424 used: actrelationship_target_idx
    index info: 2048 total in 2 blocks; 624 free (1 chunks); 1424 used: actrelationship_source_idx
    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: documentinformation_pk
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_statistic_ext_relid_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: docinfsubj_ndx_seii
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: docinfsubj_ndx_sbjentcodeonly
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_index_indrelid_index
    index info: 2048 total in 2 blocks; 648 free (2 chunks); 1400 used: pg_db_role_setting_databaseid_rol_index
    index info: 2048 total in 2 blocks; 624 free (2 chunks); 1424 used: pg_opclass_am_name_nsp_index
    index info: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used: pg_foreign_data_wrapper_name_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_enum_oid_index
    index info: 2048 total in 2 blocks; 680 free (2 chunks); 1368 used: pg_class_relname_nsp_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_foreign_server_oid_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_publication_pubname_index
    index info: 2048 total in 2 blocks; 592 free (3 chunks); 1456 used: pg_statistic_relid_att_inh_index
    index info: 2048 total in 2 blocks; 680 free (2 chunks); 1368 used: pg_cast_source_target_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_language_name_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_transform_oid_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_collation_oid_index
    index info: 3072 total in 2 blocks; 1136 free (2 chunks); 1936 used: pg_amop_fam_strat_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_index_indexrelid_index
    index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_ts_template_tmplname_index
    index info: 2048 total in 2 blocks; 704 free (3 chunks); 1344 used: pg_ts_config_map_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_opclass_oid_index
    index info: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used: pg_foreign_data_wrapper_oid_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_event_trigger_evtname_index
    index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_statistic_ext_name_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_publication_oid_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_ts_dict_oid_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_event_trigger_oid_index
    index info: 3072 total in 2 blocks; 1216 free (3 chunks); 1856 used: pg_conversion_default_index
    index info: 3072 total in 2 blocks; 1136 free (2 chunks); 1936 used: pg_operator_oprname_l_r_n_index
    index info: 2048 total in 2 blocks; 680 free (2 chunks); 1368 used: pg_trigger_tgrelid_tgname_index
    index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_enum_typid_label_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_ts_config_oid_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_user_mapping_oid_index
    index info: 2048 total in 2 blocks; 704 free (3 chunks); 1344 used: pg_opfamily_am_name_nsp_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_foreign_table_relid_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_type_oid_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_aggregate_fnoid_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_constraint_oid_index
    index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_rewrite_rel_rulename_index
    index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_ts_parser_prsname_index
    index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_ts_config_cfgname_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_ts_parser_oid_index
    index info: 2048 total in 2 blocks; 728 free (1 chunks); 1320 used: pg_publication_rel_prrelid_prpubid_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_operator_oid_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_namespace_nspname_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_ts_template_oid_index
    index info: 2048 total in 2 blocks; 624 free (2 chunks); 1424 used: pg_amop_opr_fam_index
    index info: 2048 total in 2 blocks; 672 free (3 chunks); 1376 used: pg_default_acl_role_nsp_obj_index
    index info: 2048 total in 2 blocks; 704 free (3 chunks); 1344 used: pg_collation_name_enc_nsp_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_publication_rel_oid_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_range_rngtypid_index
    index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_ts_dict_dictname_index
    index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_type_typname_nsp_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_opfamily_oid_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_statistic_ext_oid_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_class_oid_index
    index info: 2048 total in 2 blocks; 704 free (3 chunks); 1344 used: pg_proc_proname_args_nsp_index
    index info: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used: pg_partitioned_table_partrelid_index
    index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_transform_type_lang_index
    index info: 2048 total in 2 blocks; 680 free (2 chunks); 1368 used: pg_attribute_relid_attnum_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_proc_oid_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_language_oid_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_namespace_oid_index
    index info: 3072 total in 2 blocks; 1136 free (2 chunks); 1936 used: pg_amproc_fam_proc_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_foreign_server_name_index
    index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_attribute_relid_attnam_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_conversion_oid_index
    index info: 2048 total in 2 blocks; 728 free (1 chunks); 1320 used: pg_user_mapping_user_server_index
    index info: 2048 total in 2 blocks; 728 free (1 chunks); 1320 used: pg_subscription_rel_srrelid_srsubid_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_sequence_seqrelid_index
    index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_conversion_name_nsp_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_authid_oid_index
    index info: 2048 total in 2 blocks; 728 free (1 chunks); 1320 used: pg_auth_members_member_role_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_subscription_oid_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_tablespace_oid_index
    index info: 2048 total in 2 blocks; 704 free (3 chunks); 1344 used: pg_shseclabel_object_index
    index info: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used: pg_replication_origin_roname_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_database_datname_index
    index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_subscription_subname_index
    index info: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used: pg_replication_origin_roiident_index
    index info: 2048 total in 2 blocks; 728 free (1 chunks); 1320 used: pg_auth_members_role_member_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_database_oid_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_authid_rolname_index
  WAL record construction: 49768 total in 2 blocks; 6368 free (0 chunks); 43400 used
  PrivateRefCount: 8192 total in 1 blocks; 2624 free (0 chunks); 5568 used
  MdSmgr: 8192 total in 1 blocks; 7208 free (1 chunks); 984 used
  LOCALLOCK hash: 16384 total in 2 blocks; 4600 free (2 chunks); 11784 used
  Timezones: 104120 total in 2 blocks; 2624 free (0 chunks); 101496 used
  ErrorContext: 8192 total in 1 blocks; 7936 free (5 chunks); 256 used
Grand total: 2322733048 bytes in 269225 blocks; 5406896 free (16556 chunks); 2317326152 used
So what I am wondering now is that there seems to be an EXPLOSION of memory consumption near the time of the crash. That ExecutorState has 2,238,648,944 bytes, but until the very last second(s) the RES memory as per top was 1.5 GB, I swear. I looked at it. It went like this:
1.5 GB for a very long time
1.1 GB -- and I thought, yeah! it worked! it's shrinking now
and then it was gone, and there was the memory error.
So how can top tell me 1.5 GB while here the ExecutorState allocations alone amount to over 2 GB???
And isn't even 1.5 GB way too big?
Is there a way of dumping that memory map info during normal runtime, by calling a function with the debugger? So one can watch how it grows? It's like checking out memory leaks in Java, where I keep looking at the heap_info summary. Tom Lane said that this ExecutorState should not grow to anything like this size, right?
-Gunther
Gunther <raj@gusw.net> writes:
> Is there a way of dumping that memory map info during normal runtime, by
> calling a function with the debugger?

Sure, "call MemoryContextStats(TopMemoryContext)" (or actually, since you know which context is the problematic one, just print that one context)

regards, tom lane
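For anyone following along, the full workflow is roughly this sketch (pg_backend_pid() gives the PID of the session's backend, and the stats print to that backend's stderr, which usually means the server log; <backend pid> is a placeholder):

-- in the psql session that runs the query:
SELECT pg_backend_pid();

$ gdb -p <backend pid>
(gdb) call MemoryContextStats(TopMemoryContext)
(gdb) detach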
Gunther <raj@gusw.net> writes:
> So what I am wondering now, is there seems to be an EXPLOSION of memory
> consumption near the time of the crash. That ExecutorState has
> 2,238,648,944 but just until the very last second(s) the RES memory as
> per top was 1.5 GB I swear.

That's not hugely surprising really, especially in a complex query. It could be performing some preliminary join that doesn't leak, and then when it starts to perform the join that does have the leak, kaboom. Also, given that you seem to be invoking multi-batch joins, maybe the preliminary phase is fine and there's only a leak when reading back a batch.

Anyway, the upshot is that you need to investigate what's happening while the memory consumption is increasing. The behavior before that starts to happen isn't going to be very interesting. It might be a bit tricky to catch that if it only takes a few seconds to blow up, but you could try "c 10000" or so to step through a lot of AllocSetAlloc calls, repeating till the bad stuff starts to happen, and then going back to looking at just where the calls are coming from.

regards, tom lane
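A sketch of the loop Tom describes; the context-name condition is the one that comes up later in this thread, and any filter that narrows the breakpoint to ExecutorState would do:

(gdb) break AllocSetAlloc if (int)strcmp(context->name, "ExecutorState") == 0
(gdb) c 10000
... repeat "c 10000" while watching RES in top; once it starts climbing ...
(gdb) bt 8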
I saw your replies. If there was a way of using gdb commands to have a conditional breakpoint which will only fire if the n-th caller in the chain is not a certain source location, then one could exclude the bulk of these allocations and focus better.
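As it happens, a gdb built with Python support has a convenience function for roughly this; a sketch, where the function name and the frame depth of 2 (AllocSetAlloc <- palloc <- caller) are assumptions picked to match the call chains shown later in the thread:

(gdb) break AllocSetAlloc if (int)strcmp(context->name, "ExecutorState") == 0 && !$_caller_is("ExecHashJoinGetSavedTuple", 2)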
But I decided I try to re-factor this query. And I made an interesting observation.
There is a left outer join in parentheses
... LEFT OUTER JOIN (SELECT ....) q ...
the biggest parenthesized subquery. I turned this into a temporary table, tmp_bulk. Then I changed the main query to
... LEFT OUTER JOIN tmp_bulk q ...
Now I am running it again. But what I noticed is that the tmp_bulk table is tiny! It only has about 250 rows. So this means the vast majority of the rows on the other side of that join are unmatched. The plan is all different now. Heavy CPU% load. Must be merge sorting? No memory growth, not yet.
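For reference, the refactoring amounts to something like the following sketch; tmp_bulk and the join keys are taken from the surrounding discussion, while the subquery body is a placeholder, since the real query is too big to post:

CREATE TEMPORARY TABLE tmp_bulk AS
  SELECT ... ;   -- the big parenthesized subquery, verbatim

ANALYZE tmp_bulk;  -- give the planner real row counts for the new table

-- and in the main query the subquery is replaced by the table:
... LEFT OUTER JOIN tmp_bulk q ON (...) ...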
PID   USER     PR NI VIRT    RES    SHR    S %CPU %MEM TIME+   COMMAND
5394  postgres 20 0  1284448 287880 271528 R 99.3 3.6  9:21.83 postgres: postgres integrator [local] EXPLAIN
5425  postgres 20 0  1278556 93184  82296  S 27.6 1.2  0:38.72 postgres: parallel worker for PID 5394
No, I never trust it when a database job has high CPU% and low IO for a long time. So I do
SET ENABLE_MERGEJOIN TO OFF;
and then do it again. Now I have high IO and low CPU%.
PID   USER     PR NI VIRT    RES    SHR    S %CPU %MEM TIME+    COMMAND
5394  postgres 20 0  1280904 282036 273616 D 2.3  3.6  13:01.51 postgres: postgres integrator [local] EXPLAIN
5510  postgres 20 0  1278892 87524  80804  D 2.3  1.1  0:05.20  postgres: parallel worker for PID 5394
5509  postgres 20 0  1278860 87484  80796  D 2.3  1.1  0:05.30  postgres: parallel worker for PID 5394
Still, I slip back into the high CPU% situation; I guess I'll have to wait it out ...
... and still waiting. No growth whatsoever. The plan is now so totally different that it probably won't trigger the problem.
The original plan that causes the leak involved right joins. This one involves only left joins. Even after ANALYZE tmp_bulk it still comes up with the same plan. And that plan isn't quick to succeed, but it also doesn't trigger the memory leak.
So what I can tell is this: that growth to 1.5 GB is consistently happening. It isn't just happening in the beginning and then the rest is just a follow-up problem. Also there seems to be a final spike in growth from 1.5 GB to 2.2 GB that happens inside a second. That seems very strange.
Back to the debugger and do a better job of conditional breakpoints ... I already have an idea how I'll do that: I set a flag when I enter the ...

> Anyway, the upshot is that you need to investigate what's happening while the memory consumption is increasing. The behavior before that starts to happen isn't going to be very interesting. It might be a bit tricky to catch that if it only takes a few seconds to blow up, but you could try "c 10000" or so to step through a lot of AllocSetAlloc calls, repeating till the bad stuff starts to happen, and then going back to looking at just where the calls are coming from.

Isn't 1.5 GB already way too big? There are 3 phases really:
- steady state at less than 500 MB
- slow massive growth from 1 GB to 1.5 - 1.8 GB
- explosion within 1 second, from whatever the final size of the slow massive growth was, to the final 2.2 GB
I thought that slow massive growth is already a sign of a leak?
I will now filter the calls that come through ExecHashJoinGetSavedTuple
I figured I can do this:
(gdb) info frame
Stack level 0, frame at 0x7ffcbf92fdd0:
 rip = 0x849030 in AllocSetAlloc (aset.c:718); saved rip = 0x84e7dd
 called by frame at 0x7ffcbf92fdf0
 source language c.
 Arglist at 0x7ffcbf92fdc0, args: context=0x29a6450, size=371
 Locals at 0x7ffcbf92fdc0, Previous frame's sp is 0x7ffcbf92fdd0
 Saved registers:
  rip at 0x7ffcbf92fdc8
So if the saved $rip is 0x84e7dd, then we are coming this way. Therefore I set my new breakpoint like this:
(gdb) b AllocSetAlloc if (int)strcmp(context->name, "ExecutorState") == 0 && *(int *)$rsp != 0x84e7dd
Breakpoint 6 at 0x849030: file aset.c, line 718.
(gdb) info b
Num     Type           Disp Enb Address            What
6       breakpoint     keep y   0x0000000000849030 in AllocSetAlloc at aset.c:718
        stop only if (int)strcmp(context->name, "ExecutorState") == 0 && *(int *)$rsp != 0x84e7dd
And there we go:
Breakpoint 6, AllocSetAlloc (context=0x29a6450, size=8) at aset.c:718
718     {
(gdb) bt 8
#0  AllocSetAlloc (context=0x29a6450, size=8) at aset.c:718
#1  0x000000000084e8ad in palloc0 (size=size@entry=8) at mcxt.c:969
#2  0x0000000000702b63 in makeBufFileCommon (nfiles=nfiles@entry=1) at buffile.c:119
#3  0x0000000000702e4c in makeBufFile (firstfile=68225) at buffile.c:138
#4  BufFileCreateTemp (interXact=interXact@entry=false) at buffile.c:201
#5  0x000000000061060b in ExecHashJoinSaveTuple (tuple=0x2ba1018, hashvalue=<optimized out>, fileptr=0x6305b00) at nodeHashjoin.c:1220
#6  0x000000000060d766 in ExecHashTableInsert (hashtable=hashtable@entry=0x2b50ad8, slot=<optimized out>, hashvalue=<optimized out>) at nodeHash.c:1663
#7  0x0000000000610c8f in ExecHashJoinNewBatch (hjstate=0x29a6be0) at nodeHashjoin.c:1051
(More stack frames follow...)
and on
(gdb) info frame
Stack level 0, frame at 0x7ffcbf92fd90:
 rip = 0x849030 in AllocSetAlloc (aset.c:718); saved rip = 0x84e8ad
 called by frame at 0x7ffcbf92fdb0
 source language c.
 Arglist at 0x7ffcbf92fd80, args: context=0x29a6450, size=8
 Locals at 0x7ffcbf92fd80, Previous frame's sp is 0x7ffcbf92fd90
 Saved registers:
  rip at 0x7ffcbf92fd88
(gdb) b AllocSetAlloc if (int)strcmp(context->name, "ExecutorState") == 0 && *(int *)$rsp != 0x84e7dd && 0x84e8ad != *(int *)$rsp
Note: breakpoint 6 also set at pc 0x849030.
Breakpoint 7 at 0x849030: file aset.c, line 718.
(gdb) delete 6
Now if I continue I don't seem to be stopping any more.
Does this help now?
-Gunther
It is confirmed: these two call paths are the only ones. At least probably the only ones that occur with any frequency.
PID   USER     PR NI VIRT    RES    SHR    S %CPU %MEM TIME+   COMMAND
28576 postgres 20 0  2695304 1.0g   200764 R 11.3 13.8 4:20.13 postgres: postgres integrator [local] EXPLAIN
28580 postgres 20 0  646616  432784 36968  S 98.7 5.5  8:53.28 gdb -p 28576
There is a problem with gdb: it also has a memory leak, and it is very expensive with the checking of my conditional breakpoint. So I can't run it all the way through.
Also, here is what I captured with
(gdb) call MemoryContextStats(TopPortalContext)
TopPortalContext: 8192 total in 1 blocks; 7664 free (0 chunks); 528 used PortalHoldContext: 24632 total in 2 blocks; 7392 free (0 chunks); 17240 used PortalContext: 1482752 total in 184 blocks; 11216 free (8 chunks); 1471536 used: ExecutorState: 1369337168 total in 163397 blocks; 248840 free (36 chunks); 1369088328 used HashTableContext: 32768 total in 3 blocks; 17304 free (10 chunks); 15464 used HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashTableContext: 8192 total in 1 blocks; 7320 free (0 chunks); 872 used HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashTableContext: 8192 total in 1 blocks; 7320 free (0 chunks); 872 used HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashTableContext: 8192 total in 1 blocks; 7752 free (0 chunks); 440 used HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used HashTableContext: 8192 total in 1 blocks; 7624 free (0 chunks); 568 used HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used TupleSort main: 32824 total in 2 blocks; 144 free (0 chunks); 32680 used Caller tuples: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashTableContext: 8454256 total in 6 blocks; 64848 free (32 chunks); 8389408 used HashBatchContext: 106640 total in 3 blocks; 7936 free (0 chunks); 98704 used TupleSort main: 452880 total in 8 blocks; 126248 free (27 chunks); 326632 used Caller tuples: 4194304 total in 10 blocks; 1496136 free (20 chunks); 2698168 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ... ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used Grand total: 1384601904 bytes in 163660 blocks; 2303840 free (145 chunks); 1382298064 used
(gdb) info break
Num     Type           Disp Enb Address            What
1       breakpoint     keep y   0x0000000000849030 in AllocSetAlloc at aset.c:718
        stop only if (int)strcmp(context->name, "ExecutorState") == 0 && *(int *)$rsp != 0x84e7dd && 0x84e8ad != *(int *)$rsp
        breakpoint already hit 4 times
(gdb) delete 1
(gdb) break AllocSetAlloc if (int)strcmp(context->name, "ExecutorState") == 0 && *(int *)$rsp != 0x84e7dd
Breakpoint 2 at 0x849030: file aset.c, line 718.
(gdb) cont
Continuing.
^C
Error in testing breakpoint condition:
Quit
Breakpoint 2, AllocSetAlloc (context=0x2a1d190, size=381) at aset.c:718
718     {
(gdb) bt 4
#0  AllocSetAlloc (context=0x2a1d190, size=381) at aset.c:718
#1  0x000000000084e7dd in palloc (size=381) at mcxt.c:938
#2  0x00000000006101bc in ExecHashJoinGetSavedTuple (file=file@entry=0x4b4a198, hashvalue=hashvalue@entry=0x7ffcbf92fe5c, tupleSlot=0x2ae0ab8, hjstate=0x2a1d920) at nodeHashjoin.c:1277
#3  0x0000000000610ca3 in ExecHashJoinNewBatch (hjstate=0x2a1d920) at nodeHashjoin.c:1042
(More stack frames follow...)
(gdb) call MemoryContextStats(TopPortalContext)
TopPortalContext: 8192 total in 1 blocks; 7664 free (0 chunks); 528 used
  PortalHoldContext: 24632 total in 2 blocks; 7392 free (0 chunks); 17240 used
  PortalContext: 1482752 total in 184 blocks; 11216 free (8 chunks); 1471536 used:
    ExecutorState: 1369337168 total in 163397 blocks; 248840 free (36 chunks); 1369088328 used
exact same as before:
ExecutorState: 1369337168 total in 163397 blocks; 248840 free (36 chunks); 1369088328 used
but now we get an increase to:
ExecutorState: 1369345496 total in 163398 blocks; 248840 free (36 chunks); 1369096656 used
(gdb) cont
Continuing.
Breakpoint 2, AllocSetAlloc (context=0x2a1d190, size=8) at aset.c:718
718     {
(gdb) bt 4
#0  AllocSetAlloc (context=0x2a1d190, size=8) at aset.c:718
#1  0x000000000084e8ad in palloc0 (size=size@entry=8) at mcxt.c:969
#2  0x0000000000702b63 in makeBufFileCommon (nfiles=nfiles@entry=1) at buffile.c:119
#3  0x0000000000702e4c in makeBufFile (firstfile=163423) at buffile.c:138
(More stack frames follow...)
(gdb) call MemoryContextStats(TopPortalContext)
ExecutorState: 1369345496 total in 163398 blocks; 248816 free (36 chunks); 1369096680 used
ExecutorState: 1369345496 total in 163398 blocks; 248792 free (36 chunks); 1369096704 used
ExecutorState: 1369345496 total in 163398 blocks; 248792 free (36 chunks); 1369096704 used

I don't see a growth between individual invocations. Anyway, these are the two ways to get there:
(gdb) bt 4
#0  AllocSetAlloc (context=0x2a1d190, size=4) at aset.c:718
#1  0x000000000084e7dd in palloc (size=size@entry=4) at mcxt.c:938
#2  0x0000000000702e59 in makeBufFile (firstfile=163423) at buffile.c:140
#3  BufFileCreateTemp (interXact=interXact@entry=false) at buffile.c:201
(More stack frames follow...)
(gdb) cont
Continuing.
Breakpoint 3, AllocSetAlloc (context=0x2a1d190, size=394) at aset.c:718
718     {
(gdb) bt 3
#0  AllocSetAlloc (context=0x2a1d190, size=394) at aset.c:718
#1  0x000000000084e7dd in palloc (size=394) at mcxt.c:938
#2  0x00000000006101bc in ExecHashJoinGetSavedTuple (file=file@entry=0x4b4a198, hashvalue=hashvalue@entry=0x7ffcbf92fe5c, tupleSlot=0x2ae0ab8, hjstate=0x2a1d920) at nodeHashjoin.c:1277
(More stack frames follow...)
ExecutorState: 1369353824 total in 163399 blocks; 248792 free (36 chunks); 1369105032 used

It increases every 3 times I stop at the breakpoint.
Gunther <raj@gusw.net> writes:
> And there we go:
> Breakpoint 6, AllocSetAlloc (context=0x29a6450, size=8) at aset.c:718
> 718     {
> (gdb) bt 8
> #0  AllocSetAlloc (context=0x29a6450, size=8) at aset.c:718
> #1  0x000000000084e8ad in palloc0 (size=size@entry=8) at mcxt.c:969
> #2  0x0000000000702b63 in makeBufFileCommon (nfiles=nfiles@entry=1) at buffile.c:119
> #3  0x0000000000702e4c in makeBufFile (firstfile=68225) at buffile.c:138
> #4  BufFileCreateTemp (interXact=interXact@entry=false) at buffile.c:201
> #5  0x000000000061060b in ExecHashJoinSaveTuple (tuple=0x2ba1018, hashvalue=<optimized out>, fileptr=0x6305b00) at nodeHashjoin.c:1220
> #6  0x000000000060d766 in ExecHashTableInsert (hashtable=hashtable@entry=0x2b50ad8, slot=<optimized out>, hashvalue=<optimized out>) at nodeHash.c:1663
> #7  0x0000000000610c8f in ExecHashJoinNewBatch (hjstate=0x29a6be0) at nodeHashjoin.c:1051

Hmm ... this matches up with a vague thought I had that for some reason the hash join might be spawning a huge number of separate batches. Each batch would have a couple of files with associated in-memory state including an 8K I/O buffer, so you could account for the "slow growth" behavior you're seeing by periodic decisions to increase the number of batches. You might try watching calls to ExecHashIncreaseNumBatches and see if that theory holds water.

This could only happen with a very unfriendly distribution of the hash keys, I think. There's a heuristic in there to shut off growth of nbatch if we observe that we're making no progress at all, but perhaps this is a skewed distribution that's not quite skewed enough to trigger that.

regards, tom lane
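One way to watch those calls without stopping at every hit is a sketch like this, assuming gdb's breakpoint command lists are available and that the hashtable argument is accessible in that frame (nbatch is a plain int field of the hash table struct):

(gdb) break ExecHashIncreaseNumBatches
(gdb) commands
> silent
> printf "increasing nbatch from %d\n", hashtable->nbatch
> continue
> end
(gdb) continue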
Jeff Janes had more:
> Breakpoint 2, AllocSetAlloc (context=0x1168230, size=8272) at aset.c:715
> 715     {
> (gdb) p context->name
> $8 = 0x96ce5b "ExecutorState"
>
> I think that the above one might have been the one you wanted.

Not sure how you could tell that? It's the same place as everything else. If we can find out what you're looking for, maybe we can set a break point earlier up the call chain?
But why? If I see the memory still go up insanely fast, isn't that a sign of the leak? I guess I should run this for a little longer. So I disable my breakpoints.
It went up pretty quickly from 1.2 GB to 1.5 GB, but then it stopped growing fast, so now back to gdb and break:
> Unfortunately, I think this means you missed your opportunity and are now getting backtraces of the innocent bystanders.
> Particularly since you report that the version using nested loops rather than hash joins also leaked, so it is probably not the hash-join specific code that is doing it.

How about it's in the DISTINCT? I noticed while peeking up the call chain that it was already in the UNIQUE sort thing also. I guess it's streaming the results from the hash join right into the unique sort step.
> What I've done before is compile with the comments removed from src/backend/utils/mmgr/aset.c:
>
> /* #define HAVE_ALLOCINFO */

I have just done that, and it creates an insane amount of output from all the processes; I'm afraid there will be no way to keep that stuff separated. If there was a way of turning that on and off for one process only, then we could probably get more info ...
Everything is also extremely slow that way. Like in a half hour the memory didn't even reach 100 MB.
> ... and then look for allocation sizes which are getting allocated but not freed, and then you can go back to gdb to look for allocations of those specific sizes.

I guess I should look for both, address and size, to match it better.
> This generates a massive amount of output, and it bypasses the logging configuration and goes directly to stderr -- so it might not end up where you expect.

Yes, massive, like I said. Impossible to use. File system fills up rapidly. I made it so that it can be turned on and off with the debugger:

int _alloc_info = 0;
#ifdef HAVE_ALLOCINFO
#define AllocFreeInfo(_cxt, _chunk) \
        if (_alloc_info) \
                fprintf(stderr, "AllocFree: %s: %p, %zu\n", \
                        (_cxt)->header.name, (_chunk), (_chunk)->size)
#define AllocAllocInfo(_cxt, _chunk) \
        if (_alloc_info) \
                fprintf(stderr, "AllocAlloc: %s: %p, %zu\n", \
                        (_cxt)->header.name, (_chunk), (_chunk)->size)
#else
#define AllocFreeInfo(_cxt, _chunk)
#define AllocAllocInfo(_cxt, _chunk)
#endif
so with this I do
(gdb) b AllocSetAlloc
(gdb) cont
(gdb) set _alloc_info=1
(gdb) disable
(gdb) cont
then I wait, ... until it crashes again ... no, it's too much. It fills up my filesystem in no time with the logs. It produced 3 GB in just a minute of run time.
And also, I doubt we can find anything specifically by allocation size. It's just going to be 512 or whatever.
Isn't there some other way?
Jeff Janes <jeff.janes@gmail.com> writes:
> On Mon, Apr 15, 2019 at 9:49 PM Gunther <raj@gusw.net> wrote:
>> Isn't there some other way?
> I wonder if valgrind or something like that could be of use. I don't know
> enough about those tools to know. One problem is that this is not really a
> leak. If the query completed successfully, it would have freed the
> memory. And when the query completed with an error, it also freed the
> memory. So it is just an inefficiency, not a true leak, and leak-detection
> tools might not work. But as I said, I have not studied them.

valgrind is a useful idea, given that Gunther is building his own postgres (so he could compile it with -DUSE_VALGRIND + --enable-cassert, which are needed to get valgrind to understand palloc allocations). I don't recall details right now, but it is possible to trigger a valgrind report intra-session similar to what you get by default at process exit. You could wait till the memory has bloated a good deal and then ask for one of those reports that classify allocations by call chain (I think you want the memcheck tool for this, not the default valgrind tool).

However --- at least for the case involving hash joins, I think we have a decent fix on the problem location already: it seems to be a matter of continually deciding to increase nbatch, and now what we need to investigate is why that's happening. If there's a leak that shows up without any hash joins in the plan, then that's a separate matter for investigation.

regards, tom lane
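For anyone trying this, a rough sketch of the setup Tom describes; the exact flags are from memory, so double-check them against the valgrind and PostgreSQL documentation before relying on them:

# build a debug-friendly postgres that tells valgrind about palloc chunks
./configure --enable-debug --enable-cassert CPPFLAGS=-DUSE_VALGRIND
make && make install

# start the server under valgrind's memcheck tool
valgrind --tool=memcheck --leak-check=full --trace-children=yes \
    postgres -D /path/to/data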
On 15/04/2019 08:23, Gunther wrote:
> For weeks now, I am banging my head at an "out of memory" situation.
> There is only one query I am running on an 8 GB system, whatever I
> try, I get knocked out on this out of memory.
> [...]

I strongly suggest having a swap file. I've got 32 GB, and I've used 2.9 GB of my swap space after 4 days, but I'm not really pushing my system. For me, mostly stuff that is only used once, or not at all, is swapped out. If you do have a memory leak, then it might be easier to diagnose if you don't run out of memory. I suspect that most things will run a little better with some swap space.

Cheers,
Gavin
> Breakpoint 6, AllocSetAlloc (context=0x29a6450, size=8) at aset.c:718
> 718     {
> (gdb) bt 8
> #0  AllocSetAlloc (context=0x29a6450, size=8) at aset.c:718
> #1  0x000000000084e8ad in palloc0 (size=size@entry=8) at mcxt.c:969
> #2  0x0000000000702b63 in makeBufFileCommon (nfiles=nfiles@entry=1) at buffile.c:119
> #3  0x0000000000702e4c in makeBufFile (firstfile=68225) at buffile.c:138
> #4  BufFileCreateTemp (interXact=interXact@entry=false) at buffile.c:201
> #5  0x000000000061060b in ExecHashJoinSaveTuple (tuple=0x2ba1018, hashvalue=<optimized out>, fileptr=0x6305b00) at nodeHashjoin.c:1220
> #6  0x000000000060d766 in ExecHashTableInsert (hashtable=hashtable@entry=0x2b50ad8, slot=<optimized out>, hashvalue=<optimized out>) at nodeHash.c:1663
> #7  0x0000000000610c8f in ExecHashJoinNewBatch (hjstate=0x29a6be0) at nodeHashjoin.c:1051
>
> Hmm ... this matches up with a vague thought I had that for some reason the hash join might be spawning a huge number of separate batches. Each batch would have a couple of files with associated in-memory state including an 8K I/O buffer, so you could account for the "slow growth" behavior you're seeing by periodic decisions to increase the number of batches. You might try watching calls to ExecHashIncreaseNumBatches and see if that theory holds water.
OK, checking that ... well yes, this breaks quickly into that; here is one backtrace:

#0  ExecHashIncreaseNumBatches (hashtable=hashtable@entry=0x2ae8ca8) at nodeHash.c:893
#1  0x000000000060d84a in ExecHashTableInsert (hashtable=hashtable@entry=0x2ae8ca8, slot=slot@entry=0x2ae0238, hashvalue=<optimized out>) at nodeHash.c:1655
#2  0x000000000060fd9c in MultiExecPrivateHash (node=<optimized out>) at nodeHash.c:186
#3  MultiExecHash (node=node@entry=0x2ac6dc8) at nodeHash.c:114
#4  0x00000000005fe42f in MultiExecProcNode (node=node@entry=0x2ac6dc8) at execProcnode.c:501
#5  0x000000000061073d in ExecHashJoinImpl (parallel=false, pstate=0x2a1dd40) at nodeHashjoin.c:290
#6  ExecHashJoin (pstate=0x2a1dd40) at nodeHashjoin.c:565
#7  0x00000000005fde88 in ExecProcNodeInstr (node=0x2a1dd40) at execProcnode.c:461
#8  0x000000000061ce6e in ExecProcNode (node=0x2a1dd40) at ../../../src/include/executor/executor.h:247
#9  ExecSort (pstate=0x2a1dc30) at nodeSort.c:107
#10 0x00000000005fde88 in ExecProcNodeInstr (node=0x2a1dc30) at execProcnode.c:461
#11 0x000000000061d2e4 in ExecProcNode (node=0x2a1dc30) at ../../../src/include/executor/executor.h:247
#12 ExecUnique (pstate=0x2a1d9b0) at nodeUnique.c:73
#13 0x00000000005fde88 in ExecProcNodeInstr (node=0x2a1d9b0) at execProcnode.c:461
#14 0x00000000005f75da in ExecProcNode (node=0x2a1d9b0) at ../../../src/include/executor/executor.h:247
#15 ExecutePlan (execute_once=<optimized out>, dest=0xcc60e0 <donothingDR>, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x2a1d9b0, estate=0x2a1d6c0) at execMain.c:1723
#16 standard_ExecutorRun (queryDesc=0x2a7a478, direction=<optimized out>, count=0, execute_once=<optimized out>) at execMain.c:364
#17 0x000000000059c718 in ExplainOnePlan (plannedstmt=plannedstmt@entry=0x2a787f8, into=into@entry=0x0, es=es@entry=0x28f1048, queryString=<optimized out>, params=0x0, queryEnv=queryEnv@entry=0x0, planduration=0x7ffcbf930080) at explain.c:535

But this is still in the warm-up phase; we don't know if it is at the place where memory grows too much.

> This could only happen with a very unfriendly distribution of the hash keys, I think. There's a heuristic in there to shut off growth of nbatch if we observe that we're making no progress at all, but perhaps this is a skewed distribution that's not quite skewed enough to trigger that.

Your hunch is pretty right on. There is something very weirdly distributed in this particular join situation.
Let's see if I can count the occurrences ... I do cont 100. Now resident memory slowly grows, but not too much, just 122 MB, and CPU is at 88%. I think we haven't hit the problematic part of the plan yet. There is a sort merge at some leaf, which I believe is innocent. My gut feeling, from the high CPU%, is that we are in one of those, since NL is disabled.
Next stage is that memory shot up to 264 MB and CPU% went down to 8.6. Heavy IO (write and read).
Yes! And now entering the 3rd stage, where memory shoots up to 600 MB. This is where it starts "breaking out". And only now the 100 breakpoint conts are used up. And within a second another 100. And even 1000 go by in a second. cont 10000 goes by in 4 seconds. And during that time resident memory increased to over 700 MB. Let's measure:
736096 + cont 10000 --> 740056; that is 3960 kB for 10000 conts, or 0.396 kB per cont. Prediction: cont 10000 will now arrive at 744016? Aaaand ... BINGO! 744016 exactly! cont 50000 will take about 20 seconds and will boost memory to 763816. Bets are on ... drumroll ... 35, 36, ... nope. This time it didn't pan out. Breakpoint already hit 75727 times, ignore next 5585 hits ... memory now 984052. So it took longer this time and the memory increment was larger. We are now getting toward the edge of the cliff. Before we go over it, here is the backtrace now:
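(Sanity check on the arithmetic: 740056 - 736096 = 3960 kB per 10000 continues; 740056 + 3960 = 744016, the confirmed prediction; and 744016 + 5 x 3960 = 763816, the cont-50000 bet. top's RES column is in kB, so each continue corresponds to roughly 400 bytes of resident growth.)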
#0  ExecHashIncreaseNumBatches (hashtable=hashtable@entry=0x2ae8ca8) at nodeHash.c:893
#1  0x000000000060d84a in ExecHashTableInsert (hashtable=hashtable@entry=0x2ae8ca8, slot=<optimized out>, hashvalue=<optimized out>) at nodeHash.c:1655
#2  0x0000000000610c8f in ExecHashJoinNewBatch (hjstate=0x2a1dd40) at nodeHashjoin.c:1051
#3  ExecHashJoinImpl (parallel=false, pstate=0x2a1dd40) at nodeHashjoin.c:539
#4  ExecHashJoin (pstate=0x2a1dd40) at nodeHashjoin.c:565
#5  0x00000000005fde88 in ExecProcNodeInstr (node=0x2a1dd40) at execProcnode.c:461
#6  0x000000000061ce6e in ExecProcNode (node=0x2a1dd40) at ../../../src/include/executor/executor.h:247
#7  ExecSort (pstate=0x2a1dc30) at nodeSort.c:107
(More stack frames follow...)
(gdb) bt 18
#0  ExecHashIncreaseNumBatches (hashtable=hashtable@entry=0x2ae8ca8) at nodeHash.c:893
#1  0x000000000060d84a in ExecHashTableInsert (hashtable=hashtable@entry=0x2ae8ca8, slot=<optimized out>, hashvalue=<optimized out>) at nodeHash.c:1655
#2  0x0000000000610c8f in ExecHashJoinNewBatch (hjstate=0x2a1dd40) at nodeHashjoin.c:1051
#3  ExecHashJoinImpl (parallel=false, pstate=0x2a1dd40) at nodeHashjoin.c:539
#4  ExecHashJoin (pstate=0x2a1dd40) at nodeHashjoin.c:565
#5  0x00000000005fde88 in ExecProcNodeInstr (node=0x2a1dd40) at execProcnode.c:461
#6  0x000000000061ce6e in ExecProcNode (node=0x2a1dd40) at ../../../src/include/executor/executor.h:247
#7  ExecSort (pstate=0x2a1dc30) at nodeSort.c:107
#8  0x00000000005fde88 in ExecProcNodeInstr (node=0x2a1dc30) at execProcnode.c:461
#9  0x000000000061d2e4 in ExecProcNode (node=0x2a1dc30) at ../../../src/include/executor/executor.h:247
#10 ExecUnique (pstate=0x2a1d9b0) at nodeUnique.c:73
#11 0x00000000005fde88 in ExecProcNodeInstr (node=0x2a1d9b0) at execProcnode.c:461
#12 0x00000000005f75da in ExecProcNode (node=0x2a1d9b0) at ../../../src/include/executor/executor.h:247
#13 ExecutePlan (execute_once=<optimized out>, dest=0xcc60e0 <donothingDR>, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x2a1d9b0, estate=0x2a1d6c0) at execMain.c:1723
#14 standard_ExecutorRun (queryDesc=0x2a7a478, direction=<optimized out>, count=0, execute_once=<optimized out>) at execMain.c:364
#15 0x000000000059c718 in ExplainOnePlan (plannedstmt=plannedstmt@entry=0x2a787f8, into=into@entry=0x0, es=es@entry=0x28f1048, queryString=<optimized out>, params=0x0, queryEnv=queryEnv@entry=0x0, planduration=0x7ffcbf930080) at explain.c:535
By the way, I ran the explain analyze of the plan while removing all the final result columns from the outer-most select, replacing them with simply SELECT 1 FROM .... And here is that plan. I am presenting it to you because you might glean something about whatever skewed distribution there is.
Hash Right Join (cost=4203858.53..5475530.71 rows=34619 width=4) (actual time=309603.384..459480.863 rows=113478386 loops=1) Hash Cond: (((q.documentinternalid)::text = (documentinformationsubject.documentinternalid)::text) AND ((r.targetinternalid)::text = (documentinformationsubject.actinternalid)::text)) -> Hash Right Join (cost=1341053.37..2611158.36 rows=13 width=74) (actual time=109807.980..109808.040 rows=236 loops=1) Hash Cond: (((documentinformationsubject_2.documentinternalid)::text = (q.documentinternalid)::text) AND ((documentinformationsubject_2.actinternalid)::text = (q.actinternalid)::text)) -> Gather (cost=30803.54..1300908.52 rows=1 width=74) (actual time=58730.915..58737.757 rows=0 loops=1) Workers Planned: 2 Workers Launched: 2 -> Parallel Hash Left Join (cost=29803.54..1299908.42 rows=1 width=74) (actual time=58723.378..58723.379 rows=0 loops=3) Hash Cond: ((documentinformationsubject_2.otherentityinternalid)::text = (agencyid.entityinternalid)::text) -> Parallel Hash Left Join (cost=28118.13..1298223.00 rows=1 width=111) (actual time=58713.650..58713.652 rows=0 loops=3) Hash Cond: ((documentinformationsubject_2.otherentityinternalid)::text = (agencyname.entityinternalid)::text) -> Parallel Seq Scan on documentinformationsubject documentinformationsubject_2 (cost=0.00..1268800.85 rows=1 width=111) (actual time=58544.391..58544.391 rows=0 loops=3) Filter: ((participationtypecode)::text = 'AUT'::text) Rows Removed by Filter: 2815562 -> Parallel Hash (cost=24733.28..24733.28 rows=166628 width=37) (actual time=125.611..125.611 rows=133303 loops=3) Buckets: 65536 Batches: 16 Memory Usage: 2336kB -> Parallel Seq Scan on bestname agencyname (cost=0.00..24733.28 rows=166628 width=37) (actual time=0.009..60.685 rows=133303 loops=3) -> Parallel Hash (cost=1434.07..1434.07 rows=20107 width=37) (actual time=9.329..9.329 rows=11393 loops=3) Buckets: 65536 Batches: 1 Memory Usage: 2976kB -> Parallel Seq Scan on entity_id agencyid (cost=0.00..1434.07 rows=20107 width=37) (actual time=0.008..5.224 rows=11393 loops=3) -> Hash (cost=1310249.63..1310249.63 rows=13 width=111) (actual time=51077.049..51077.049 rows=236 loops=1) Buckets: 1024 Batches: 1 Memory Usage: 41kB -> Hash Right Join (cost=829388.20..1310249.63 rows=13 width=111) (actual time=45607.852..51076.967 rows=236 loops=1) Hash Cond: ((an.actinternalid)::text = (q.actinternalid)::text) -> Seq Scan on act_id an (cost=0.00..425941.04 rows=14645404 width=37) (actual time=1.212..10883.350 rows=14676871 loops=1) -> Hash (cost=829388.19..829388.19 rows=1 width=111) (actual time=38246.715..38246.715 rows=236 loops=1) Buckets: 1024 Batches: 1 Memory Usage: 41kB -> Gather (cost=381928.46..829388.19 rows=1 width=111) (actual time=31274.733..38246.640 rows=236 loops=1) Workers Planned: 2 Workers Launched: 2 -> Parallel Hash Join (cost=380928.46..828388.09 rows=1 width=111) (actual time=31347.260..38241.812 rows=79 loops=3) Hash Cond: ((q.actinternalid)::text = (r.sourceinternalid)::text) -> Parallel Seq Scan on documentinformation q (cost=0.00..447271.93 rows=50050 width=74) (actual time=13304.439..20265.733 rows=87921 loops=3) Filter: (((classcode)::text = 'CNTRCT'::text) AND ((moodcode)::text = 'EVN'::text) AND ((code_codesystem)::text = '2.16.840.1.113883.3.26.1.1'::text)) Rows Removed by Filter: 1540625 -> Parallel Hash (cost=380928.44..380928.44 rows=1 width=74) (actual time=17954.106..17954.106 rows=79 loops=3) Buckets: 1024 Batches: 1 Memory Usage: 104kB -> Parallel Seq Scan on actrelationship r (cost=0.00..380928.44 rows=1 
width=74) (actual time=7489.704..17953.959 rows=79 loops=3) Filter: ((typecode)::text = 'SUBJ'::text) Rows Removed by Filter: 3433326 -> Hash (cost=2861845.87..2861845.87 rows=34619 width=74) (actual time=199792.446..199792.446 rows=113478127 loops=1) Buckets: 65536 (originally 65536) Batches: 131072 (originally 2) Memory Usage: 189207kB -> Gather Merge (cost=2845073.40..2861845.87 rows=34619 width=74) (actual time=107620.262..156256.432 rows=113478127 loops=1) Workers Planned: 2 Workers Launched: 2 -> Merge Left Join (cost=2844073.37..2856849.96 rows=14425 width=74) (actual time=107570.719..126113.792 rows=37826042 loops=3) Merge Cond: (((documentinformationsubject.documentinternalid)::text = (documentinformationsubject_1.documentinternalid)::text) AND ((documentinformationsubject.documentid)::text = (documentinformationsubject_1.documentid)::text) AND ((documentinformationsubject.actinternalid)::text = (documentinformationsubject_1.actinternalid)::text)) -> Sort (cost=1295969.26..1296005.32 rows=14425 width=111) (actual time=57700.723..58134.751 rows=231207 loops=3) Sort Key: documentinformationsubject.documentinternalid, documentinformationsubject.documentid, documentinformationsubject.actinternalid Sort Method: external merge Disk: 26936kB Worker 0: Sort Method: external merge Disk: 27152kB Worker 1: Sort Method: external merge Disk: 28248kB -> Parallel Seq Scan on documentinformationsubject (cost=0.00..1294972.76 rows=14425 width=111) (actual time=24866.656..57424.420 rows=231207 loops=3) Filter: (((participationtypecode)::text = ANY ('{PPRF,PRF}'::text[])) AND ((classcode)::text = 'ACT'::text) AND ((moodcode)::text = 'DEF'::text) AND ((code_codesystem)::text = '2.16.840.1.113883.3.26.1.1'::text)) Rows Removed by Filter: 2584355 -> Materialize (cost=1548104.12..1553157.04 rows=1010585 width=111) (actual time=49869.984..54191.701 rows=38060250 loops=3) -> Sort (cost=1548104.12..1550630.58 rows=1010585 width=111) (actual time=49869.980..50832.205 rows=1031106 loops=3) Sort Key: documentinformationsubject_1.documentinternalid, documentinformationsubject_1.documentid, documentinformationsubject_1.actinternalid Sort Method: external merge Disk: 122192kB Worker 0: Sort Method: external merge Disk: 122192kB Worker 1: Sort Method: external merge Disk: 122192kB -> Seq Scan on documentinformationsubject documentinformationsubject_1 (cost=0.00..1329868.64 rows=1010585 width=111) (actual time=20366.166..47751.267 rows=1031106 loops=3) Filter: ((participationtypecode)::text = 'PRD'::text) Rows Removed by Filter: 7415579Planning Time: 2.523 msExecution Time: 464825.391 ms (66 rows)
By the way, let me ask: do you have pretty-print functions I can call with, e.g., node in ExecProcNode, or pstate in ExecHashJoin? Because if there were, then we could determine where exactly in the current plan we are. And can I call the plan printer for the entire plan we are currently executing? Might it even give us preliminary counts of where in the process it is? (I ask the latter not only because it would be really useful for our present debugging, but also because it would be an awesome tool for monitoring of long running queries! Something I am sure tons of people would just love to have!)
BTW, I also read the other responses. I agree that having swap space available, just in case, is better than these annoying out of memory errors. And yes, I can add that memory profiler thing if you think it would actually work. I've done it with Java heap dumps, even upgrading the VM to a 32 GB VM just to crunch the heap dump. But can you tell me just a little more as to how I need to configure this thing to get the data you want without blowing up memory and disk during this huge query?
regards,
-Gunther
I wonder if it'd be useful to compile with ./configure CFLAGS=-DHJDEBUG=1
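If memory serves, HJDEBUG simply enables a handful of printf calls in the hash join code; the one in ExecHashIncreaseNumBatches looks roughly like this (quoted from memory, so treat the exact format string as an assumption and check the actual source):

#ifdef HJDEBUG
    printf("Hashjoin %p: increasing nbatch to %d because space = %zu\n",
           hashtable, nbatch, hashtable->spaceUsed);
#endif

so a debug build would report every batch increase without any gdb involvement.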
On 4/16/19 6:39 PM, Gavin Flower wrote:
> I suspect that most things will run a little better with some swap space.

Not always.

$ free
             total       used       free     shared    buffers     cached
Mem:      16254616   13120960    3133656      20820     646676   10765380
-/+ buffers/cache:    1708904   14545712
Swap:      4095996      17436    4078560

-- Jean-David Beyer, Shrewsbury, New Jersey
On 17/04/2019 18:01, Jean-David Beyer wrote:
> On 4/16/19 6:39 PM, Gavin Flower wrote:
>> I suspect that most things will run a little better with some swap space.
>
> Not always.
>
> $ free
>              total       used       free     shared    buffers     cached
> Mem:      16254616   13120960    3133656      20820     646676   10765380
> -/+ buffers/cache:    1708904   14545712
> Swap:      4095996      17436    4078560

It's unclear what point you're trying to make, and the stats you quote don't enlighten me.
Hi guys. I don't want to be pushy, but I found it strange that after so much lively back and forth getting to the bottom of this, suddenly my last night's follow-up remained completely without reply. I wonder if it even got received. For those who read their emails with modern readers (I know, I too am from a time where I wrote everything in plain text) I marked some important questions in bold.
On Wed, Apr 17, 2019 at 11:52:44PM -0400, Gunther wrote:
> Hi guys. I don't want to be pushy, but I found it strange that after
> so much lively back and forth getting to the bottom of this, suddenly
> my last night's follow-up remained completely without reply. I wonder
> if it even got received.

It was received (and it's visible in the archives). It's right before Easter, so I guess some people may already be on vacation.

As for the issue - I think the current hypothesis is that the data distribution is skewed in some strange way, triggering some unexpected behavior in hash join. That seems plausible, but it's really hard to investigate without knowing anything about the data distribution :-(

It would be possible to do at least one of these two things:

(a) export pg_stats info about distribution of the join keys

The number of tables involved in the query is not that high, and this would allow us to generate a data set approximating your data. The one thing this can't do is show how it's affected by WHERE conditions.

(b) export data for join keys

This is similar to (a), but it would allow filtering data by the WHERE conditions first. The amount of data would be higher, although we only need data from the columns used as join keys. Of course, if those key values contain sensitive data, it may not be possible, but perhaps you could hash it in some way.

regards

--
Tomas Vondra
http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Apr 17, 2019 at 11:52:44PM -0400, Gunther wrote:
> Hi guys. I don't want to be pushy, but I found it strange that after so much

Were you able to reproduce the issue in some minimized way? Like after joining fewer tables or changing to join with fewer join conditions?

On Thu, Apr 18, 2019 at 05:21:28PM +0200, Tomas Vondra wrote:
> As for the issue - I think the current hypothesis is that the data
> distribution is skewed in some strange way, triggering some unexpected
> behavior in hash join. That seems plausible, but it's really hard to
> investigate without knowing anything about the data distribution :-(
>
> It would be possible to do at least one of these two things:
>
> (a) export pg_stats info about distribution of the join keys

For starters, could you send the MCVs, maybe with some variation on this query?
https://wiki.postgresql.org/wiki/Slow_Query_Questions#Statistics:_n_distinct.2C_MCV.2C_histogram

Justin
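In case it helps, the kind of query Justin's link boils down to is roughly this sketch; the table and column names here are assumptions based on the join keys visible in the plans above:

SELECT tablename, attname, n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename IN ('documentinformationsubject', 'documentinformation')
  AND attname IN ('documentinternalid', 'actinternalid');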
> Were you able to reproduce the issue in some minimized way? Like after
> joining fewer tables or changing to join with fewer join conditions?

On Thu, Apr 18, 2019 at 05:21:28PM +0200, Tomas Vondra wrote:
> It would be possible to do at least one of these two things:
Thanks, and sorry for my pushiness. Yes, I have pinpointed the HashJoin, and I have created the two tables involved.
The data distribution of the join keys: they are all essentially UUIDs, and essentially random.
I am sharing this data with you. However, only someone who can actually control the planner can use it to reproduce the problem. I have tried but not succeeded. But I am sure the problem can be reproduced with this material.
Here is the part of the plan that generates this massive number of calls to
 ->  Hash Right Join  (cost=4255031.53..5530808.71 rows=34619 width=1197)
       Hash Cond: (((q.documentinternalid)::text = (documentinformationsubject.documentinternalid)::text) AND ((r.targetinternalid)::text = (documentinformationsubject.actinternalid)::text))
       ->  Hash Right Join  (cost=1341541.37..2612134.36 rows=13 width=341)
             Hash Cond: (((documentinformationsubject_2.documentinternalid)::text = (q.documentinternalid)::text) AND ((documentinformationsubject_2.actinternalid)::text = (q.actinternalid)::text))
             ... let's call this tmp_q ...
       ->  Hash  (cost=2908913.87..2908913.87 rows=34619 width=930)
             ->  Gather Merge  (cost=2892141.40..2908913.87 rows=34619 width=930)
                   ... let's call this tmp_r ...
This can be logically reduced to the following query:
SELECT * FROM tmp_q q RIGHT OUTER JOIN tmp_r r USING(documentInternalId, actInternalId);
with the following two tables:
CREATE TABLE tmp_q (
    documentinternalid        character varying(255),
    operationqualifiercode    character varying(512),
    operationqualifiername    character varying(512),
    actinternalid             character varying(255),
    approvalinternalid        character varying(255),
    approvalnumber            character varying(555),
    approvalnumbersystem      character varying(555),
    approvalstatecode         character varying(512),
    approvalstatecodesystem   character varying(512),
    approvaleffectivetimelow  character varying(512),
    approvaleffectivetimehigh character varying(512),
    approvalstatuscode        character varying(32),
    licensecode               character varying(512),
    agencyid                  character varying(555),
    agencyname                text
);

CREATE TABLE tmp_r (
    documentinternalid        character varying(255),
    is_current                character(1),
    documentid                character varying(555),
    documenttypecode          character varying(512),
    subjectroleinternalid     character varying(255),
    subjectentityinternalid   character varying(255),
    subjectentityid           character varying(555),
    subjectentityidroot       character varying(555),
    subjectentityname         character varying,
    subjectentitytel          text,
    subjectentityemail        text,
    otherentityinternalid     character varying(255),
    confidentialitycode       character varying(512),
    actinternalid             character varying(255),
    operationcode             character varying(512),
    operationname             text,
    productitemcode           character varying(512),
    productinternalid         character varying(255)
    ..
);
you can download the data here (URLs just a tiny bit obfuscated):
The small table http:// gusw dot net/tmp_q.gz
The big table is in the form of 9 parts of 20 MB each, http:// gusw dot net/tmp_r.gz.00, .01, .02, ..., .09, maybe you need only the first part.
Download as many as you have patience to grab, and then import the data like this:
\copy tmp_q from program 'zcat tmp_q.gz'
\copy tmp_r from program 'cat tmp_r.gz.* | zcat'
The only problem is that I can't test that this actually would trigger the memory problem, because I can't force the plan to use the right join; it always reverts to the left join, hashing tmp_q:
 ->  Hash Left Join  (cost=10.25..5601401.19 rows=5505039 width=12118)
       Hash Cond: (((r.documentinternalid)::text = (q.documentinternalid)::text) AND ((r.actinternalid)::text = (q.actinternalid)::text))
       ->  Seq Scan on tmp_r r  (cost=0.00..5560089.39 rows=5505039 width=6844)
       ->  Hash  (cost=10.10..10.10 rows=10 width=6306)
             ->  Seq Scan on tmp_q q  (cost=0.00..10.10 rows=10 width=6306)
which is of course much better. But when tmp_q and tmp_r are the results of complex subplans that the planner can't estimate, it gets the sides wrong, and the issue gets triggered because we are hashing the big tmp_r, not tmp_q.
It would be so nice if there were a way to force a specific plan for testing purposes. I tried planting false data in pg_class's reltuples and relpages:
foo=# analyze tmp_q;
ANALYZE
foo=# analyze tmp_r;
ANALYZE
foo=# select relname, relpages, reltuples from pg_class where relname in ('tmp_q', 'tmp_r');
 relname | relpages |  reltuples
---------+----------+-------------
 tmp_r   |  5505039 | 1.13467e+08
 tmp_q   |        7 |         236
(2 rows)

foo=# update pg_class set (relpages, reltuples) = (5505039, 1.13467e+08) where relname = 'tmp_q';
UPDATE 1
foo=# update pg_class set (relpages, reltuples) = (7, 236) where relname = 'tmp_r';
UPDATE 1
but that didn't help. Somehow the planner outsmarts every such trick, so I can't get it to follow my right outer join plan where the big table is hashed. I am sure y'all know some way to force it.
regards,
-Gunther
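For reference, the stock session-level knobs for steering the planner look like this (a sketch; note the enable_* settings only discourage whole join methods, they do not dictate which input becomes the hashed side, so they may well not be enough to reproduce the Hash Right Join here):

    SET enable_mergejoin = off;  -- rule out merge join
    SET enable_nestloop  = off;  -- rule out nested loops, leaving hash join
    EXPLAIN
    SELECT *
    FROM tmp_q q
    RIGHT OUTER JOIN tmp_r r USING (documentinternalid, actinternalid);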
On Fri, Apr 19, 2019 at 11:34:54PM -0400, Gunther wrote:
> On 4/19/2019 17:01, Justin Pryzby wrote:
> >Were you able to reproduce the issue in some minimized way ?  Like after
> >joining fewer tables or changing to join with fewer join conditions ?
> >
> >On Thu, Apr 18, 2019 at 05:21:28PM +0200, Tomas Vondra wrote:
> >>It would be possible to do at least one of these two things:
>
> Thanks, and sorry for my pushyness. Yes, I have pin pointed the HashJoin,
> and I have created the two tables involved.
>
> The data distribution of the join keys, they are all essentially UUIDs and
> essentially random.
>
> I am sharing this data with you. However, only someone who can actually
> control the planner can use it to reproduce the problem. I have tried but
> not succeeded. But I am sure the problem is reproduced by this material.
>
> Here is the part of the plan that generates this massive number of calls to
>
>  ->  Hash Right Join  (cost=4255031.53..5530808.71 rows=34619 width=1197)
>        Hash Cond: (((q.documentinternalid)::text = (documentinformationsubject.documentinternalid)::text) AND ((r.targetinternalid)::text = (documentinformationsubject.actinternalid)::text))
>        ->  Hash Right Join  (cost=1341541.37..2612134.36 rows=13 width=341)
>              Hash Cond: (((documentinformationsubject_2.documentinternalid)::text = (q.documentinternalid)::text) AND ((documentinformationsubject_2.actinternalid)::text = (q.actinternalid)::text))
>              ... let's call this tmp_q ...
>        ->  Hash  (cost=2908913.87..2908913.87 rows=34619 width=930)
>              ->  Gather Merge  (cost=2892141.40..2908913.87 rows=34619 width=930)
>                    ... let's call this tmp_r ...

Would you send basic stats for these ?
q.documentinternalid, documentinformationsubject.documentinternalid, r.targetinternalid, documentinformationsubject.actinternalid

Like from this query
https://wiki.postgresql.org/wiki/Slow_Query_Questions#Statistics:_n_distinct.2C_MCV.2C_histogram

Justin
> The only problem is that I can't test that this actually would trigger the
> memory problem, because I can't force the plan to use the right join, it
> always reverts to the left join hashing the tmp_q:

I think the table on the "OUTER" side is the one which needs to be iterated
over (not hashed), in order to return each of its rows even if there are no
join partners in the other table.  In your original query, the small table
was being hashed and the large table iterated; maybe that's what's important.

> which is of course much better, but when tmp_q and tmp_r are the results of
> complex stuff that the planner can't estimate, then it gets it wrong, and
> then the issue gets triggered because we are hashing on the big tmp_r, not
> tmp_q.

I was able to get something maybe promising ?  "Batches: 65536 (originally 1)"

I didn't get an "Out of memory" error yet, but did crash the server with this one:

postgres=# explain analyze WITH v AS( SELECT * FROM generate_series(1,99999999)i WHERE i%10<10 AND i%11<11 AND i%12<12 AND i%13<13 AND i%14<14 AND i%15<15 AND i%16<16 AND i%17<17 AND i%18<18 AND i%19<19 AND i%20<20 AND i%21<21 ) SELECT * FROM generate_series(1,99)k JOIN v ON k=i ;

Note, on pg12dev this needs to be "with v AS MATERIALIZED".

postgres=# SET work_mem='128kB'; SET client_min_messages=log; SET log_statement_stats=on;
postgres=# explain(analyze,timing off) WITH v AS( SELECT * FROM generate_series(1,999999)i WHERE i%10<10 AND i%11<11 AND i%12<12 AND i%13<13 AND i%14<14 AND i%15<15 AND i%16<16 AND i%17<17 AND i%18<18 AND i%19<19 AND i%20<20 AND i%21<21 ) SELECT * FROM generate_series(1,99)k JOIN v ON k=i ;

 Hash Join  (cost=70.04..83.84 rows=5 width=8) (actual rows=99 loops=1)
   Hash Cond: (k.k = v.i)
   CTE v
     ->  Function Scan on generate_series i  (cost=0.00..70.00 rows=1 width=4) (actual rows=999999 loops=1)
           Filter: (((i % 10) < 10) AND ((i % 11) < 11) AND ((i % 12) < 12) AND ((i % 13) < 13) AND ((i % 14) < 14) AND ((i % 15) < 15) AND ((i % 16) < 16) AND ((i % 17) < 17) AND ((i % 18) < 18) AND ((i % 19) < 19) AND ((i % 20) < 20) AND ((i % 21) < 21))
   ->  Function Scan on generate_series k  (cost=0.00..10.00 rows=1000 width=4) (actual rows=99 loops=1)
   ->  Hash  (cost=0.02..0.02 rows=1 width=4) (actual rows=999999 loops=1)
         Buckets: 4096 (originally 1024)  Batches: 512 (originally 1)  Memory Usage: 101kB
         ->  CTE Scan on v  (cost=0.00..0.02 rows=1 width=4) (actual rows=999999 loops=1)

Justin
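Why this repro works (a gloss, not stated outright above): before v12 a CTE is an optimization fence, so the planner cannot see through the stacked i%N filters and ends up with a one-row estimate for the CTE scan - visible above as rows=1 versus actual rows=999999. It therefore happily picks the CTE as the hashed side, and at runtime the hash table has to split over and over. A cleaned-up sketch of the same shape:

    SET work_mem = '128kB';
    EXPLAIN (ANALYZE, TIMING OFF)
    WITH v AS (                        -- on pg12+ write: WITH v AS MATERIALIZED
        SELECT i
        FROM generate_series(1, 999999) i
        WHERE i % 10 < 10 AND i % 11 < 11 AND i % 12 < 12
          AND i % 13 < 13 AND i % 14 < 14 AND i % 15 < 15
          AND i % 16 < 16 AND i % 17 < 17 AND i % 18 < 18
          AND i % 19 < 19 AND i % 20 < 20 AND i % 21 < 21
        -- every predicate is always true, but each gets a default
        -- selectivity estimate, so the planner expects ~1 surviving row
    )
    SELECT *
    FROM generate_series(1, 99) k
    JOIN v ON k = i;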
On Sun, Apr 14, 2019 at 11:24:59PM -0400, Tom Lane wrote:
> Gunther <raj@gusw.net> writes:
> > ExecutorState: 2234123384 total in 266261 blocks; 3782328 free (17244 chunks); 2230341056 used
>
> Oooh, that looks like a memory leak right enough.  The ExecutorState
> should not get that big for any reasonable query.

On Tue, Apr 16, 2019 at 11:30:19AM -0400, Tom Lane wrote:
> Hmm ... this matches up with a vague thought I had that for some reason
> the hash join might be spawning a huge number of separate batches.
> Each batch would have a couple of files with associated in-memory
> state including an 8K I/O buffer, so you could account for the

On Tue, Apr 16, 2019 at 10:24:53PM -0400, Gunther wrote:
> -> Hash (cost=2861845.87..2861845.87 rows=34619 width=74) (actual time=199792.446..199792.446 rows=113478127 loops=1)
>    Buckets: 65536 (originally 65536)  Batches: 131072 (originally 2)  Memory Usage: 189207kB

Is it significant that there are ~2x as many ExecutorState blocks as there
are batches ?  266261/131072 => 2.03...

If there was 1 block leaked when batch=2, and 2 blocks leaked when batch=4,
and 4 blocks leaked when batch=8, then when batch=131072, there'd be 64k
leaked blocks, and 131072 total blocks.

I'm guessing Tom probably already thought of this, but:
2230341056/266261 => ~8376
which is pretty close to the 8kB I/O buffer you were talking about (if the
number of same-sized buffers is much greater than other allocations).

If Tom thinks (as I understand) that the issue is *not* a memory leak, but
out-of-control increasing of nbatches, and failure to account for their
size... then this patch might help.

The number of batches is increased to avoid exceeding work_mem.  With very
low work_mem (or a very large number of tuples hashed), it'll try to use a
large number of batches.  At some point the memory used by the BatchFiles
structure (increasing by powers of two) itself exceeds work_mem.

With larger work_mem, there's less need for more batches.  So the number of
batches used for small work_mem needs to be constrained, either based on
work_mem, or at all.

With my patch, the number of batches is nonlinear WRT work_mem, and reaches
a maximum for moderately small work_mem.  The goal is to choose the optimal
number of batches to minimize the degree to which work_mem is exceeded.

Justin
On Tue, Apr 16, 2019 at 11:46:51PM -0500, Justin Pryzby wrote:
> I wonder if it'd be useful to compile with ./configure CFLAGS=-DHJDEBUG=1
> Could you try this, too ?
OK, doing it now. Here is what I'm getting in the log file. I am surprised I get so few rows here when there
2019-04-20 17:12:16.077 UTC [7093] LOG:  database system was shut down at 2019-04-20 17:12:15 UTC
2019-04-20 17:12:16.085 UTC [7091] LOG:  database system is ready to accept connections
Hashjoin 0x118e0c8: initial nbatch = 1, nbuckets = 1024
Hashjoin 0x118e0f8: initial nbatch = 1, nbuckets = 1024
Hashjoin 0x1194e78: initial nbatch = 1, nbuckets = 65536
Hashjoin 0x119b518: initial nbatch = 16, nbuckets = 65536
Hashjoin 0x1194e78: initial nbatch = 1, nbuckets = 65536
Hashjoin 0x119bb38: initial nbatch = 16, nbuckets = 65536
TopMemoryContext: 4347672 total in 9 blocks; 41784 free (19 chunks); 4305888 used
  HandleParallelMessages: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
  pgstat TabStatusArray lookup hash table: 8192 total in 1 blocks; 1456 free (0 chunks); 6736 used
  TopTransactionContext: 8192 total in 1 blocks; 5416 free (2 chunks); 2776 used
  Operator lookup cache: 24576 total in 2 blocks; 10760 free (3 chunks); 13816 used
  TableSpace cache: 8192 total in 1 blocks; 2096 free (0 chunks); 6096 used
  Type information cache: 24352 total in 2 blocks; 2624 free (0 chunks); 21728 used
  RowDescriptionContext: 8192 total in 1 blocks; 6896 free (0 chunks); 1296 used
  MessageContext: 32768 total in 3 blocks; 13488 free (1 chunks); 19280 used
  Operator class cache: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used
  smgr relation table: 32768 total in 3 blocks; 16832 free (8 chunks); 15936 used
  TransactionAbortContext: 32768 total in 1 blocks; 32512 free (0 chunks); 256 used
  Portal hash: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used
  TopPortalContext: 8192 total in 1 blocks; 7664 free (0 chunks); 528 used
    PortalHoldContext: 24632 total in 2 blocks; 7392 free (0 chunks); 17240 used
    PortalContext: 1474560 total in 183 blocks; 6152 free (8 chunks); 1468408 used:
      ExecutorState: 2234501600 total in 266274 blocks; 3696112 free (17274 chunks); 2230805488 used
        HashTableContext: 32768 total in 3 blocks; 17272 free (8 chunks); 15496 used
          HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        HashTableContext: 8192 total in 1 blocks; 7320 free (0 chunks); 872 used
          HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        HashTableContext: 8192 total in 1 blocks; 7320 free (0 chunks); 872 used
          HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        HashTableContext: 8192 total in 1 blocks; 7752 free (0 chunks); 440 used
          HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used
        HashTableContext: 8192 total in 1 blocks; 7624 free (0 chunks); 568 used
          HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used
        TupleSort main: 286912 total in 8 blocks; 246792 free (39 chunks); 40120 used
        TupleSort main: 286912 total in 8 blocks; 246792 free (39 chunks); 40120 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        HashTableContext: 8454256 total in 6 blocks; 64848 free (32 chunks); 8389408 used
          HashBatchContext: 177003344 total in 5387 blocks; 7936 free (0 chunks); 176995408 used
        TupleSort main: 452880 total in 8 blocks; 126248 free (27 chunks); 326632 used
          Caller tuples: 1048576 total in 8 blocks; 21608 free (14 chunks); 1026968 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
  Relcache by OID: 16384 total in 2 blocks; 3512 free (2 chunks); 12872 used
  CacheMemoryContext: 1101328 total in 14 blocks; 236384 free (1 chunks); 864944 used
    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_toast_2619_index
    index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: entity_id_fkidx
    index info: 2048 total in 2 blocks; 696 free (1 chunks); 1352 used: entity_id_idx
    index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: act_id_fkidx
    ...
    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_attribute_relid_attnum_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_class_oid_index
  WAL record construction: 49768 total in 2 blocks; 6368 free (0 chunks); 43400 used
  PrivateRefCount: 8192 total in 1 blocks; 2624 free (0 chunks); 5568 used
  MdSmgr: 8192 total in 1 blocks; 6152 free (1 chunks); 2040 used
  LOCALLOCK hash: 16384 total in 2 blocks; 4600 free (2 chunks); 11784 used
  Timezones: 104120 total in 2 blocks; 2624 free (0 chunks); 101496 used
  ErrorContext: 8192 total in 1 blocks; 7936 free (4 chunks); 256 used
Grand total: 2430295856 bytes in 272166 blocks; 5223104 free (17571 chunks); 2425072752 used
2019-04-20 17:28:56.887 UTC [7100] ERROR:  out of memory
2019-04-20 17:28:56.887 UTC [7100] DETAIL:  Failed on request of size 32800 in memory context "HashBatchContext".
2019-04-20 17:28:56.887 UTC [7100] STATEMENT:  explain analyze SELECT * FROM reports.v_BusinessOperation;
(gdb) break ExecHashIncreaseNumBatches
Breakpoint 1 at 0x6b7bd6: file nodeHash.c, line 884.
(gdb) cont
Continuing.

Breakpoint 1, ExecHashIncreaseNumBatches (hashtable=0x12d0818) at nodeHash.c:884
884             int                     oldnbatch = hashtable->nbatch;
(gdb) cont
Continuing.

Breakpoint 1, ExecHashIncreaseNumBatches (hashtable=0x12d0818) at nodeHash.c:884
884             int                     oldnbatch = hashtable->nbatch;
(gdb) cont
Continuing.

Breakpoint 1, ExecHashIncreaseNumBatches (hashtable=0x12d0818) at nodeHash.c:884
884             int                     oldnbatch = hashtable->nbatch;
(gdb) cont 100
Will ignore next 99 crossings of breakpoint 1.  Continuing.

But I am surprised now to find that the behavior has changed, or what?
 7100 postgres  20   0 2797036   1.1g 188064 R  17.9 14.9  14:46.00 postgres: postgres integrator [local] EXPLAIN
 7664 postgres  20   0 1271788  16228  14408 D  17.6  0.2   0:01.96 postgres: parallel worker for PID 7100
 7665 postgres  20   0 1271788  16224  14404 R  17.6  0.2   0:01.95 postgres: parallel worker for PID 7100
#0  ExecHashIncreaseNumBatches (hashtable=0x12d0818) at nodeHash.c:904
#1  0x00000000006b93b1 in ExecHashTableInsert (hashtable=0x12d0818, slot=0x12bcf98, hashvalue=234960700) at nodeHash.c:1655
#2  0x00000000006bd600 in ExecHashJoinNewBatch (hjstate=0x12a4340) at nodeHashjoin.c:1051
#3  0x00000000006bc999 in ExecHashJoinImpl (pstate=0x12a4340, parallel=false) at nodeHashjoin.c:539
#4  0x00000000006bca23 in ExecHashJoin (pstate=0x12a4340) at nodeHashjoin.c:565
#5  0x00000000006a191f in ExecProcNodeInstr (node=0x12a4340) at execProcnode.c:461
#6  0x00000000006ceaad in ExecProcNode (node=0x12a4340) at ../../../src/include/executor/executor.h:247
#7  0x00000000006cebe7 in ExecSort (pstate=0x12a4230) at nodeSort.c:107
#8  0x00000000006a191f in ExecProcNodeInstr (node=0x12a4230) at execProcnode.c:461
#9  0x00000000006a18f0 in ExecProcNodeFirst (node=0x12a4230) at execProcnode.c:445
#10 0x00000000006cf25c in ExecProcNode (node=0x12a4230) at ../../../src/include/executor/executor.h:247
#11 0x00000000006cf388 in ExecUnique (pstate=0x12a4040) at nodeUnique.c:73
#12 0x00000000006a191f in ExecProcNodeInstr (node=0x12a4040) at execProcnode.c:461
#13 0x00000000006a18f0 in ExecProcNodeFirst (node=0x12a4040) at execProcnode.c:445
#14 0x000000000069728b in ExecProcNode (node=0x12a4040) at ../../../src/include/executor/executor.h:247
#15 0x0000000000699790 in ExecutePlan (estate=0x12a3da0, planstate=0x12a4040, use_parallel_mode=true, operation=CMD_SELECT, sendTuples=true, numberTuples=0, direction=ForwardScanDirection, dest=0xe811c0 <donothingDR>, execute_once=true) at execMain.c:1723
#16 0x0000000000697757 in standard_ExecutorRun (queryDesc=0x1404168, direction=ForwardScanDirection, count=0, execute_once=true) at execMain.c:364
#17 0x00000000006975f4 in ExecutorRun (queryDesc=0x1404168, direction=ForwardScanDirection, count=0, execute_once=true) at execMain.c:307
#18 0x000000000060d227 in ExplainOnePlan (plannedstmt=0x1402588, into=0x0, es=0x10b03a8, queryString=0x10866c0 "explain analyze SELECT * FROM reports.v_BusinessOperation;", params=0x0, queryEnv=0x0, planduration=0x7fff56a0df00) at explain.c:535
(gdb) break AllocSetAlloc if (int)strcmp(context->name, "ExecutorState") == 0
Breakpoint 4 at 0x9eab11: file aset.c, line 719.
(gdb) cont
Continuing.

Breakpoint 4, AllocSetAlloc (context=0x12a3c90, size=8272) at aset.c:719
719             AllocSet        set = (AllocSet) context;
(gdb) bt 5
#0  AllocSetAlloc (context=0x12a3c90, size=8272) at aset.c:719
#1  0x00000000009f2e47 in palloc (size=8272) at mcxt.c:938
#2  0x000000000082ae84 in makeBufFileCommon (nfiles=1) at buffile.c:116
#3  0x000000000082af14 in makeBufFile (firstfile=34029) at buffile.c:138
#4  0x000000000082b09f in BufFileCreateTemp (interXact=false) at buffile.c:201
(More stack frames follow...)
(gdb) bt 7
#0  AllocSetAlloc (context=0x12a3c90, size=8272) at aset.c:719
#1  0x00000000009f2e47 in palloc (size=8272) at mcxt.c:938
#2  0x000000000082ae84 in makeBufFileCommon (nfiles=1) at buffile.c:116
#3  0x000000000082af14 in makeBufFile (firstfile=34029) at buffile.c:138
#4  0x000000000082b09f in BufFileCreateTemp (interXact=false) at buffile.c:201
#5  0x00000000006bda31 in ExecHashJoinSaveTuple (tuple=0x86069b8, hashvalue=234960700, fileptr=0x8616a30) at nodeHashjoin.c:1220
#6  0x00000000006b80c0 in ExecHashIncreaseNumBatches (hashtable=0x12d0818) at nodeHash.c:1004
(More stack frames follow...)
(gdb) cont
Continuing.

Breakpoint 4, AllocSetAlloc (context=0x12a3c90, size=8) at aset.c:719
719             AllocSet        set = (AllocSet) context;
(gdb) bt 7
#0  AllocSetAlloc (context=0x12a3c90, size=8) at aset.c:719
#1  0x00000000009f2f5d in palloc0 (size=8) at mcxt.c:969
#2  0x000000000082aea2 in makeBufFileCommon (nfiles=1) at buffile.c:119
#3  0x000000000082af14 in makeBufFile (firstfile=34029) at buffile.c:138
#4  0x000000000082b09f in BufFileCreateTemp (interXact=false) at buffile.c:201
#5  0x00000000006bda31 in ExecHashJoinSaveTuple (tuple=0x86069b8, hashvalue=234960700, fileptr=0x8616a30) at nodeHashjoin.c:1220
#6  0x00000000006b80c0 in ExecHashIncreaseNumBatches (hashtable=0x12d0818) at nodeHash.c:1004
(More stack frames follow...)
(gdb) cont
Continuing.

Breakpoint 4, AllocSetAlloc (context=0x12a3c90, size=4) at aset.c:719
719             AllocSet        set = (AllocSet) context;
(gdb) bt 7
#0  AllocSetAlloc (context=0x12a3c90, size=4) at aset.c:719
#1  0x00000000009f2e47 in palloc (size=4) at mcxt.c:938
#2  0x000000000082af22 in makeBufFile (firstfile=34029) at buffile.c:140
#3  0x000000000082b09f in BufFileCreateTemp (interXact=false) at buffile.c:201
#4  0x00000000006bda31 in ExecHashJoinSaveTuple (tuple=0x86069b8, hashvalue=234960700, fileptr=0x8616a30) at nodeHashjoin.c:1220
#5  0x00000000006b80c0 in ExecHashIncreaseNumBatches (hashtable=0x12d0818) at nodeHash.c:1004
#6  0x00000000006b93b1 in ExecHashTableInsert (hashtable=0x12d0818, slot=0x12bcf98, hashvalue=234960700) at nodeHash.c:1655
(More stack frames follow...)
(gdb) cont
Continuing.

Breakpoint 4, AllocSetAlloc (context=0x12a3c90, size=375) at aset.c:719
719             AllocSet        set = (AllocSet) context;
(gdb) bt 7
#0  AllocSetAlloc (context=0x12a3c90, size=375) at aset.c:719
#1  0x00000000009f2e47 in palloc (size=375) at mcxt.c:938
#2  0x00000000006bdbec in ExecHashJoinGetSavedTuple (hjstate=0x12a4340, file=0x13ca418, hashvalue=0x7fff56a0da54, tupleSlot=0x12bcf98) at nodeHashjoin.c:1277
#3  0x00000000006bd61f in ExecHashJoinNewBatch (hjstate=0x12a4340) at nodeHashjoin.c:1042
#4  0x00000000006bc999 in ExecHashJoinImpl (pstate=0x12a4340, parallel=false) at nodeHashjoin.c:539
#5  0x00000000006bca23 in ExecHashJoin (pstate=0x12a4340) at nodeHashjoin.c:565
#6  0x00000000006a191f in ExecProcNodeInstr (node=0x12a4340) at execProcnode.c:461
(More stack frames follow...)

(gdb) finish
Run till exit from #0  AllocSetAlloc (context=0x12a3c90, size=375) at aset.c:719
0x00000000009f2e47 in palloc (size=375) at mcxt.c:938
938             ret = context->methods->alloc(context, size);
Value returned is $1 = (void *) 0x11bd858
(gdb) cont
Continuing.

Breakpoint 5, AllocSetFree (context=0x12a3c90, pointer=0x11bf748) at aset.c:992
992             AllocSet        set = (AllocSet) context;
(gdb) cont
Continuing.

Breakpoint 4, AllocSetAlloc (context=0x12a3c90, size=375) at aset.c:719
719             AllocSet        set = (AllocSet) context;
(gdb) finish
Run till exit from #0  AllocSetAlloc (context=0x12a3c90, size=375) at aset.c:719
0x00000000009f2e47 in palloc (size=375) at mcxt.c:938
938             ret = context->methods->alloc(context, size);
Value returned is $2 = (void *) 0x11bf748
(gdb) cont
Continuing.

Breakpoint 5, AllocSetFree (context=0x12a3c90, pointer=0x11bd858) at aset.c:992
992             AllocSet        set = (AllocSet) context;
Breakpoint 2, ExecHashIncreaseNumBatches (hashtable=0x12d0818) at nodeHash.c:904
904             printf("Hashjoin %p: increasing nbatch to %d because space = %zu\n",
(gdb) next
908             oldcxt = MemoryContextSwitchTo(hashtable->hashCxt);
(gdb) call MemoryContextStats(TopPortalContext)
Hashjoin 0x118e4c8: initial nbatch = 1, nbuckets = 1024
Hashjoin 0x118e4f8: initial nbatch = 1, nbuckets = 1024
Hashjoin 0x1195278: initial nbatch = 1, nbuckets = 65536
Hashjoin 0x119b918: initial nbatch = 16, nbuckets = 65536
Hashjoin 0x1195278: initial nbatch = 1, nbuckets = 65536
Hashjoin 0x119b918: initial nbatch = 16, nbuckets = 65536
...
Hashjoin 0x13a8e68: initial nbatch = 16, nbuckets = 8192
Hashjoin 0x13a8e68: increasing nbatch to 32 because space = 4128933
Hashjoin 0x13a8e68: freed 148 of 10584 tuples, space now 4071106
Hashjoin 0x13a8e68: increasing nbatch to 64 because space = 4128826
Hashjoin 0x13a8e68: freed 544 of 10584 tuples, space now 3916296
Hashjoin 0x13a8e68: increasing nbatch to 128 because space = 4128846
Hashjoin 0x13a8e68: freed 10419 of 10585 tuples, space now 65570
Hashjoin 0x13a8e68: increasing nbatch to 256 because space = 4128829
Hashjoin 0x13a8e68: freed 10308 of 10734 tuples, space now 161815
Hashjoin 0x13a8e68: increasing nbatch to 512 because space = 4128908
Hashjoin 0x13a8e68: freed 398 of 10379 tuples, space now 3977787
Hashjoin 0x13a8e68: increasing nbatch to 1024 because space = 4129008
Hashjoin 0x13a8e68: freed 296 of 10360 tuples, space now 4013423
Hashjoin 0x13a8e68: increasing nbatch to 2048 because space = 4129133
Hashjoin 0x13a8e68: freed 154 of 10354 tuples, space now 4068786
Hashjoin 0x13a8e68: increasing nbatch to 4096 because space = 4129035
Hashjoin 0x13a8e68: freed 10167 of 10351 tuples, space now 72849
Hashjoin 0x242c9b0: initial nbatch = 1, nbuckets = 1024
Hashjoin 0x2443ee0: initial nbatch = 1, nbuckets = 1024
Hashjoin 0x2443aa0: initial nbatch = 1, nbuckets = 1024
Hashjoin 0x2443440: initial nbatch = 1, nbuckets = 65536
Hashjoin 0x2443330: initial nbatch = 16, nbuckets = 65536
Hashjoin 0x13a8e68: increasing nbatch to 8192 because space = 4128997
Hashjoin 0x12d0818: freed 10555 of 10560 tuples, space now 1983
Hashjoin 0x12d0818: increasing nbatch to 16384 because space = 4128957
Hashjoin 0x12d0818: freed 10697 of 10764 tuples, space now 25956

TopPortalContext: 8192 total in 1 blocks; 7664 free (0 chunks); 528 used
...
(gdb) cont 100
Will ignore next 99 crossings of breakpoint 1.  Continuing.

Breakpoint 1, ExecHashIncreaseNumBatches (hashtable=0x12d0818) at nodeHash.c:884
884             int                     oldnbatch = hashtable->nbatch;
(gdb) bt 7
#0  ExecHashIncreaseNumBatches (hashtable=0x12d0818) at nodeHash.c:884
#1  0x00000000006b93b1 in ExecHashTableInsert (hashtable=0x12d0818, slot=0x12bcf98, hashvalue=2161368) at nodeHash.c:1655
#2  0x00000000006bd600 in ExecHashJoinNewBatch (hjstate=0x12a4340) at nodeHashjoin.c:1051
#3  0x00000000006bc999 in ExecHashJoinImpl (pstate=0x12a4340, parallel=false) at nodeHashjoin.c:539
#4  0x00000000006bca23 in ExecHashJoin (pstate=0x12a4340) at nodeHashjoin.c:565
#5  0x00000000006a191f in ExecProcNodeInstr (node=0x12a4340) at execProcnode.c:461
#6  0x00000000006ceaad in ExecProcNode (node=0x12a4340) at ../../../src/include/executor/executor.h:247
(More stack frames follow...)
... ExecutorState: 782712360 total in 92954 blocks; 3626888 free (3126 chunks); 779085472 used
>> The only problem is that I can't test that this actually would trigger the
>> memory problem, because I can't force the plan to use the right join, it
>> always reverts to the left join hashing the tmp_q:
>
> I think the table on the "OUTER" side is the one which needs to be iterated
> over (not hashed), in order to return each of its rows even if there are no
> join partners in the other table.  In your original query, the small table
> was being hashed and the large table iterated; maybe that's what's important.
Maybe so. I am still trying to wrap my head around RIGHT vs. LEFT outer join, and why there even is a difference, though.
 ->  Hash Right Join  (cost=4255031.53..5530808.71 rows=34619 width=1197)
       Hash Cond: (((q.documentinternalid)::text = (documentinformationsubject.documentinternalid)::text) AND ((r.targetinternalid)::text = (documentinformationsubject.actinternalid)::text))
       ->  Hash Right Join  (cost=1341541.37..2612134.36 rows=13 width=341)
             Hash Cond: (((documentinformationsubject_2.documentinternalid)::text = (q.documentinternalid)::text) AND ((documentinformationsubject_2.actinternalid)::text = (q.actinternalid)::text))
             ... from the TINY table tmp_q, the estimate of 13 rows ain't bad
       ->  Hash  (cost=2908913.87..2908913.87 rows=34619 width=930)
             ->  Gather Merge  (cost=2892141.40..2908913.87 rows=34619 width=930)
                   ... from the HUGE tmp_r table, with 100 million rows (estimate of 34619 is grossly wrong, but how could it know?) ...
Now in a right join, we include all rows from the right table, and only those from the left table that match the join key. I wonder why we don't just transform all of those to left joins, then?
Why are we not hashing only the "optional" side?
The plan here seems to tell me without a doubt that it is hashing the big table: 100 million rows get hashed into the hash join, and then we iterate over the tiny table. Now my mind boggles, and I just don't know why there are right joins at all.
But I see the plan is building the hash over the 100 million rows.
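For orientation (a gloss on the question above): the two spellings are logically interchangeable once the operands are swapped, and in an EXPLAIN tree the input sitting under the Hash node is always the build side - the one that gets hashed:

    -- logically equivalent; the planner may render either one as a
    -- Hash Left Join or a Hash Right Join, depending on which input
    -- it decides to build the hash table on
    SELECT * FROM tmp_r r LEFT  JOIN tmp_q q USING (documentinternalid, actinternalid);
    SELECT * FROM tmp_q q RIGHT JOIN tmp_r r USING (documentinternalid, actinternalid);

In the plan above, the Hash node sits over the Gather Merge feeding tmp_r, so the 100-million-row side is the build side.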
>> which is of course much better, but when tmp_q and tmp_r are the results of
>> complex stuff that the planner can't estimate, then it gets it wrong, and
>> then the issue gets triggered because we are hashing on the big tmp_r, not
>> tmp_q.
>
> I was able to get something maybe promising ?  "Batches: 65536 (originally 1)"
>
> I didn't get "Out of memory" error yet, but did crash the server with this one:
>
> postgres=# explain analyze WITH v AS( SELECT * FROM generate_series(1,99999999)i WHERE i%10<10 AND i%11<11 AND i%12<12 AND i%13<13 AND i%14<14 AND i%15<15 AND i%16<16 AND i%17<17 AND i%18<18 AND i%19<19 AND i%20<20 AND i%21<21 ) SELECT * FROM generate_series(1,99)k JOIN v ON k=i ;
>
> Note, on pg12dev this needs to be "with v AS MATERIALIZED".
>
> postgres=# SET work_mem='128kB'; SET client_min_messages=log; SET log_statement_stats=on;
> postgres=# explain(analyze,timing off) WITH v AS( SELECT * FROM generate_series(1,999999)i WHERE i%10<10 AND i%11<11 AND i%12<12 AND i%13<13 AND i%14<14 AND i%15<15 AND i%16<16 AND i%17<17 AND i%18<18 AND i%19<19 AND i%20<20 AND i%21<21 ) SELECT * FROM generate_series(1,99)k JOIN v ON k=i ;
>
>  Hash Join  (cost=70.04..83.84 rows=5 width=8) (actual rows=99 loops=1)
>    Hash Cond: (k.k = v.i)
>    CTE v
>      ->  Function Scan on generate_series i  (cost=0.00..70.00 rows=1 width=4) (actual rows=999999 loops=1)
>            Filter: (((i % 10) < 10) AND ((i % 11) < 11) AND ((i % 12) < 12) AND ((i % 13) < 13) AND ((i % 14) < 14) AND ((i % 15) < 15) AND ((i % 16) < 16) AND ((i % 17) < 17) AND ((i % 18) < 18) AND ((i % 19) < 19) AND ((i % 20) < 20) AND ((i % 21) < 21))
>    ->  Function Scan on generate_series k  (cost=0.00..10.00 rows=1000 width=4) (actual rows=99 loops=1)
>    ->  Hash  (cost=0.02..0.02 rows=1 width=4) (actual rows=999999 loops=1)
>          Buckets: 4096 (originally 1024)  Batches: 512 (originally 1)  Memory Usage: 101kB
>          ->  CTE Scan on v  (cost=0.00..0.02 rows=1 width=4) (actual rows=999999 loops=1)
Yes, I thought that with CTEs and functions one might be able to generate a test case, but I am still not seeing how you can trick the planner into this peculiar join type JOIN_RIGHT, and whether that is even requisite for triggering the problem.
Finally, I have tried to make a pstate pretty printer in explain.c:
void
DumpPlanState(PlanState *pstate) {
    ExplainState *es = NewExplainState();

    ExplainNode(pstate, NIL, NULL, NULL, es);
    puts(es->str->data);
    pfree(es->str->data);
}
but that didn't work, because unfortunately that ExplainNode function is destructive. It would be so nice to refactor this explain code so that there would be a completely conservative function that simply dumps the present pstate with all the information about its estimated and actual situation: how many iterations it has already completed, how many it estimates still to do, whether its original estimate panned out or not, etc. This would be tremendously useful for runtime debugging of queries; I think the utility of it can hardly be overstated. Even for end-user applications of some data warehouse, you could probe a long-running query every 5 seconds to see where its execution is. I could not think of a more low-hanging-fruit useful feature. I am sure that if PostgreSQL had originally been written in Java, this feature would already exist.
regards and Happy Easter,
-Gunther
On Fri, Apr 19, 2019 at 11:34:54PM -0400, Gunther wrote:
>
> ...
>
>It would be so nice if there was a way to force a specific plan for
>purposes of the testing. I tried giving false data in pg_class
>reltuples and relpages:
>
>foo=# analyze tmp_q;
>ANALYZE
>foo=# analyze tmp_r;
>ANALYZE
>foo=# select relname, relpages, reltuples from pg_class where relname in ('tmp_q', 'tmp_r');
> relname | relpages |  reltuples
>---------+----------+-------------
> tmp_r   |  5505039 | 1.13467e+08
> tmp_q   |        7 |         236
>(2 rows)
>
>foo=# update pg_class set (relpages, reltuples) = (5505039, 1.13467e+08) where relname = 'tmp_q';
>UPDATE 1
>foo=# update pg_class set (relpages, reltuples) = (7, 236) where relname = 'tmp_r';
>UPDATE 1
>
>but that didn't help. Somehow the planner outsmarts every such trick,
>so I can't get it to follow my right outer join plan where the big
>table is hashed. I am sure y'all know some way to force it.
>

That does not work, because the planner does not actually use these values
directly - it only computes the density from them, and then multiplies that
by the current number of pages in the file. That behaves much nicer when
the table grows/shrinks between refreshes of the pg_class values.

So what you need to do is tweak these values to skew the density in a way
that then results in the desired estimate when multiplied with the actual
number of pages. For me, this did the trick:

  update pg_class set (relpages, reltuples) = (1000000, 1)
  where relname = 'tmp_r';

  update pg_class set (relpages, reltuples) = (1, 1000000)
  where relname = 'tmp_q';

after which I get a plan like this:

  Hash Right Join
    Hash Cond: (...)
    ->  Seq Scan on tmp_q q
    ->  Hash
          ->  Seq Scan on tmp_r r

As for the issue, I have a theory that I think would explain it. It is
related to the number of batches, as others speculated over here. It's not
a memory leak, though; it's just that each batch requires a lot of extra
memory and we don't account for that.

The trouble is - each batch is represented by a BufFile, which is a
whopping 8272 bytes, because it includes PGAlignedBlock. Now, with 131072
batches, that's a nice 1GB of memory right there. And we don't actually
account for this memory in the hashjoin code, so it's not counted against
work_mem and we just keep increasing the number of batches.

Attached are two patches, which should help us to confirm that's actually
what's happening when running the query on actual data.

The first patch moves the BufFile stuff into a separate memory context, to
make it more obvious where the memory went. It also adds a bunch of logging
into the ExecHashIncreaseNumBatches() function.

The second patch makes sure all the BufFiles are allocated right when
increasing the number of batches - otherwise we allocate them only when we
actually find a row for that batch, and I suspect the sample data shared on
this thread are somewhat correlated (I see long runs of the same UUID
value). That might slow down the memory growth. Of course, the real data
probably don't have such correlation, resulting in faster failures.

With the patch, I see stuff like this with 256k batches:

  ExecutorState: 65536 total in 4 blocks; 28136 free (4 chunks); 37400 used
    HashTableContext: 8192 total in 1 blocks; 7624 free (0 chunks); 568 used
      hash batch files: 4404002656 total in 524302 blocks; 8387928 free (20 chunks); 4395614728 used

so it's conceivable it's the root cause.

As for a fix, I'm not sure. I'm pretty sure we need to consider the amount
of memory for BufFile(s) when increasing the number of batches. But we
can't just stop incrementing the batches, because that would mean the
current batch may easily get bigger than work_mem :-(

I think we might cap the number of batches kept in memory, and at some
point start spilling data into an "overflow batch." So for example we'd
allow 32k batches, and then instead of increasing nbatch to 64k, we'd
create a single "overflow batch" representing batches 32k - 64k. After
processing the first 32k batches, we'd close the files and reuse the memory
for the next 32k batches. We'd read the overflow batch, split it into the
32k batches, and just process them as usual. Of course, there might be
multiple rounds of this; for example we might end up with 32k concurrent
batches but 128k virtual ones, which means we'd do 4 rounds of this dance.
It's a bit inefficient, but situations like this should be rather rare, and
it's more graceful than just crashing with OOM.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
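To spell out the arithmetic behind that trick (my summary; the relevant logic lives around estimate_rel_size() in the planner): the row estimate is roughly reltuples/relpages - the stored density - multiplied by the table's actual current page count, so the knob to turn is the density, not the absolute numbers:

    -- tmp_q really occupies ~7 pages; a stored density of 1000000
    -- tuples/page makes the planner estimate ~7 million rows for it
    UPDATE pg_class SET (relpages, reltuples) = (1, 1000000)
    WHERE relname = 'tmp_q';

    -- tmp_r really occupies ~5.5 million pages; a stored density of
    -- 0.000001 tuples/page makes it look nearly empty, so it becomes
    -- the side the planner considers cheap to hash
    UPDATE pg_class SET (relpages, reltuples) = (1000000, 1)
    WHERE relname = 'tmp_r';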
On Sat, Apr 20, 2019 at 02:30:09PM -0500, Justin Pryzby wrote:
>Is it significant that there are ~2x as many ExecutorState blocks as there
>are batches ?  266261/131072 => 2.03...
>

IMO that confirms this is the issue with BufFile I just described, because
the struct is >8K, so it's allocated as a separate block (it exceeds the
threshold in AllocSet). And we have two BufFile(s) for each batch, because
we need to batch both the inner and outer relations.

>If there was 1 block leaked when batch=2, and 2 blocks leaked when batch=4,
>and 4 blocks leaked when batch=8, then when batch=131072, there'd be 64k
>leaked blocks, and 131072 total blocks.
>
>I'm guessing Tom probably already thought of this, but:
>2230341056/266261 => ~8376

Well, the BufFile is 8272 on my system, so that's pretty close ;-)

>which is pretty close to the 8kB I/O buffer you were talking about (if the
>number of same-sized buffers is much greater than other allocations).
>
>If Tom thinks (as I understand) that the issue is *not* a memory leak, but
>out-of-control increasing of nbatches, and failure to account for their
>size... then this patch might help.
>
>The number of batches is increased to avoid exceeding work_mem.  With very
>low work_mem (or a very large number of tuples hashed), it'll try to use a
>large number of batches.  At some point the memory used by the BatchFiles
>structure (increasing by powers of two) itself exceeds work_mem.
>
>With larger work_mem, there's less need for more batches.  So the number of
>batches used for small work_mem needs to be constrained, either based on
>work_mem, or at all.
>
>With my patch, the number of batches is nonlinear WRT work_mem, and reaches
>a maximum for moderately small work_mem.  The goal is to choose the optimal
>number of batches to minimize the degree to which work_mem is exceeded.
>

Yeah. The patch might be enough for debugging, but it's clearly not
something we could adopt as is, because we increase the number of batches
for a reason - we need to do that to keep the amount of memory needed for
the hash table contents (i.e. rows) below work_mem.

If you just cap the number of batches, you'll keep the amount of memory for
BufFile under control, but the hash table may exceed work_mem.

Considering how rare this issue likely is, we need to be looking for a
solution that does not break the common case.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
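A quick back-of-the-envelope check of that theory (my arithmetic, assuming the 8272-byte BufFile struct and two BufFiles per batch stated above):

    -- 131072 batches x 2 BufFiles x 8272 bytes each
    SELECT 131072::bigint * 2 * 8272       AS buffile_bytes,
           pg_size_pretty(131072::bigint * 2 * 8272) AS approx;
    -- => 2168455168 bytes, i.e. about 2 GB, which lines up with the
    --    ~2.2 GB ExecutorState reported in the OOM dump above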
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> Considering how rare this issue likely is, we need to be looking for a
> solution that does not break the common case.

Agreed.  What I think we need to focus on next is why the code keeps
increasing the number of batches.  It seems like there must be an undue
amount of data all falling into the same bucket ... but if it were simply
a matter of a lot of duplicate hash keys, the growEnabled shutoff
heuristic ought to trigger.

			regards, tom lane
On Sat, Apr 20, 2019 at 04:26:34PM -0400, Tom Lane wrote:
>Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>> Considering how rare this issue likely is, we need to be looking for a
>> solution that does not break the common case.
>
>Agreed.  What I think we need to focus on next is why the code keeps
>increasing the number of batches.  It seems like there must be an undue
>amount of data all falling into the same bucket ... but if it were simply
>a matter of a lot of duplicate hash keys, the growEnabled shutoff
>heuristic ought to trigger.
>

I think it's really a matter of underestimate, which convinces the planner
to hash the larger table. In this case, the table is 42GB, so it's possible
it actually works as expected. With work_mem = 4MB I've seen 32k batches,
and that's not that far off, I'd say. Maybe there are more common values,
but it does not seem like a very contrived data set.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
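That "not that far off" checks out roughly (my arithmetic): the executor needs about inner-side-size / work_mem batches, rounded up to a power of two, and skew among the keys can force another doubling or two on top:

    -- 42 GB of inner-side data with work_mem = 4 MB
    SELECT ceil(42.0 * 1024 / 4)                   AS batches_needed,    -- 10752
           power(2, ceil(log(2, 42.0 * 1024 / 4))) AS next_power_of_two; -- 16384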
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> I think it's really a matter of underestimate, which convinces the planner
> to hash the larger table. In this case, the table is 42GB, so it's possible
> it actually works as expected. With work_mem = 4MB I've seen 32k batches,
> and that's not that far off, I'd say. Maybe there are more common values,
> but it does not seem like a very contrived data set.

Maybe we just need to account for the per-batch buffers while estimating
the amount of memory used during planning.  That would force this case
into a mergejoin instead, given that work_mem is set so small.

			regards, tom lane
Gunther <raj@gusw.net> writes:
> and checked my log file and there was nothing before the call
> MemoryContextStats(TopPortalContext) so I don't understand where this
> printf stuff is ending up.

It's going to stdout, which is likely block-buffered whereas stderr is
line-buffered, so data from the latter will show up in your log file much
sooner.  You might consider adding something to startup to switch stdout
to line buffering.

			regards, tom lane
On Sat, Apr 20, 2019 at 04:46:03PM -0400, Tom Lane wrote:
>Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>> I think it's really a matter of underestimate, which convinces the planner
>> to hash the larger table. In this case, the table is 42GB, so it's possible
>> it actually works as expected. With work_mem = 4MB I've seen 32k batches,
>> and that's not that far off, I'd say. Maybe there are more common values,
>> but it does not seem like a very contrived data set.
>
>Maybe we just need to account for the per-batch buffers while estimating
>the amount of memory used during planning.  That would force this case
>into a mergejoin instead, given that work_mem is set so small.
>

How would that solve the issue of underestimates like this one?

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, Apr 20, 2019 at 10:36:50PM +0200, Tomas Vondra wrote:
>On Sat, Apr 20, 2019 at 04:26:34PM -0400, Tom Lane wrote:
>>Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>>>Considering how rare this issue likely is, we need to be looking for a
>>>solution that does not break the common case.
>>
>>Agreed.  What I think we need to focus on next is why the code keeps
>>increasing the number of batches.  It seems like there must be an undue
>>amount of data all falling into the same bucket ... but if it were simply
>>a matter of a lot of duplicate hash keys, the growEnabled shutoff
>>heuristic ought to trigger.
>>
>
>I think it's really a matter of underestimate, which convinces the planner
>to hash the larger table. In this case, the table is 42GB, so it's possible
>it actually works as expected. With work_mem = 4MB I've seen 32k batches,
>and that's not that far off, I'd say. Maybe there are more common values,
>but it does not seem like a very contrived data set.
>

Actually, I might have spoken too soon. I've done some stats on the sample
data. There are 113478127 rows in total, and while most UUIDs are unique,
there are UUIDs that represent ~10% of the data. So maybe there really is
something broken in disabling the growth.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
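That kind of distribution is easy to mock up locally if anyone wants to poke at the growEnabled behavior (a sketch with hypothetical table and column names; the UUIDs are derived from md5() so no extension is required):

    -- ~90% unique keys plus one hot key covering ~10% of the rows;
    -- the duplicates all hash to the same bucket, so doubling the
    -- number of batches can never thin that batch out
    CREATE TABLE skewed_keys AS
    SELECT CASE WHEN random() < 0.10
                THEN '00000000-0000-0000-0000-000000000000'::uuid
                ELSE md5(random()::text || i::text)::uuid
           END AS k
    FROM generate_series(1, 1000000) i;
    ANALYZE skewed_keys;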
On Sat, Apr 20, 2019 at 04:46:03PM -0400, Tom Lane wrote:
> Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> > I think it's really a matter of underestimate, which convinces the planner
> > to hash the larger table. In this case, the table is 42GB, so it's possible
> > it actually works as expected. With work_mem = 4MB I've seen 32k batches,
> > and that's not that far off, I'd say. Maybe there are more common values,
> > but it does not seem like a very contrived data set.
>
> Maybe we just need to account for the per-batch buffers while estimating
> the amount of memory used during planning.  That would force this case
> into a mergejoin instead, given that work_mem is set so small.

Do you mean by adding disable_cost if work_mem is so small that it's
estimated to be exceeded ?

Justin
Justin Pryzby <pryzby@telsasoft.com> writes:
> On Sat, Apr 20, 2019 at 04:46:03PM -0400, Tom Lane wrote:
>> Maybe we just need to account for the per-batch buffers while estimating
>> the amount of memory used during planning.  That would force this case
>> into a mergejoin instead, given that work_mem is set so small.

> Do you mean by adding disable_cost if work_mem is so small that it's
> estimated to be exceeded ?

No, my point is that ExecChooseHashTableSize fails to consider the
I/O buffers at all while estimating hash table size.  It's not
immediately obvious how to factor that in, but we should.

If Tomas is right that there's also an underestimate of the number
of rows here, that might not solve Gunther's immediate problem; but
it seems like a clear costing oversight.

There's also the angle that the runtime code acts as though increasing
the number of batches is free, while it clearly isn't when you think
about the I/O buffers.  So at some point we should probably stop
increasing the number of batches on the grounds of needing too many
buffers.

			regards, tom lane
On Sat, Apr 20, 2019 at 06:20:15PM -0400, Tom Lane wrote:
>No, my point is that ExecChooseHashTableSize fails to consider the
>I/O buffers at all while estimating hash table size.  It's not
>immediately obvious how to factor that in, but we should.
>
>If Tomas is right that there's also an underestimate of the number
>of rows here, that might not solve Gunther's immediate problem; but
>it seems like a clear costing oversight.
>
>There's also the angle that the runtime code acts as though increasing
>the number of batches is free, while it clearly isn't when you think
>about the I/O buffers.  So at some point we should probably stop
>increasing the number of batches on the grounds of needing too many
>buffers.

Yes. I think it might be partially due to the cost being hidden elsewhere.
The hashjoin code only really deals with the array of pointers to BufFile,
not with the BufFiles themselves. And that might have looked insignificant
for common cases, but clearly for these corner cases it matters quite a
bit.

So yes, ExecChooseHashTableSize() needs to consider this memory and check
if doubling the number of batches has any chance of actually improving
things, because at some point the BufFile memory starts to dominate and
would just force us to do more and more batches.

But I think we should also consider this before even creating the hash
join path - see if the expected number of batches has any chance of fitting
into work_mem, and if not then just not create the path at all. Just like
we do for hash aggregate, for example. It's not going to solve cases like
this (with underestimates), but it seems reasonable. Although, maybe we
won't actually use such paths, because merge join will win thanks to being
automatically cheaper? Not sure.

Also, I wonder if we really need 8kB buffers here. Would it make sense to
allow smaller buffers in some cases? Say, 1kB. It's not going to save us,
but it's still 8x better than now.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> For me, this did the trick:
>
>   update pg_class set (relpages, reltuples) = (1000000, 1) where relname = 'tmp_r';
>   update pg_class set (relpages, reltuples) = (1, 1000000) where relname = 'tmp_q';
YES! For me too. My EXPLAIN ANALYZE actually succeeded.
 Hash Right Join  (cost=11009552.27..11377073.28 rows=11 width=4271) (actual time=511580.110..1058354.140 rows=113478386 loops=1)
   Hash Cond: (((q.documentinternalid)::text = (r.documentinternalid)::text) AND ((q.actinternalid)::text = (r.actinternalid)::text))
   ->  Seq Scan on tmp_q q  (cost=0.00..210021.00 rows=21000000 width=3417) (actual time=1.148..1.387 rows=236 loops=1)
   ->  Hash  (cost=11009552.11..11009552.11 rows=11 width=928) (actual time=511577.002..511577.002 rows=113478127 loops=1)
         Buckets: 16384 (originally 1024)  Batches: 131072 (originally 1)  Memory Usage: 679961kB
         ->  Seq Scan on tmp_r r  (cost=0.00..11009552.11 rows=11 width=928) (actual time=4.077..344558.954 rows=113478127 loops=1)
 Planning Time: 0.725 ms
 Execution Time: 1064128.721 ms

But it used a lot of resident memory, and now it seems like I actually have a leak! Because after the command returned as shown above, the memory is still allocated:
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 7100 postgres  20   0 2164012   1.1g 251364 S   0.0 14.5  23:27.23 postgres: postgres integrator [local] idle

and let's do the memory map dump:

2019-04-20 22:09:52.522 UTC [7100] LOG:  duration: 1064132.171 ms  statement: explain analyze
        SELECT *
          FROM tmp_q q
          RIGHT OUTER JOIN tmp_r r
            USING(documentInternalId, actInternalId);
TopMemoryContext: 153312 total in 8 blocks; 48168 free (70 chunks); 105144 used
  HandleParallelMessages: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
  Operator lookup cache: 24576 total in 2 blocks; 10760 free (3 chunks); 13816 used
  TableSpace cache: 8192 total in 1 blocks; 2096 free (0 chunks); 6096 used
  Type information cache: 24352 total in 2 blocks; 2624 free (0 chunks); 21728 used
  RowDescriptionContext: 8192 total in 1 blocks; 6896 free (0 chunks); 1296 used
  MessageContext: 8192 total in 1 blocks; 6896 free (1 chunks); 1296 used
  Operator class cache: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used
  smgr relation table: 32768 total in 3 blocks; 16832 free (8 chunks); 15936 used
  TransactionAbortContext: 32768 total in 1 blocks; 32512 free (0 chunks); 256 used
  Portal hash: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used
  TopPortalContext: 8192 total in 1 blocks; 7936 free (1 chunks); 256 used
  Relcache by OID: 16384 total in 2 blocks; 3512 free (2 chunks); 12872 used
  CacheMemoryContext: 1154080 total in 20 blocks; 151784 free (1 chunks); 1002296 used
    index info: 2048 total in 2 blocks; 648 free (2 chunks); 1400 used: pg_class_tblspc_relfilenode_index
    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_toast_2619_index
    index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: entity_id_fkidx
    index info: 2048 total in 2 blocks; 696 free (1 chunks); 1352 used: entity_id_idx
    ...
    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_attribute_relid_attnum_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_class_oid_index
  WAL record construction: 49768 total in 2 blocks; 6368 free (0 chunks); 43400 used
  PrivateRefCount: 8192 total in 1 blocks; 2624 free (0 chunks); 5568 used
  MdSmgr: 8192 total in 1 blocks; 4992 free (6 chunks); 3200 used
  LOCALLOCK hash: 16384 total in 2 blocks; 4600 free (2 chunks); 11784 used
  Timezones: 104120 total in 2 blocks; 2624 free (0 chunks); 101496 used
  ErrorContext: 8192 total in 1 blocks; 7936 free (3 chunks); 256 used
Grand total: 2082624 bytes in 240 blocks; 382760 free (175 chunks); 1699864 used
Strange, it shows no leak here. Now I run this test again, to see if the memory grows further in top. This time I also leave the DISTINCT step in the query; I am trying to hit the out-of-memory situation. Well, I clearly saw memory growing now:
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 7100 postgres  20   0 2601900   1.5g 251976 R  97.7 19.9  36:32.23 postgres: postgres integrator [local] EXPLAIN

TopMemoryContext: 2250520 total in 9 blocks; 45384 free (56 chunks); 2205136 used
  pgstat TabStatusArray lookup hash table: 8192 total in 1 blocks; 1456 free (0 chunks); 6736 used
  TopTransactionContext: 8192 total in 1 blocks; 7528 free (1 chunks); 664 used
  HandleParallelMessages: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
  Operator lookup cache: 24576 total in 2 blocks; 10760 free (3 chunks); 13816 used
  TableSpace cache: 8192 total in 1 blocks; 2096 free (0 chunks); 6096 used
  Type information cache: 24352 total in 2 blocks; 2624 free (0 chunks); 21728 used
  RowDescriptionContext: 8192 total in 1 blocks; 6896 free (0 chunks); 1296 used
  MessageContext: 65536 total in 4 blocks; 28664 free (9 chunks); 36872 used
  Operator class cache: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used
  smgr relation table: 32768 total in 3 blocks; 16832 free (8 chunks); 15936 used
  TransactionAbortContext: 32768 total in 1 blocks; 32512 free (0 chunks); 256 used
  Portal hash: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used
  TopPortalContext: 8192 total in 1 blocks; 7664 free (0 chunks); 528 used
    PortalHoldContext: 24632 total in 2 blocks; 7392 free (0 chunks); 17240 used
    PortalContext: 147456 total in 21 blocks; 10400 free (7 chunks); 137056 used:
      ExecutorState: 489605432 total in 57794 blocks; 5522192 free (129776 chunks); 484083240 used
        HashTableContext: 2162800 total in 6 blocks; 64848 free (35 chunks); 2097952 used
          HashBatchContext: 706576176 total in 21503 blocks; 7936 free (0 chunks); 706568240 used
        TupleSort main: 452880 total in 8 blocks; 125880 free (29 chunks); 327000 used
          Caller tuples: 4194304 total in 10 blocks; 452280 free (20 chunks); 3742024 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
  Relcache by OID: 16384 total in 2 blocks; 3512 free (2 chunks); 12872 used
  CacheMemoryContext: 1154080 total in 20 blocks; 149992 free (1 chunks); 1004088 used
    index info: 2048 total in 2 blocks; 648 free (2 chunks); 1400 used: pg_class_tblspc_relfilenode_index
    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_toast_2619_index
    ...
    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_attribute_relid_attnum_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_class_oid_index
  WAL record construction: 49768 total in 2 blocks; 6368 free (0 chunks); 43400 used
  PrivateRefCount: 8192 total in 1 blocks; 2624 free (0 chunks); 5568 used
  MdSmgr: 8192 total in 1 blocks; 4992 free (6 chunks); 3200 used
  LOCALLOCK hash: 16384 total in 2 blocks; 4600 free (2 chunks); 11784 used
  Timezones: 104120 total in 2 blocks; 2624 free (0 chunks); 101496 used
  ErrorContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
Grand total: 1207458200 bytes in 79595 blocks; 6639272 free (130033 chunks); 1200818928 used

But ExecutorState is only 489 MB, so it is in the area of slow but massive growth:
ExecutorState: 489605432 total in 57794 blocks; 5522192 free (129776 chunks); 484083240 used
  HashTableContext: 2162800 total in 6 blocks; 64848 free (35 chunks); 2097952 used
    HashBatchContext: 706576176 total in 21503 blocks; 7936 free (0 chunks); 706568240 used
  TupleSort main: 452880 total in 8 blocks; 125880 free (29 chunks); 327000 used
    Caller tuples: 4194304 total in 10 blocks; 452280 free (20 chunks); 3742024 used
  ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
  ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
  ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
  ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
  ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used

Now I see if I can run it to completion anyway, and if there will then be a new bottom of memory. The newly allocated memory seems to have been released, so we stick to the 1.1g baseline we started out with:
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 7100 postgres  20   0 2164012   1.1g 251976 D   6.3 14.5  55:26.82 postgres: postgres integrator [local] EXPLAIN
On the other hand, the sort step is not yet finished.
Also, while we might have zeroed in on a peculiar planning situation, where a very unfortunate plan is chosen that stresses the memory situation, I think the real reason for the final out-of-memory condition has not yet been determined. Remember, I have seen 3 stages in my original query:
- Steady state: sort-merge join active, high CPU%, memory at or below 100 kB
- Slow massive growth: from over 200 kB to 1.5 GB or 1.8 GB
- Explosive growth: within a second, to over 2.2 GB
It might well be that my initial query would have succeeded just fine despite the unfortunate plan with the big memory consumption on the oddly planned hash join, were it not for that third phase of explosive growth. And we haven't been able to catch this, because it happens too quickly.
It seems to me that some better memory tracing would be necessary. Looking at src/backend/utils/memdebug.c, I see it mentions Valgrind. But to me, Valgrind would be a huge toolbox to bring in just to look after one thing. I wonder if we could not make a much simpler memory leak debugger tool: one that is fast, yet doesn't produce so much output that it overwhelms the log destination file system (and wastes too much time). There are already debug macros there which, if enabled, create an absolutely crazy amount of undecipherable log file content, because ALL backend processes would spit out this blabber. So I already stopped that by adding a variable that must be set to 1 (using the debugger on exactly one process for exactly the time desired):
int _alloc_info = 0;
#ifdef HAVE_ALLOCINFO
#define AllocFreeInfo(_cxt, _chunk) \
    if (_alloc_info) \
        fprintf(stderr, "AllocFree: %s: %p, %zu\n", \
                (_cxt)->header.name, (_chunk), (_chunk)->size)
#define AllocAllocInfo(_cxt, _chunk) \
    if (_alloc_info) \
        fprintf(stderr, "AllocAlloc: %s: %p, %zu\n", \
                (_cxt)->header.name, (_chunk), (_chunk)->size)
#else
#define AllocFreeInfo(_cxt, _chunk)
#define AllocAllocInfo(_cxt, _chunk)
#endif
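For reference, the switch can then be flipped from gdb on just the one backend of interest; a sketch using standard gdb commands, with the PID taken from the top output above:

    $ gdb -p 7100
    (gdb) set var _alloc_info = 1
    ... let the suspect code run for the time desired ...
    (gdb) set var _alloc_info = 0
    (gdb) detach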
But now I am thinking that this should be the hook for a limited cache in which we can cancel out AllocSetAlloc calls against the AllocSetFree calls that follow relatively rapidly, which apparently account for the majority of the log chatter created.
The memory debugger would allocate a single fixed memory chunk, like 8 or 16 kB, as a cache for each memory context that is actually traced. In it we would record each memory allocation in the shortest possible way, with everything compressed: instead of a pointer, the chunk would be referenced by some chunk index, and a very brief backtrace would be stored in compressed form, as the instruction pointer ($rip) of the caller followed by variable-length differences to the $rip of each caller further up. These could even be stored in an index of the 100 most common caller chains to compress this data some more, while minimizing the search cost.

Each allocated chunk would be kept in this table, and when the table fills up, the oldest allocated chunk is evicted and written to the log file. When a chunk is freed before being evicted from the cache, it is silently removed; when a chunk is freed that is no longer in the cache, the free event is recorded in the log. That way, only those chunks that have endured beyond the capacity of the cache get written to the log files, and hopefully those are the chunks most likely involved in the memory leak. Once the log has been created, it can be loaded into a PostgreSQL table itself and analyzed to find the chunks that never get freed, and from the compressed backtraces one can figure out where they were allocated.
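As a thought experiment, here is a minimal sketch of that eviction cache in C. All names are invented here, this is not code from the PostgreSQL tree, and the backtrace compression and caller-chain index are left out:

    /* Hypothetical tracer: a small fixed-size cache of live allocations per
     * traced context.  An alloc that is freed while still cached is
     * cancelled silently; entries evicted before being freed are the
     * long-lived chunks that get logged, i.e. the leak suspects. */
    #include <stdio.h>
    #include <stddef.h>

    #define TRACE_CACHE_SIZE 512        /* fits in a few kB per context */

    typedef struct TraceEntry
    {
        const void *chunk;              /* NULL means empty slot */
        size_t      size;
        unsigned    seq;                /* insertion order, for eviction */
    } TraceEntry;

    static TraceEntry trace_cache[TRACE_CACHE_SIZE];
    static unsigned trace_seq = 0;      /* wraparound ignored in this sketch */

    static void
    trace_alloc(const void *chunk, size_t size)
    {
        TraceEntry *victim = &trace_cache[0];
        int         i;

        /* pick an empty slot, or evict the oldest entry to the log */
        for (i = 0; i < TRACE_CACHE_SIZE; i++)
        {
            if (trace_cache[i].chunk == NULL)
            {
                victim = &trace_cache[i];
                break;
            }
            if (trace_cache[i].seq < victim->seq)
                victim = &trace_cache[i];
        }
        if (victim->chunk != NULL)
            fprintf(stderr, "long-lived alloc: %p, %zu\n",
                    victim->chunk, victim->size);

        victim->chunk = chunk;
        victim->size = size;
        victim->seq = trace_seq++;
    }

    static void
    trace_free(const void *chunk)
    {
        int i;

        for (i = 0; i < TRACE_CACHE_SIZE; i++)
        {
            if (trace_cache[i].chunk == chunk)
            {
                trace_cache[i].chunk = NULL;    /* cancelled, never logged */
                return;
            }
        }
        /* freed after eviction: record the free so it can be matched later */
        fprintf(stderr, "late free: %p\n", chunk);
    }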
BTW, my explain analyze is still running. That Sort - Unique step is taking forever on this data.
OK, now I will try the various patches that people sent.
-Gunther
On Sat, Apr 20, 2019 at 08:33:46PM -0400, Gunther wrote:
>On 4/20/2019 16:01, Tomas Vondra wrote:
>>For me, this did the trick:
>>  update pg_class set (relpages, reltuples) = (1000000, 1) where relname = 'tmp_r';
>>  update pg_class set (relpages, reltuples) = (1, 1000000) where relname = 'tmp_q';
>
>YES! For me too. My EXPLAIN ANALYZE actually succeeded.
>...
>But it used a lot of resident memory, and now it seems like I actually
>have a leak! Because after the command returned as shown above, the
>memory is still allocated:
>...
>strange, it shows no leak here. Now I run this test again, to see if
>the memory grows further in top? This time I also leave the DISTINCT
>step in the query. I am trying to hit the out of memory situation.
>Well, I clearly saw memory growing now:
>

Unfortunately, interpreting RES is way more complicated. The trouble is that PostgreSQL does not get memory from the kernel directly, it gets it through glibc. So when we do free(), it is not guaranteed the kernel gets it back. Also, glibc has multiple ways of getting memory from the kernel: it can use either mmap or sbrk, and AFAIK it is easy to create "islands" that make it impossible to undo sbrk after freeing memory.

Memory leaks in PostgreSQL are usually about allocating memory in the wrong context, and so are visible in MemoryContextStats. Permanent leaks that don't show up there are quite rare.

>Also, I think while we might have focused in on a peculiar planning
>situation where a very unfortunate plan is chosen which stresses the
>memory situation, the real reason for the final out of memory
>situation has not yet been determined. Remember, I have seen 3 stages
>in my original query:
>...
>It seems to me that some better memory tracing would be necessary.
>...
>BTW, my explain analyze is still running. That Sort - Unique step is
>taking forever on this data.
>
>OK, now I will try the various patches that people sent.
>

Maybe. But before wasting any more time on the memory leak investigation, I suggest you first try the patch moving the BufFile allocations to a separate context. That'll either confirm or disprove the theory.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> Maybe. But before wasting any more time on the memory leak investigation,
> I suggest you first try the patch moving the BufFile allocations to a
> separate context. That'll either confirm or disprove the theory.
OK, fair enough. So, first patch 0001-* applied, recompiled, and:
2019-04-21 04:08:04.364 UTC [11304] LOG:  server process (PID 11313) was terminated by signal 11: Segmentation fault
2019-04-21 04:08:04.364 UTC [11304] DETAIL:  Failed process was running: explain analyze select * from reports.v_BusinessOperation;
2019-04-21 04:08:04.364 UTC [11304] LOG:  terminating any other active server processes
2019-04-21 04:08:04.368 UTC [11319] FATAL:  the database system is in recovery mode
2019-04-21 04:08:04.368 UTC [11315] WARNING:  terminating connection because of crash of another server process
SIGSEGV ... and with the core dump that I have I can tell you where:
Core was generated by `postgres: postgres integrator'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00000000009f300c in palloc (size=8272) at mcxt.c:936
936             context->isReset = false;
(gdb) bt
#0  0x00000000009f300c in palloc (size=8272) at mcxt.c:936
#1  0x000000000082b068 in makeBufFileCommon (nfiles=1) at buffile.c:116
#2  0x000000000082b0f8 in makeBufFile (firstfile=73) at buffile.c:138
#3  0x000000000082b283 in BufFileCreateTemp (interXact=false) at buffile.c:201
#4  0x00000000006bdc15 in ExecHashJoinSaveTuple (tuple=0x1c5a468, hashvalue=3834163156, fileptr=0x18a3730) at nodeHashjoin.c:1227
#5  0x00000000006b9568 in ExecHashTableInsert (hashtable=0x188fb88, slot=0x1877a18, hashvalue=3834163156) at nodeHash.c:1701
#6  0x00000000006b6c39 in MultiExecPrivateHash (node=0x1862168) at nodeHash.c:186
#7  0x00000000006b6aec in MultiExecHash (node=0x1862168) at nodeHash.c:114
#8  0x00000000006a19cc in MultiExecProcNode (node=0x1862168) at execProcnode.c:501
#9  0x00000000006bc5d2 in ExecHashJoinImpl (pstate=0x17b90e0, parallel=false) at nodeHashjoin.c:290
...
(gdb) info frame
Stack level 0, frame at 0x7fffd5d4dc80:
 rip = 0x9f300c in palloc (mcxt.c:936); saved rip = 0x82b068
 called by frame at 0x7fffd5d4dcb0
 source language c.
 Arglist at 0x7fffd5d4dc70, args: size=8272
 Locals at 0x7fffd5d4dc70, Previous frame's sp is 0x7fffd5d4dc80
 Saved registers:
  rbx at 0x7fffd5d4dc60, rbp at 0x7fffd5d4dc70, r12 at 0x7fffd5d4dc68, rip at 0x7fffd5d4dc78

And I have confirmed that this is while working the main T_HashJoin with jointype JOIN_RIGHT.
So now I am assuming that perhaps you want both of these patches applied. So I applied the second one as well, retried, and boom: same thing, same place.
turns out the MemoryContext is NULL:
(gdb) p context
$1 = (MemoryContext) 0x0
All patches applied cleanly (with the -p1 option). I see no .rej file, but also no .orig file; not sure why that version of patch didn't create them. But I paid attention and know that there was no error.
-Gunther
I am now running Justin's patch, after undoing Tomas' patches and any of my own hacks (which would not have interfered with Tomas' patch).
With my patch, the number of batches is nonlinear WRT work_mem, and reaches a maximum for moderately small work_mem. The goal is to choose the optimal number of batches to minimize the degree to which work_mem is exceeded.
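To illustrate why the batch count is nonlinear in this model, here is a toy calculation. This is my own sketch, not Justin's actual patch; it assumes each added batch costs two ~8 kB BufFile buffers, one per join side:

    /* Toy model, not the actual patch: total memory for a hash join is
     * roughly (inner side / nbatch) for the in-memory batch plus
     * 2 * nbatch * BUFFILE_OVERHEAD for the batch files.  Walk the powers
     * of two and keep the minimum. */
    #define BUFFILE_OVERHEAD 8272.0   /* approx. sizeof(BufFile) + 8 kB buffer */

    static int
    choose_nbatch(double inner_bytes)
    {
        int     nbatch = 1;
        int     best_nbatch = 1;
        double  mem;
        double  best_mem = inner_bytes;   /* nbatch = 1 needs no batch files */

        while (nbatch < (1 << 20))        /* arbitrary cap for the sketch */
        {
            nbatch *= 2;
            mem = inner_bytes / nbatch + 2.0 * nbatch * BUFFILE_OVERHEAD;
            if (mem < best_mem)
            {
                best_mem = mem;
                best_nbatch = nbatch;
            }
        }
        return best_nbatch;
    }

Halving the in-memory batch saves less and less while the buffer cost keeps doubling, so the total has a minimum at some moderate nbatch; past that point, more batches only add memory.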
Now I seem to be in that slow massive growth phase, or maybe still in an earlier step, but I can already see different top RES behavior. The size lingers around 400 MB. But then it's growing too, at high CPU%: past 700, 800, 900 MB, now 1.5 GB, 1.7 GB, 1.8 GB, 1.9 GB, 2.0 GB, 2.1, and still 98% CPU. 2.4 GB, wow, it has never been that big ... and BOOM!
TopMemoryContext: 120544 total in 7 blocks; 9760 free (7 chunks); 110784 used
  HandleParallelMessages: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
  TableSpace cache: 8192 total in 1 blocks; 2096 free (0 chunks); 6096 used
  Type information cache: 24352 total in 2 blocks; 2624 free (0 chunks); 21728 used
  pgstat TabStatusArray lookup hash table: 8192 total in 1 blocks; 416 free (0 chunks); 7776 used
  TopTransactionContext: 8192 total in 1 blocks; 6680 free (0 chunks); 1512 used
  RowDescriptionContext: 8192 total in 1 blocks; 6896 free (0 chunks); 1296 used
  MessageContext: 32768 total in 3 blocks; 13488 free (10 chunks); 19280 used
  Operator class cache: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used
  smgr relation table: 32768 total in 3 blocks; 16832 free (8 chunks); 15936 used
  TransactionAbortContext: 32768 total in 1 blocks; 32512 free (0 chunks); 256 used
  Portal hash: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used
  TopPortalContext: 8192 total in 1 blocks; 7664 free (0 chunks); 528 used
    PortalHoldContext: 24632 total in 2 blocks; 7392 free (0 chunks); 17240 used
    PortalContext: 1482752 total in 184 blocks; 11216 free (9 chunks); 1471536 used:
      ExecutorState: 2361536 total in 27 blocks; 1827536 free (3163 chunks); 534000 used
        TupleSort main: 3957712 total in 22 blocks; 246792 free (39 chunks); 3710920 used
        TupleSort main: 4219912 total in 23 blocks; 246792 free (39 chunks); 3973120 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        HashTableContext: 8192 total in 1 blocks; 7336 free (6 chunks); 856 used
          HashBatchContext: 2523874568 total in 76816 blocks; 7936 free (0 chunks); 2523866632 used
        TupleSort main: 41016 total in 3 blocks; 6504 free (6 chunks); 34512 used
          Caller tuples: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ...
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
  Relcache by OID: 16384 total in 2 blocks; 3512 free (2 chunks); 12872 used
  CacheMemoryContext: 1101328 total in 14 blocks; 288672 free (1 chunks); 812656 used
    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_toast_2619_index
    index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: entity_id_fkidx
    ...
    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_attribute_relid_attnum_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_class_oid_index
  WAL record construction: 49768 total in 2 blocks; 6368 free (0 chunks); 43400 used
  PrivateRefCount: 8192 total in 1 blocks; 2624 free (0 chunks); 5568 used
  MdSmgr: 8192 total in 1 blocks; 6176 free (1 chunks); 2016 used
  LOCALLOCK hash: 16384 total in 2 blocks; 4600 free (2 chunks); 11784 used
  Timezones: 104120 total in 2 blocks; 2624 free (0 chunks); 101496 used
  ErrorContext: 8192 total in 1 blocks; 7936 free (4 chunks); 256 used
Grand total: 2538218304 bytes in 77339 blocks; 3075256 free (3372 chunks); 2535143048 used

2019-04-21 05:27:07.731 UTC [968] ERROR:  out of memory
2019-04-21 05:27:07.731 UTC [968] DETAIL:  Failed on request of size 32800 in memory context "HashBatchContext".
2019-04-21 05:27:07.731 UTC [968] STATEMENT:  explain analyze select * from reports.v_BusinessOperation;
So we're ending up with the same problem.
No cigar. But lots of admiration and gratitude for all your attempts to pinpoint this.
Also, again, if anyone (of the trusted people) wants access to hack directly, I can provide.
regards,
-Gunther
On Sun, Apr 21, 2019 at 01:03:50AM -0400, Gunther wrote:
> On 4/20/2019 21:14, Tomas Vondra wrote:
> >Maybe. But before wasting any more time on the memory leak investigation,
> >I suggest you first try the patch moving the BufFile allocations to a
> >separate context. That'll either confirm or disprove the theory.
>
> OK, fair enough. So, first patch 0001-* applied, recompiled and
>
> 2019-04-21 04:08:04.364 UTC [11304] LOG:  server process (PID 11313) was terminated by signal 11: Segmentation fault
...
> turns out the MemoryContext is NULL:
>
> (gdb) p context
> $1 = (MemoryContext) 0x0

I updated Tomas' patch to unconditionally set the context.
(Note, oldctx vs oldcxt is fairly subtle but I think deliberate?)

Justin
Attachment
I was able to reproduce in a somewhat contrived way:

sh -c 'ulimit -v 1024000 ; /usr/local/pgsql/bin/postgres -D ./pg12dev5 -cport=1234' &

postgres=# SET work_mem='64kB';
SET client_min_messages = debug1;
SET log_statement_stats = on;
explain(analyze) WITH v AS MATERIALIZED (
    SELECT * FROM generate_series(1,9999999) i
    WHERE i%10<10 AND i%11<11 AND i%12<12 AND i%13<13 AND i%14<14 AND i%15<15
      AND i%16<16 AND i%17<17 AND i%18<18 AND i%19<19 AND i%20<20 AND i%21<21
) SELECT * FROM generate_series(1,99) k JOIN v ON k=i;

 HashTableContext: 8192 total in 1 blocks; 7696 free (7 chunks); 496 used
   hash batch files: 852065104 total in 101318 blocks; 951896 free (20 chunks); 851113208 used
   HashBatchContext: 73888 total in 4 blocks; 24280 free (6 chunks); 49608 used

2019-04-21 04:11:02.521 CDT [4156] ERROR:  out of memory
2019-04-21 04:11:02.521 CDT [4156] DETAIL:  Failed on request of size 8264 in memory context "hash batch files".
On Sun, Apr 21, 2019 at 03:08:22AM -0500, Justin Pryzby wrote:
>On Sun, Apr 21, 2019 at 01:03:50AM -0400, Gunther wrote:
>> On 4/20/2019 21:14, Tomas Vondra wrote:
>> >Maybe. But before wasting any more time on the memory leak investigation,
>> >I suggest you first try the patch moving the BufFile allocations to a
>> >separate context. That'll either confirm or disprove the theory.
>>
>> OK, fair enough. So, first patch 0001-* applied, recompiled and
>>
>> 2019-04-21 04:08:04.364 UTC [11304] LOG:  server process (PID 11313) was terminated by signal 11: Segmentation fault
>...
>> turns out the MemoryContext is NULL:
>>
>> (gdb) p context
>> $1 = (MemoryContext) 0x0
>
>I updated Tomas' patch to unconditionally set the context.
>(Note, oldctx vs oldcxt is fairly subtle but I think deliberate?)
>

I don't follow - is there a typo confusing oldctx vs. oldcxt? I don't think so, but I might have missed something. (I always struggle with which spelling is the right one.)

I think the bug is actually much simpler - the memory context was created only in ExecHashIncreaseNumBatches() when starting with nbatch=1. But when the initial nbatch value was higher (i.e. when starting with 2 or more batches), it was left NULL. That was OK for testing with the contrived data set, but it may easily break on other examples.

So here is an updated patch - hopefully this version works. I don't have time to do much more testing now, though. But it compiles.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> Considering how rare this issue likely is, we need to be looking for a
> solution that does not break the common case.
Agreed. What I think we need to focus on next is why the code keeps
increasing the number of batches. It seems like there must be an undue
amount of data all falling into the same bucket ... but if it were simply
a matter of a lot of duplicate hash keys, the growEnabled shutoff
heuristic ought to trigger.
Jeff Janes <jeff.janes@gmail.com> writes:
> The growEnabled stuff only prevents infinite loops. It doesn't prevent
> extreme silliness.

> If a single 32 bit hash value has enough tuples by itself to not fit in
> work_mem, then it will keep splitting until that value is in a batch by
> itself before shutting off

Right, that's the code's intention. If that's not good enough for this case, we'll need to understand the details a bit better before we can design a better(?) heuristic.

I suspect, however, that we might be better off just taking the existence of the I/O buffers into account somehow while deciding whether it's worth growing further. That is, I'm imagining adding a second independent reason for shutting off growEnabled, along the lines of "increasing nbatch any further will require an unreasonable amount of buffer memory". The question then becomes how to define "unreasonable".

			regards, tom lane
On Sun, Apr 21, 2019 at 10:36:43AM -0400, Tom Lane wrote:
>Jeff Janes <jeff.janes@gmail.com> writes:
>> The growEnabled stuff only prevents infinite loops. It doesn't prevent
>> extreme silliness.
>
>> If a single 32 bit hash value has enough tuples by itself to not fit in
>> work_mem, then it will keep splitting until that value is in a batch by
>> itself before shutting off
>
>Right, that's the code's intention. If that's not good enough for this
>case, we'll need to understand the details a bit better before we can
>design a better(?) heuristic.
>

I think we only disable growing when there are no other values in the batch, but that seems rather easy to defeat - all you need is a single tuple with a hash that falls into the same batch, and it's over. Maybe we should make this a bit less accurate - say, if less than 5% of memory gets freed, don't add more batches.

>I suspect, however, that we might be better off just taking the existence
>of the I/O buffers into account somehow while deciding whether it's worth
>growing further. That is, I'm imagining adding a second independent
>reason for shutting off growEnabled, along the lines of "increasing
>nbatch any further will require an unreasonable amount of buffer memory".
>The question then becomes how to define "unreasonable".
>

I think the question the code needs to be asking is "If we double the number of batches, does the amount of memory we need drop?" And the memory needs to account both for the buffers and the per-batch data.

I don't think we can just stop increasing the number of batches when the memory for BufFile exceeds work_mem, because that entirely ignores the fact that by doing so we force the system to keep the per-batch stuff in memory (and that can be an almost arbitrary amount).

What I think we should be doing instead is to make the threshold dynamic - instead of just checking against work_mem, we need to increment the number of batches only when the total amount of memory exceeds

    Max(work_mem, 3 * memory_for_buffiles)

This is based on the observation that by increasing the number of batches, we double memory_for_buffiles and split the per-batch data in half. Below that threshold, adding more batches would actually increase the total amount of memory used.

Of course, this just stops enforcing work_mem at some point, but it at least attempts to minimize the amount of memory used.

An alternative would be spilling the extra tuples into a special overflow file, as I explained earlier. That would actually enforce work_mem, I think.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
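A minimal sketch of that dynamic threshold, with names invented here (worth_doubling_nbatch is not from any patch in this thread, and BUFFILE_OVERHEAD approximates sizeof(BufFile) plus its 8 kB buffer):

    #include <stdbool.h>
    #include <stddef.h>

    #define BUFFILE_OVERHEAD 8272    /* bytes per BufFile, approximate */

    /* Grow only while the total memory (hash table plus batch files)
     * exceeds Max(work_mem, 3 * memory_for_buffiles); below that point,
     * doubling nbatch increases total memory instead of reducing it. */
    static bool
    worth_doubling_nbatch(size_t spaceUsed, size_t work_mem_bytes, int nbatch)
    {
        size_t  buffile_mem = (size_t) 2 * nbatch * BUFFILE_OVERHEAD;
        size_t  total_mem = spaceUsed + buffile_mem;
        size_t  threshold = work_mem_bytes > 3 * buffile_mem
                          ? work_mem_bytes : 3 * buffile_mem;

        return total_mem > threshold;
    }

The 3x factor follows from the halving/doubling argument: doubling nbatch saves spaceUsed/2 but costs another buffile_mem, which is a win only while spaceUsed > 2 * buffile_mem, i.e. while the total exceeds 3 * buffile_mem.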
On Sun, Apr 21, 2019 at 10:36:43AM -0400, Tom Lane wrote:
> Jeff Janes <jeff.janes@gmail.com> writes:
> > The growEnabled stuff only prevents infinite loops. It doesn't prevent
> > extreme silliness.
>
> > If a single 32 bit hash value has enough tuples by itself to not fit in
> > work_mem, then it will keep splitting until that value is in a batch by
> > itself before shutting off
>
> I suspect, however, that we might be better off just taking the existence
> of the I/O buffers into account somehow while deciding whether it's worth
> growing further. That is, I'm imagining adding a second independent
> reason for shutting off growEnabled, along the lines of "increasing
> nbatch any further will require an unreasonable amount of buffer memory".
> The question then becomes how to define "unreasonable".

On Sun, Apr 21, 2019 at 06:15:25PM +0200, Tomas Vondra wrote:
> I think the question the code needs to be asking is "If we double the
> number of batches, does the amount of memory we need drop?" And the
> memory needs to account both for the buffers and per-batch data.
>
> I don't think we can just stop increasing the number of batches when the
> memory for BufFile exceeds work_mem, because that entirely ignores the
> fact that by doing that we force the system to keep the per-batch stuff
> in memory (and that can be almost arbitrary amount).
...
> Of course, this just stops enforcing work_mem at some point, but it at
> least attempts to minimize the amount of memory used.

This patch defines reasonable as "additional BatchFiles will not themselves exceed work_mem; OR, work_mem was already exceeded but additional BatchFiles are going to save us RAM"...

I think the first condition is insensitive and not too important to get right; it only allows work_mem to be exceeded by 2x, which maybe already happens for multiple reasons, related to this thread and otherwise. It'd be fine to slap on a factor of /2 or /4 or /8 there too.

The current patch doesn't unset growEnabled, since there's no point at which the hashtable should grow without bound: if hash tables are *already* exceeding work_mem by 2x, nbatch should be doubled.

Justin
Attachment
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> On Sun, Apr 21, 2019 at 10:36:43AM -0400, Tom Lane wrote:
>> Jeff Janes <jeff.janes@gmail.com> writes:
>>> If a single 32 bit hash value has enough tuples by itself to not fit in
>>> work_mem, then it will keep splitting until that value is in a batch by
>>> itself before shutting off

>> Right, that's the code's intention. If that's not good enough for this
>> case, we'll need to understand the details a bit better before we can
>> design a better(?) heuristic.

> I think we only disable growing when there are no other values in the
> batch, but that seems rather easy to defeat - all you need is a single
> tuple with a hash that falls into the same batch, and it's over. Maybe
> we should make this a bit less accurate - say, if less than 5% memory
> gets freed, don't add more batches.

Yeah, something like that, but it's hard to design it without seeing some concrete misbehaving examples. I think though that this is somewhat independent of the problem that we're not including the I/O buffers in our reasoning about memory consumption.

> An alternative would be spilling the extra tuples into a special
> overflow file, as I explained earlier. That would actually enforce
> work_mem I think.

Well, no, it won't. If you have umpteen gigabytes of RHS tuples with the exact same hash code, nothing we can do here is going to prevent you from having to process those in a single table load. (If it's a plain inner join, maybe you could break that into subsections anyway ... but that won't work for left or full joins where you need per-tuple match status.)

I think our ambition here should just be to not have the code go crazy trying to keep its memory consumption under work_mem when it's ultimately going to fail to do so anyhow.

			regards, tom lane
After applying Tomas' corrected patch 0001, and routing HJDEBUG messages to stderr:
integrator=# set enable_nestloop to off;
SET
integrator=# explain analyze select * from reports.v_BusinessOperation;
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 16 to 32
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 32 to 64
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 64 to 128
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 128 to 256
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 256 to 512
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 512 to 1024
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 1024 to 2048
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 2048 to 4096
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 4096 to 8192
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 8192 to 16384
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 16384 to 32768
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 32768 to 65536
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 65536 to 131072
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 131072 to 262144
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 262144 to 524288
ERROR:  out of memory
DETAIL:  Failed on request of size 32800 in memory context "HashBatchContext".
Now
TopMemoryContext: 4347672 total in 9 blocks; 41688 free (18 chunks); 4305984 used
  HandleParallelMessages: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
  TableSpace cache: 8192 total in 1 blocks; 2096 free (0 chunks); 6096 used
  Type information cache: 24352 total in 2 blocks; 2624 free (0 chunks); 21728 used
  pgstat TabStatusArray lookup hash table: 8192 total in 1 blocks; 416 free (0 chunks); 7776 used
  TopTransactionContext: 8192 total in 1 blocks; 5416 free (2 chunks); 2776 used
  RowDescriptionContext: 8192 total in 1 blocks; 6896 free (0 chunks); 1296 used
  MessageContext: 32768 total in 3 blocks; 13488 free (10 chunks); 19280 used
  Operator class cache: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used
  smgr relation table: 32768 total in 3 blocks; 16832 free (8 chunks); 15936 used
  TransactionAbortContext: 32768 total in 1 blocks; 32512 free (0 chunks); 256 used
  Portal hash: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used
  TopPortalContext: 8192 total in 1 blocks; 7664 free (0 chunks); 528 used
    PortalHoldContext: 24632 total in 2 blocks; 7392 free (0 chunks); 17240 used
    PortalContext: 1482752 total in 184 blocks; 11216 free (8 chunks); 1471536 used:
      ExecutorState: 2449896 total in 16 blocks; 1795000 free (3158 chunks); 654896 used
        HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        HashTableContext: 16384 total in 2 blocks; 3008 free (6 chunks); 13376 used
          HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        HashTableContext: 8192 total in 1 blocks; 7320 free (0 chunks); 872 used
          HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        HashTableContext: 8192 total in 1 blocks; 7320 free (0 chunks); 872 used
          HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        HashTableContext: 8192 total in 1 blocks; 7752 free (0 chunks); 440 used
          HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used
        HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        HashTableContext: 8192 total in 1 blocks; 7624 free (0 chunks); 568 used
          HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used
        TupleSort main: 1073512 total in 11 blocks; 246792 free (39 chunks); 826720 used
        TupleSort main: 286912 total in 8 blocks; 246792 free (39 chunks); 40120 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        HashBatchFiles: 2242545904 total in 266270 blocks; 3996232 free (14164 chunks); 2238549672 used
        HashTableContext: 8192 total in 1 blocks; 7624 free (5 chunks); 568 used
          HashBatchContext: 168165080 total in 5118 blocks; 7936 free (0 chunks); 168157144 used
        TupleSort main: 452880 total in 8 blocks; 126248 free (27 chunks); 326632 used
          Caller tuples: 1048576 total in 8 blocks; 21608 free (14 chunks); 1026968 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ...
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
  Relcache by OID: 16384 total in 2 blocks; 3512 free (2 chunks); 12872 used
  CacheMemoryContext: 1101328 total in 14 blocks; 288672 free (1 chunks); 812656 used
    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_toast_2619_index
    index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: entity_id_fkidx
    ...
    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_attribute_relid_attnum_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_class_oid_index
  WAL record construction: 49768 total in 2 blocks; 6368 free (0 chunks); 43400 used
  PrivateRefCount: 8192 total in 1 blocks; 2624 free (0 chunks); 5568 used
  MdSmgr: 8192 total in 1 blocks; 6560 free (1 chunks); 1632 used
  LOCALLOCK hash: 16384 total in 2 blocks; 4600 free (2 chunks); 11784 used
  Timezones: 104120 total in 2 blocks; 2624 free (0 chunks); 101496 used
  ErrorContext: 8192 total in 1 blocks; 7936 free (5 chunks); 256 used
Grand total: 2424300520 bytes in 271910 blocks; 7332360 free (17596 chunks); 2416968160 used

2019-04-21 19:50:21.338 UTC [6974] ERROR:  out of memory
2019-04-21 19:50:21.338 UTC [6974] DETAIL:  Failed on request of size 32800 in memory context "HashBatchContext".
2019-04-21 19:50:21.338 UTC [6974] STATEMENT:  explain analyze select * from reports.v_BusinessOperation;
Next I'll apply Tomas' corrected 0002 patch on top of this and see ...
-Gunther
On Sun, Apr 21, 2019 at 07:25:15PM -0400, Gunther wrote:
> After applying Tomas' corrected patch 0001, and routing HJDEBUG messages
> to stderr:
>
> integrator=# set enable_nestloop to off;
> SET
> integrator=# explain analyze select * from reports.v_BusinessOperation;
>
> ...
> WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 131072 to 262144
> WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 262144 to 524288
> ERROR:  out of memory
> DETAIL:  Failed on request of size 32800 in memory context "HashBatchContext".
>
> Now
>
> TopMemoryContext: 4347672 total in 9 blocks; 41688 free (18 chunks); 4305984 used
> ...
> Portal hash: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used
> TopPortalContext: 8192 total in 1 blocks; 7664 free (0 chunks); 528 used
>   PortalHoldContext: 24632 total in 2 blocks; 7392 free (0 chunks); 17240 used
>   PortalContext: 1482752 total in 184 blocks; 11216 free (8 chunks); 1471536 used:
>     ExecutorState: 2449896 total in 16 blocks; 1795000 free (3158 chunks); 654896 used
>       TupleSort main: 286912 total in 8 blocks; 246792 free (39 chunks); 40120 used
>       ...
>       ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
>       HashBatchFiles: 2242545904 total in 266270 blocks; 3996232 free (14164 chunks); 2238549672 used
>       HashTableContext: 8192 total in 1 blocks; 7624 free (5 chunks); 568 used
>         HashBatchContext: 168165080 total in 5118 blocks; 7936 free (0 chunks); 168157144 used
>       TupleSort main: 452880 total in 8 blocks; 126248 free (27 chunks); 326632 used
>         Caller tuples: 1048576 total in 8 blocks; 21608 free (14 chunks); 1026968 used
> ...
> Grand total: 2424300520 bytes in 271910 blocks; 7332360 free (17596 chunks); 2416968160 used
>

IMO this pretty much proves that the memory allocated for BufFile really is the root cause of the issues with this query. 524288 batches means up to 1048576 BufFiles, which is a bit more than ~8GB of RAM (1,048,576 buffers at a bit over 8 kB each). However, those for the inner relation were not allocated yet, so at this point only about 4GB might be allocated. And it seems ~1/2 of them did not receive any tuples, so only about 2GB got allocated so far.

The second batch will probably make it fail much sooner, because it allocates the BufFile stuff eagerly (both for inner and outer side).

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
After applying Tomas' patch 0002 as corrected, over 0001, same thing:
integrator=# set enable_nestloop to off;
SET
integrator=# explain analyze select * from reports.v_BusinessOperation;
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 16 to 32
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 32 to 64
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 64 to 128
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 128 to 256
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 256 to 512
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 512 to 1024
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 1024 to 2048
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 2048 to 4096
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 4096 to 8192
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 8192 to 16384
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 16384 to 32768
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 32768 to 65536
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 65536 to 131072
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 131072 to 262144
ERROR:  out of memory
DETAIL:  Failed on request of size 8272 in memory context "HashBatchFiles".
And from the log:
2019-04-21 23:29:33.497 UTC [8890] LOG:  database system was shut down at 2019-04-21 23:29:33 UTC
2019-04-21 23:29:33.507 UTC [8888] LOG:  database system is ready to accept connections
Hashjoin 0x1732b88: initial nbatch = 16, nbuckets = 8192
2019-04-21 23:31:54.447 UTC [8896] WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 16 to 32
2019-04-21 23:31:54.447 UTC [8896] LOG:  ExecHashIncreaseNumBatches ======= context stats start =======
2019-04-21 23:31:54.447 UTC [8896] STATEMENT:  explain analyze select * from reports.v_BusinessOperation;
TopMemoryContext: 120544 total in 7 blocks; 10016 free (6 chunks); 110528 used
  HandleParallelMessages: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
  TableSpace cache: 8192 total in 1 blocks; 2096 free (0 chunks); 6096 used
  Type information cache: 24352 total in 2 blocks; 2624 free (0 chunks); 21728 used
  pgstat TabStatusArray lookup hash table: 8192 total in 1 blocks; 416 free (0 chunks); 7776 used
  TopTransactionContext: 8192 total in 1 blocks; 6680 free (0 chunks); 1512 used
  RowDescriptionContext: 8192 total in 1 blocks; 6896 free (0 chunks); 1296 used
  MessageContext: 32768 total in 3 blocks; 13488 free (10 chunks); 19280 used
  Operator class cache: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used
  smgr relation table: 32768 total in 3 blocks; 16832 free (8 chunks); 15936 used
  TransactionAbortContext: 32768 total in 1 blocks; 32512 free (0 chunks); 256 used
  Portal hash: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used
  TopPortalContext: 8192 total in 1 blocks; 7664 free (0 chunks); 528 used
    PortalHoldContext: 24632 total in 2 blocks; 7392 free (0 chunks); 17240 used
    PortalContext: 1482752 total in 184 blocks; 11216 free (8 chunks); 1471536 used:
      ExecutorState: 647368 total in 10 blocks; 197536 free (13 chunks); 449832 used
        TupleSort main: 4219912 total in 23 blocks; 246792 free (39 chunks); 3973120 used
        TupleSort main: 4219912 total in 23 blocks; 246792 free (39 chunks); 3973120 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        HashBatchFiles: 258032 total in 31 blocks; 6208 free (0 chunks); 251824 used
        HashTableContext: 8192 total in 1 blocks; 7624 free (5 chunks); 568 used
          HashBatchContext: 4279352 total in 130 blocks; 7936 free (0 chunks); 4271416 used
        TupleSort main: 41016 total in 3 blocks; 6504 free (6 chunks); 34512 used
          Caller tuples: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ...
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
  Relcache by OID: 16384 total in 2 blocks; 3512 free (2 chunks); 12872 used
  CacheMemoryContext: 1101328 total in 14 blocks; 288672 free (1 chunks); 812656 used
    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_toast_2619_index
    index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: entity_id_fkidx
    ...
    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_attribute_relid_attnum_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_class_oid_index
  WAL record construction: 49768 total in 2 blocks; 6368 free (0 chunks); 43400 used
  PrivateRefCount: 8192 total in 1 blocks; 2624 free (0 chunks); 5568 used
  MdSmgr: 8192 total in 1 blocks; 6560 free (1 chunks); 1632 used
  LOCALLOCK hash: 16384 total in 2 blocks; 4600 free (2 chunks); 11784 used
  Timezones: 104120 total in 2 blocks; 2624 free (0 chunks); 101496 used
  ErrorContext: 8192 total in 1 blocks; 7936 free (5 chunks); 256 used
Grand total: 17429152 bytes in 668 blocks; 1452392 free (220 chunks); 15976760 used
Hashjoin 0x1732b88: increasing nbatch to 32 because space = 4128933
Hashjoin 0x1732b88: freed 148 of 10584 tuples, space now 4071106
2019-04-21 23:31:54.450 UTC [8896] LOG:  ExecHashIncreaseNumBatches ======= context stats end =======
2019-04-21 23:31:54.450 UTC [8896] STATEMENT:  explain analyze select * from reports.v_BusinessOperation;
TopMemoryContext: 120544 total in 7 blocks; 9760 free (7 chunks); 110784 used
  HandleParallelMessages: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
  TableSpace cache: 8192 total in 1 blocks; 2096 free (0 chunks); 6096 used
  Type information cache: 24352 total in 2 blocks; 2624 free (0 chunks); 21728 used
  pgstat TabStatusArray lookup hash table: 8192 total in 1 blocks; 416 free (0 chunks); 7776 used
  TopTransactionContext: 8192 total in 1 blocks; 6680 free (0 chunks); 1512 used
  RowDescriptionContext: 8192 total in 1 blocks; 6896 free (0 chunks); 1296 used
  MessageContext: 32768 total in 3 blocks; 13488 free (10 chunks); 19280 used
  Operator class cache: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used
  smgr relation table: 32768 total in 3 blocks; 16832 free (8 chunks); 15936 used
  TransactionAbortContext: 32768 total in 1 blocks; 32512 free (0 chunks); 256 used
  Portal hash: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used
  TopPortalContext: 8192 total in 1 blocks; 7664 free (0 chunks); 528 used
    PortalHoldContext: 24632 total in 2 blocks; 7392 free (0 chunks); 17240 used
    PortalContext: 1482752 total in 184 blocks; 11216 free (8 chunks); 1471536 used:
      ExecutorState: 647368 total in 10 blocks; 197536 free (13 chunks); 449832 used
        TupleSort main: 4219912 total in 23 blocks; 246792 free (39 chunks); 3973120 used
        TupleSort main: 4219912 total in 23 blocks; 246792 free (39 chunks); 3973120 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        HashBatchFiles: 524528 total in 63 blocks; 4416 free (2 chunks); 520112 used
        HashTableContext: 8192 total in 1 blocks; 7624 free (5 chunks); 568 used
          HashBatchContext: 4213640 total in 128 blocks; 7936 free (0 chunks); 4205704 used
        TupleSort main: 41016 total in 3 blocks; 6504 free (6 chunks); 34512 used
          Caller tuples: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ...
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
  Relcache by OID: 16384 total in 2 blocks; 3512 free (2 chunks); 12872 used
  CacheMemoryContext: 1101328 total in 14 blocks; 288672 free (1 chunks); 812656 used
    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_toast_2619_index
    index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: entity_id_fkidx
    ...
    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_attribute_relid_attnum_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_class_oid_index
  WAL record construction: 49768 total in 2 blocks; 6368 free (0 chunks); 43400 used
  PrivateRefCount: 8192 total in 1 blocks; 2624 free (0 chunks); 5568 used
  MdSmgr: 8192 total in 1 blocks; 6560 free (1 chunks); 1632 used
  LOCALLOCK hash: 16384 total in 2 blocks; 4600 free (2 chunks); 11784 used
  Timezones: 104120 total in 2 blocks; 2624 free (0 chunks); 101496 used
  ErrorContext: 8192 total in 1 blocks; 7936 free (5 chunks); 256 used
Grand total: 17629936 bytes in 698 blocks; 1450344 free (223 chunks); 16179592 used
2019-04-21 23:31:54.452 UTC [8896] WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 32 to 64
2019-04-21 23:31:54.452 UTC [8896] LOG:  ExecHashIncreaseNumBatches ======= context stats start =======
2019-04-21 23:31:54.452 UTC [8896] STATEMENT:  explain analyze select * from reports.v_BusinessOperation;
TopMemoryContext: 120544 total in 7 blocks; 9760 free (7 chunks); 110784 used
  HandleParallelMessages: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
  TableSpace cache: 8192 total in 1 blocks; 2096 free (0 chunks); 6096 used
  Type information cache: 24352 total in 2 blocks; 2624 free (0 chunks); 21728 used
  pgstat TabStatusArray lookup hash table: 8192 total in 1 blocks; 416 free (0 chunks); 7776 used
  TopTransactionContext: 8192 total in 1 blocks; 6680 free (0 chunks); 1512 used
  RowDescriptionContext: 8192 total in 1 blocks; 6896 free (0 chunks); 1296 used
  MessageContext: 32768 total in 3 blocks; 13488 free (10 chunks); 19280 used
  Operator class cache: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used
  smgr relation table: 32768 total in 3 blocks; 16832 free (8 chunks); 15936 used
  TransactionAbortContext: 32768 total in 1 blocks; 32512 free (0 chunks); 256 used
  Portal hash: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used
  TopPortalContext: 8192 total in 1 blocks; 7664 free (0 chunks); 528 used
    PortalHoldContext: 24632 total in 2 blocks; 7392 free (0 chunks); 17240 used
    PortalContext: 1482752 total in 184 blocks; 11216 free (8 chunks); 1471536 used:
      ExecutorState: 647368 total in 10 blocks; 197536 free (13 chunks); 449832 used
        TupleSort main: 4219912 total in 23 blocks; 246792 free (39 chunks); 3973120 used
        TupleSort main: 4219912 total in 23 blocks; 246792 free (39 chunks); 3973120 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        HashBatchFiles: 524528 total in 63 blocks; 4416 free (2 chunks); 520112 used
        HashTableContext: 8192 total in 1 blocks; 7624 free (5 chunks); 568 used
          HashBatchContext: 4279352 total in 130 blocks; 7936 free (0 chunks); 4271416 used
        TupleSort main: 41016 total in 3 blocks; 6504 free (6 chunks); 34512 used
          Caller tuples: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ...
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
  Relcache by OID: 16384 total in 2 blocks; 3512 free (2 chunks); 12872 used
  CacheMemoryContext: 1101328 total in 14 blocks; 288672 free (1 chunks); 812656 used
    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_toast_2619_index
    index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: entity_id_fkidx
    ...
    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_attribute_relid_attnum_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_class_oid_index
  WAL record construction: 49768 total in 2 blocks; 6368 free (0 chunks); 43400 used
  PrivateRefCount: 8192 total in 1 blocks; 2624 free (0 chunks); 5568 used
  MdSmgr: 8192 total in 1 blocks; 6560 free (1 chunks); 1632 used
  LOCALLOCK hash: 16384 total in 2 blocks; 4600 free (2 chunks); 11784 used
  Timezones: 104120 total in 2 blocks; 2624 free (0 chunks); 101496 used
  ErrorContext: 8192 total in 1 blocks; 7936 free (5 chunks); 256 used
Grand total: 17695648 bytes in 700 blocks; 1450344 free (223 chunks); 16245304 used
Hashjoin 0x1732b88: increasing nbatch to 64 because space = 4128826
Hashjoin 0x1732b88: freed 544 of 10584 tuples, space now 3916296
2019-04-21 23:31:54.456 UTC [8896] LOG:  ExecHashIncreaseNumBatches ======= context stats end =======
2019-04-21 23:31:54.456 UTC [8896] STATEMENT:  explain analyze select * from reports.v_BusinessOperation;
TopMemoryContext: 120544 total in 7 blocks; 8224 free (7 chunks); 112320 used
  HandleParallelMessages: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
  TableSpace cache: 8192 total in 1 blocks; 2096 free (0 chunks); 6096 used
  Type information cache: 24352 total in 2 blocks; 2624 free (0 chunks); 21728 used
  pgstat TabStatusArray lookup hash table: 8192 total in 1 blocks; 416 free (0 chunks); 7776 used
  TopTransactionContext: 8192 total in 1 blocks; 6680 free (0 chunks); 1512 used
  RowDescriptionContext: 8192 total in 1 blocks; 6896 free (0 chunks); 1296 used
  MessageContext: 32768 total in 3 blocks; 13488 free (10 chunks); 19280 used
  Operator class cache: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used
  smgr relation table: 32768 total in 3 blocks; 16832 free (8 chunks); 15936 used
  TransactionAbortContext: 32768 total in 1 blocks; 32512 free (0 chunks); 256 used
  Portal hash: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used
  TopPortalContext: 8192 total in 1 blocks; 7664 free (0 chunks); 528 used
    PortalHoldContext: 24632 total in 2 blocks; 7392 free (0 chunks); 17240 used
    PortalContext: 1482752 total in 184 blocks; 11216 free (8 chunks); 1471536 used:
      ExecutorState: 647368 total in 10 blocks; 197536 free (13 chunks); 449832 used
        TupleSort main: 4219912 total in 23 blocks; 246792 free (39 chunks); 3973120 used
        TupleSort main: 4219912 total in 23 blocks; 246792 free (39 chunks); 3973120 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        HashBatchFiles: 1057520 total in 127 blocks; 832 free (4 chunks); 1056688 used
        HashTableContext: 8192 total in 1 blocks; 7624 free (5 chunks); 568 used
          HashBatchContext: 4049360 total in 123 blocks; 7936 free (0 chunks); 4041424 used
        TupleSort main: 41016 total in 3 blocks; 6504 free (6 chunks); 34512 used
          Caller tuples: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ...
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
  Relcache by OID: 16384 total in 2 blocks; 3512 free (2 chunks); 12872 used
  CacheMemoryContext: 1101328 total in 14 blocks; 288672 free (1 chunks); 812656 used
    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_toast_2619_index
    index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: entity_id_fkidx
    ...
index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_attribute_relid_attnum_index index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_class_oid_index WAL record construction: 49768 total in 2 blocks; 6368 free (0 chunks); 43400 used PrivateRefCount: 8192 total in 1 blocks; 2624 free (0 chunks); 5568 used MdSmgr: 8192 total in 1 blocks; 6560 free (1 chunks); 1632 used LOCALLOCK hash: 16384 total in 2 blocks; 4600 free (2 chunks); 11784 used Timezones: 104120 total in 2 blocks; 2624 free (0 chunks); 101496 used ErrorContext: 8192 total in 1 blocks; 7936 free (5 chunks); 256 used Grand total: 17998648 bytes in 757 blocks; 1445224 free (225 chunks); 16553424 used 2019-04-21 23:31:54.459 UTC [8896] WARNING: ExecHashIncreaseNumBatches: increasing number of batches from 64 to 128 2019-04-21 23:31:54.459 UTC [8896] LOG: ExecHashIncreaseNumBatches ======= context stats start ======= 2019-04-21 23:31:54.459 UTC [8896] STATEMENT: explain analyze select * from reports.v_BusinessOperation; TopMemoryContext: 120544 total in 7 blocks; 8224 free (7 chunks); 112320 used HandleParallelMessages: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used TableSpace cache: 8192 total in 1 blocks; 2096 free (0 chunks); 6096 used Type information cache: 24352 total in 2 blocks; 2624 free (0 chunks); 21728 used pgstat TabStatusArray lookup hash table: 8192 total in 1 blocks; 416 free (0 chunks); 7776 used TopTransactionContext: 8192 total in 1 blocks; 6680 free (0 chunks); 1512 used RowDescriptionContext: 8192 total in 1 blocks; 6896 free (0 chunks); 1296 used MessageContext: 32768 total in 3 blocks; 13488 free (10 chunks); 19280 used Operator class cache: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used smgr relation table: 32768 total in 3 blocks; 16832 free (8 chunks); 15936 used TransactionAbortContext: 32768 total in 1 blocks; 32512 free (0 chunks); 256 used Portal hash: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used TopPortalContext: 8192 total in 1 blocks; 7664 free (0 chunks); 528 used PortalHoldContext: 24632 total in 2 blocks; 7392 free (0 chunks); 17240 used PortalContext: 1482752 total in 184 blocks; 11216 free (8 chunks); 1471536 used: ExecutorState: 647368 total in 10 blocks; 197536 free (13 chunks); 449832 used TupleSort main: 4219912 total in 23 blocks; 246792 free (39 chunks); 3973120 used TupleSort main: 4219912 total in 23 blocks; 246792 free (39 chunks); 3973120 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashBatchFiles: 1057520 total in 127 blocks; 832 free (4 chunks); 1056688 used HashTableContext: 8192 total in 1 blocks; 7624 free (5 chunks); 568 used HashBatchContext: 4279352 total in 130 blocks; 7936 free (0 chunks); 4271416 used TupleSort main: 41016 total in 3 blocks; 6504 free (6 chunks); 34512 used Caller tuples: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ... ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used Relcache by OID: 16384 total in 2 blocks; 3512 free (2 chunks); 12872 used CacheMemoryContext: 1101328 total in 14 blocks; 288672 free (1 chunks); 812656 used index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_toast_2619_index index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: entity_id_fkidx ...
index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_attribute_relid_attnum_index index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_class_oid_index WAL record construction: 49768 total in 2 blocks; 6368 free (0 chunks); 43400 used PrivateRefCount: 8192 total in 1 blocks; 2624 free (0 chunks); 5568 used MdSmgr: 8192 total in 1 blocks; 6560 free (1 chunks); 1632 used LOCALLOCK hash: 16384 total in 2 blocks; 4600 free (2 chunks); 11784 used Timezones: 104120 total in 2 blocks; 2624 free (0 chunks); 101496 used ErrorContext: 8192 total in 1 blocks; 7936 free (5 chunks); 256 used Grand total: 18228640 bytes in 764 blocks; 1445224 free (225 chunks); 16783416 used Hashjoin 0x1732b88: increasing nbatch to 128 because space = 4128846 Hashjoin 0x1732b88: freed 10419 of 10585 tuples, space now 65570 2019-04-21 23:31:54.466 UTC [8896] LOG: ExecHashIncreaseNumBatches ======= context stats end ======= 2019-04-21 23:31:54.466 UTC [8896] STATEMENT: explain analyze select * from reports.v_BusinessOperation; TopMemoryContext: 120544 total in 7 blocks; 6176 free (8 chunks); 114368 used HandleParallelMessages: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used TableSpace cache: 8192 total in 1 blocks; 2096 free (0 chunks); 6096 used Type information cache: 24352 total in 2 blocks; 2624 free (0 chunks); 21728 used pgstat TabStatusArray lookup hash table: 8192 total in 1 blocks; 416 free (0 chunks); 7776 used TopTransactionContext: 8192 total in 1 blocks; 6680 free (0 chunks); 1512 used RowDescriptionContext: 8192 total in 1 blocks; 6896 free (0 chunks); 1296 used MessageContext: 32768 total in 3 blocks; 13488 free (10 chunks); 19280 used Operator class cache: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used smgr relation table: 32768 total in 3 blocks; 16832 free (8 chunks); 15936 used TransactionAbortContext: 32768 total in 1 blocks; 32512 free (0 chunks); 256 used Portal hash: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used TopPortalContext: 8192 total in 1 blocks; 7664 free (0 chunks); 528 used PortalHoldContext: 24632 total in 2 blocks; 7392 free (0 chunks); 17240 used ...
I notice now that there are tons of these memory context dumps; tell me what you want to see and I will grep it out for you. I guess the Hashjoin-related things. OK, here is the last one before the out of memory, and then I will give you the grepped stuff ...
...
TopMemoryContext: 4347672 total in 9 blocks; 41608 free (18 chunks); 4306064 used HandleParallelMessages: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used TableSpace cache: 8192 total in 1 blocks; 2096 free (0 chunks); 6096 used Type information cache: 24352 total in 2 blocks; 2624 free (0 chunks); 21728 used pgstat TabStatusArray lookup hash table: 8192 total in 1 blocks; 416 free (0 chunks); 7776 used TopTransactionContext: 8192 total in 1 blocks; 5416 free (2 chunks); 2776 used RowDescriptionContext: 8192 total in 1 blocks; 6896 free (0 chunks); 1296 used MessageContext: 32768 total in 3 blocks; 13488 free (10 chunks); 19280 used Operator class cache: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used smgr relation table: 32768 total in 3 blocks; 16832 free (8 chunks); 15936 used TransactionAbortContext: 32768 total in 1 blocks; 32512 free (0 chunks); 256 used Portal hash: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used TopPortalContext: 8192 total in 1 blocks; 7664 free (0 chunks); 528 used PortalHoldContext: 24632 total in 2 blocks; 7392 free (0 chunks); 17240 used PortalContext: 1482752 total in 184 blocks; 11216 free (8 chunks); 1471536 used: ExecutorState: 2449896 total in 16 blocks; 1794968 free (3158 chunks); 654928 used HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashTableContext: 32768 total in 3 blocks; 17304 free (9 chunks); 15464 used HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashTableContext: 8192 total in 1 blocks; 7320 free (0 chunks); 872 used HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashTableContext: 8192 total in 1 blocks; 7320 free (0 chunks); 872 used HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashTableContext: 8192 total in 1 blocks; 7752 free (0 chunks); 440 used HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashTableContext: 8192 total in 1 blocks; 7624 free (0 chunks); 568 used HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used TupleSort main: 286912 total in 8 blocks; 246792 free (39 chunks); 40120 used TupleSort main: 286912 total in 8 blocks; 246792 free (39 chunks); 40120 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashBatchFiles: 2468537520 total in 293910 blocks; 2669512 free (14 chunks); 2465868008 used HashTableContext: 8192 total in 1 blocks; 7624 free (5 chunks); 568 used HashBatchContext: 4279352 total in 130 blocks; 7936 free (0 chunks); 4271416 used TupleSort main: 256272 total in 6 blocks; 36424 free (15 chunks); 219848 used Caller tuples: 2097152 total in 9 blocks; 929696 free (17 chunks); 1167456 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ... 
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used Relcache by OID: 16384 total in 2 blocks; 3512 free (2 chunks); 12872 used CacheMemoryContext: 1101328 total in 14 blocks; 288672 free (1 chunks); 812656 used index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_toast_2619_index index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: entity_id_fkidx ... index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_attribute_relid_attnum_index index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_class_oid_index WAL record construction: 49768 total in 2 blocks; 6368 free (0 chunks); 43400 used PrivateRefCount: 8192 total in 1 blocks; 2624 free (0 chunks); 5568 used MdSmgr: 8192 total in 1 blocks; 6560 free (1 chunks); 1632 used LOCALLOCK hash: 16384 total in 2 blocks; 4600 free (2 chunks); 11784 used Timezones: 104120 total in 2 blocks; 2624 free (0 chunks); 101496 used ErrorContext: 8192 total in 1 blocks; 7936 free (5 chunks); 256 used Grand total: 2486488160 bytes in 294559 blocks; 6838088 free (3440 chunks); 2479650072 used 2019-04-21 23:40:23.118 UTC [8896] ERROR: out of memory 2019-04-21 23:40:23.118 UTC [8896] DETAIL: Failed on request of size 8272 in memory context "HashBatchFiles". 2019-04-21 23:40:23.118 UTC [8896] STATEMENT: explain analyze select * from reports.v_BusinessOperation; 2019-04-21 23:40:23.119 UTC [8896] LOG: could not open directory "base/pgsql_tmp/pgsql_tmp8896.2.sharedfileset": Cannot allocate memory ok now here comes a summary grepped out: grep 'Hash' Hashjoin 0x1732b88: initial nbatch = 16, nbuckets = 8192 2019-04-21 23:31:54.447 UTC [8896] WARNING: ExecHashIncreaseNumBatches: increasing number of batches from 16 to 32 2019-04-21 23:31:54.447 UTC [8896] LOG: ExecHashIncreaseNumBatches ======= context stats start ======= HashBatchFiles: 258032 total in 31 blocks; 6208 free (0 chunks); 251824 used HashTableContext: 8192 total in 1 blocks; 7624 free (5 chunks); 568 used HashBatchContext: 4279352 total in 130 blocks; 7936 free (0 chunks); 4271416 used Hashjoin 0x1732b88: increasing nbatch to 32 because space = 4128933 Hashjoin 0x1732b88: freed 148 of 10584 tuples, space now 4071106 2019-04-21 23:31:54.450 UTC [8896] LOG: ExecHashIncreaseNumBatches ======= context stats end ======= HashBatchFiles: 524528 total in 63 blocks; 4416 free (2 chunks); 520112 used HashTableContext: 8192 total in 1 blocks; 7624 free (5 chunks); 568 used HashBatchContext: 4213640 total in 128 blocks; 7936 free (0 chunks); 4205704 used 2019-04-21 23:31:54.452 UTC [8896] WARNING: ExecHashIncreaseNumBatches: increasing number of batches from 32 to 64 2019-04-21 23:31:54.452 UTC [8896] LOG: ExecHashIncreaseNumBatches ======= context stats start ======= HashBatchFiles: 524528 total in 63 blocks; 4416 free (2 chunks); 520112 used HashTableContext: 8192 total in 1 blocks; 7624 free (5 chunks); 568 used HashBatchContext: 4279352 total in 130 blocks; 7936 free (0 chunks); 4271416 used Hashjoin 0x1732b88: increasing nbatch to 64 because space = 4128826 Hashjoin 0x1732b88: freed 544 of 10584 tuples, space now 3916296 2019-04-21 23:31:54.456 UTC [8896] LOG: ExecHashIncreaseNumBatches ======= context stats end ======= HashBatchFiles: 1057520 total in 127 blocks; 832 free (4 chunks); 1056688 used HashTableContext: 8192 total in 1 blocks; 7624 free (5 chunks); 568 used HashBatchContext: 4049360 total in 123 blocks; 7936 free (0 chunks); 4041424 used 2019-04-21 
23:31:54.459 UTC [8896] WARNING: ExecHashIncreaseNumBatches: increasing number of batches from 64 to 128 2019-04-21 23:31:54.459 UTC [8896] LOG: ExecHashIncreaseNumBatches ======= context stats start ======= HashBatchFiles: 1057520 total in 127 blocks; 832 free (4 chunks); 1056688 used HashTableContext: 8192 total in 1 blocks; 7624 free (5 chunks); 568 used HashBatchContext: 4279352 total in 130 blocks; 7936 free (0 chunks); 4271416 used Hashjoin 0x1732b88: increasing nbatch to 128 because space = 4128846 Hashjoin 0x1732b88: freed 10419 of 10585 tuples, space now 65570 2019-04-21 23:31:54.466 UTC [8896] LOG: ExecHashIncreaseNumBatches ======= context stats end ======= HashBatchFiles: 2148080 total in 257 blocks; 18160 free (6 chunks); 2129920 used HashTableContext: 8192 total in 1 blocks; 7624 free (5 chunks); 568 used HashBatchContext: 172352 total in 5 blocks; 7936 free (0 chunks); 164416 used 2019-04-21 23:32:07.174 UTC [8896] WARNING: ExecHashIncreaseNumBatches: increasing number of batches from 128 to 256 2019-04-21 23:32:07.174 UTC [8896] LOG: ExecHashIncreaseNumBatches ======= context stats start ======= HashBatchFiles: 2148080 total in 257 blocks; 18160 free (6 chunks); 2129920 used HashTableContext: 8192 total in 1 blocks; 7624 free (5 chunks); 568 used HashBatchContext: 4312208 total in 131 blocks; 7936 free (0 chunks); 4304272 used Hashjoin 0x1732b88: increasing nbatch to 256 because space = 4128829 Hashjoin 0x1732b88: freed 10308 of 10734 tuples, space now 161815 2019-04-21 23:32:07.183 UTC [8896] LOG: ExecHashIncreaseNumBatches ======= context stats end ======= HashBatchFiles: 4312816 total in 514 blocks; 36552 free (8 chunks); 4276264 used HashTableContext: 8192 total in 1 blocks; 7624 free (5 chunks); 568 used HashBatchContext: 270920 total in 8 blocks; 7936 free (0 chunks); 262984 used 2019-04-21 23:32:18.865 UTC [8896] WARNING: ExecHashIncreaseNumBatches: increasing number of batches from 256 to 512 2019-04-21 23:32:18.865 UTC [8896] LOG: ExecHashIncreaseNumBatches ======= context stats start ======= HashBatchFiles: 4312816 total in 514 blocks; 36552 free (8 chunks); 4276264 used HashTableContext: 8192 total in 1 blocks; 7624 free (5 chunks); 568 used HashBatchContext: 4279352 total in 130 blocks; 7936 free (0 chunks); 4271416 used Hashjoin 0x1732b88: increasing nbatch to 512 because space = 4128908 Hashjoin 0x1732b88: freed 398 of 10379 tuples, space now 3977787 2019-04-21 23:32:18.877 UTC [8896] LOG: ExecHashIncreaseNumBatches ======= context stats end ======= HashBatchFiles: 8642288 total in 1027 blocks; 73376 free (10 chunks); 8568912 used HashTableContext: 8192 total in 1 blocks; 7624 free (5 chunks); 568 used HashBatchContext: 4147928 total in 126 blocks; 7936 free (0 chunks); 4139992 used 2019-04-21 23:32:18.880 UTC [8896] WARNING: ExecHashIncreaseNumBatches: increasing number of batches from 512 to 1024 2019-04-21 23:32:18.880 UTC [8896] LOG: ExecHashIncreaseNumBatches ======= context stats start ======= HashBatchFiles: 8642288 total in 1027 blocks; 73376 free (10 chunks); 8568912 used HashTableContext: 8192 total in 1 blocks; 7624 free (5 chunks); 568 used HashBatchContext: 4279352 total in 130 blocks; 7936 free (0 chunks); 4271416 used Hashjoin 0x1732b88: increasing nbatch to 1024 because space = 4129008 Hashjoin 0x1732b88: freed 296 of 10360 tuples, space now 4013423 2019-04-21 23:32:18.903 UTC [8896] LOG: ExecHashIncreaseNumBatches ======= context stats end ======= HashBatchFiles: 17301232 total in 2052 blocks; 147064 free (12 chunks); 17154168 used 
HashTableContext: 8192 total in 1 blocks; 7624 free (5 chunks); 568 used HashBatchContext: 4180784 total in 127 blocks; 7936 free (0 chunks); 4172848 used 2019-04-21 23:32:18.906 UTC [8896] WARNING: ExecHashIncreaseNumBatches: increasing number of batches from 1024 to 2048 2019-04-21 23:32:18.906 UTC [8896] LOG: ExecHashIncreaseNumBatches ======= context stats start ======= HashBatchFiles: 17301232 total in 2052 blocks; 147064 free (12 chunks); 17154168 used HashTableContext: 8192 total in 1 blocks; 7624 free (5 chunks); 568 used HashBatchContext: 4279352 total in 130 blocks; 7936 free (0 chunks); 4271416 used Hashjoin 0x1732b88: increasing nbatch to 2048 because space = 4129133 Hashjoin 0x1732b88: freed 154 of 10354 tuples, space now 4068786 2019-04-21 23:32:18.946 UTC [8896] LOG: ExecHashIncreaseNumBatches ======= context stats end ======= HashBatchFiles: 34389856 total in 4102 blocks; 65176 free (14 chunks); 34324680 used HashTableContext: 8192 total in 1 blocks; 7624 free (5 chunks); 568 used HashBatchContext: 4213640 total in 128 blocks; 7936 free (0 chunks); 4205704 used 2019-04-21 23:32:18.949 UTC [8896] WARNING: ExecHashIncreaseNumBatches: increasing number of batches from 2048 to 4096 2019-04-21 23:32:18.949 UTC [8896] LOG: ExecHashIncreaseNumBatches ======= context stats start ======= HashBatchFiles: 34389856 total in 4102 blocks; 65176 free (14 chunks); 34324680 used HashTableContext: 8192 total in 1 blocks; 7624 free (5 chunks); 568 used HashBatchContext: 4279352 total in 130 blocks; 7936 free (0 chunks); 4271416 used Hashjoin 0x1732b88: increasing nbatch to 4096 because space = 4129035 Hashjoin 0x1732b88: freed 10167 of 10351 tuples, space now 72849 2019-04-21 23:32:19.032 UTC [8896] LOG: ExecHashIncreaseNumBatches ======= context stats end ======= HashBatchFiles: 68796256 total in 8199 blocks; 130672 free (14 chunks); 68665584 used HashTableContext: 8192 total in 1 blocks; 7624 free (5 chunks); 568 used HashBatchContext: 172352 total in 5 blocks; 7936 free (0 chunks); 164416 used Hashjoin 0x1b3d4c8: initial nbatch = 1, nbuckets = 1024 Hashjoin 0x1b3bc08: initial nbatch = 1, nbuckets = 1024 Hashjoin 0x1b3b5d8: initial nbatch = 1, nbuckets = 1024 Hashjoin 0x15499a8: initial nbatch = 1, nbuckets = 1024 Hashjoin 0x1549978: initial nbatch = 1, nbuckets = 1024 Hashjoin 0x1b3cc88: initial nbatch = 1, nbuckets = 65536 Hashjoin 0x1553638: initial nbatch = 1, nbuckets = 65536 Hashjoin 0x1553638: initial nbatch = 1, nbuckets = 65536 Hashjoin 0x1553d38: initial nbatch = 16, nbuckets = 65536 Hashjoin 0x1b3ad98: initial nbatch = 16, nbuckets = 65536 Hashjoin 0x15538d8: initial nbatch = 16, nbuckets = 65536 2019-04-21 23:40:06.495 UTC [8896] WARNING: ExecHashIncreaseNumBatches: increasing number of batches from 4096 to 8192 2019-04-21 23:40:06.495 UTC [8896] LOG: ExecHashIncreaseNumBatches ======= context stats start ======= HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashTableContext: 32768 total in 3 blocks; 17304 free (9 chunks); 15464 used HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashTableContext: 8192 total in 1 blocks; 7320 free (0 chunks); 872 used HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashTableContext: 8192 total in 1 blocks; 7320 free (0 chunks); 872 used HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashBatchFiles: 8192 
total in 1 blocks; 7936 free (0 chunks); 256 used HashTableContext: 8192 total in 1 blocks; 7752 free (0 chunks); 440 used HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashTableContext: 8192 total in 1 blocks; 7624 free (0 chunks); 568 used HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used HashBatchFiles: 68663008 total in 8183 blocks; 131440 free (46 chunks); 68531568 used HashTableContext: 8192 total in 1 blocks; 7624 free (5 chunks); 568 used HashBatchContext: 4279352 total in 130 blocks; 7936 free (0 chunks); 4271416 used Hashjoin 0x1732b88: increasing nbatch to 8192 because space = 4128997 Hashjoin 0x1732b88: freed 10555 of 10560 tuples, space now 1983 2019-04-21 23:40:06.680 UTC [8896] LOG: ExecHashIncreaseNumBatches ======= context stats end ======= HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashTableContext: 32768 total in 3 blocks; 17304 free (9 chunks); 15464 used HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashTableContext: 8192 total in 1 blocks; 7320 free (0 chunks); 872 used HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashTableContext: 8192 total in 1 blocks; 7320 free (0 chunks); 872 used HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashTableContext: 8192 total in 1 blocks; 7752 free (0 chunks); 440 used HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashTableContext: 8192 total in 1 blocks; 7624 free (0 chunks); 568 used HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used HashBatchFiles: 137609056 total in 16392 blocks; 261704 free (14 chunks); 137347352 used HashTableContext: 8192 total in 1 blocks; 7624 free (5 chunks); 568 used HashBatchContext: 106640 total in 3 blocks; 7936 free (0 chunks); 98704 used 2019-04-21 23:40:06.883 UTC [8896] WARNING: ExecHashIncreaseNumBatches: increasing number of batches from 8192 to 16384 2019-04-21 23:40:06.883 UTC [8896] LOG: ExecHashIncreaseNumBatches ======= context stats start ======= HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashTableContext: 32768 total in 3 blocks; 17304 free (9 chunks); 15464 used HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashTableContext: 8192 total in 1 blocks; 7320 free (0 chunks); 872 used HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashTableContext: 8192 total in 1 blocks; 7320 free (0 chunks); 872 used HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashTableContext: 8192 total in 1 blocks; 7752 free (0 chunks); 440 used HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashTableContext: 8192 total in 1 blocks; 7624 free (0 chunks); 568 used HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used 
HashBatchFiles: 137592400 total in 16390 blocks; 261800 free (18 chunks); 137330600 used HashTableContext: 8192 total in 1 blocks; 7624 free (5 chunks); 568 used HashBatchContext: 4246496 total in 129 blocks; 7936 free (0 chunks); 4238560 used Hashjoin 0x1732b88: increasing nbatch to 16384 because space = 4128957 Hashjoin 0x1732b88: freed 10697 of 10764 tuples, space now 25956 2019-04-21 23:40:07.268 UTC [8896] LOG: ExecHashIncreaseNumBatches ======= context stats end ======= HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashTableContext: 32768 total in 3 blocks; 17304 free (9 chunks); 15464 used HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashTableContext: 8192 total in 1 blocks; 7320 free (0 chunks); 872 used HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashTableContext: 8192 total in 1 blocks; 7320 free (0 chunks); 872 used HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashTableContext: 8192 total in 1 blocks; 7752 free (0 chunks); 440 used HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashTableContext: 8192 total in 1 blocks; 7624 free (0 chunks); 568 used HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used HashBatchFiles: 275234656 total in 32777 blocks; 523808 free (14 chunks); 274710848 used HashTableContext: 8192 total in 1 blocks; 7624 free (5 chunks); 568 used HashBatchContext: 106640 total in 3 blocks; 7936 free (0 chunks); 98704 used 2019-04-21 23:40:09.096 UTC [8896] WARNING: ExecHashIncreaseNumBatches: increasing number of batches from 16384 to 32768 2019-04-21 23:40:09.096 UTC [8896] LOG: ExecHashIncreaseNumBatches ======= context stats start ======= HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashTableContext: 32768 total in 3 blocks; 17304 free (9 chunks); 15464 used HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashTableContext: 8192 total in 1 blocks; 7320 free (0 chunks); 872 used HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashTableContext: 8192 total in 1 blocks; 7320 free (0 chunks); 872 used HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashTableContext: 8192 total in 1 blocks; 7752 free (0 chunks); 440 used HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashTableContext: 8192 total in 1 blocks; 7624 free (0 chunks); 568 used HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used HashBatchFiles: 275018128 total in 32751 blocks; 525056 free (66 chunks); 274493072 used HashTableContext: 8192 total in 1 blocks; 7624 free (5 chunks); 568 used HashBatchContext: 4279352 total in 130 blocks; 7936 free (0 chunks); 4271416 used Hashjoin 0x1732b88: increasing nbatch to 32768 because space = 4128890 Hashjoin 0x1732b88: freed 8 of 10809 tuples, space now 4125769 2019-04-21 23:40:10.050 UTC [8896] 
LOG: ExecHashIncreaseNumBatches ======= context stats end ======= HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashTableContext: 32768 total in 3 blocks; 17304 free (9 chunks); 15464 used HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashTableContext: 8192 total in 1 blocks; 7320 free (0 chunks); 872 used HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashTableContext: 8192 total in 1 blocks; 7320 free (0 chunks); 872 used HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashTableContext: 8192 total in 1 blocks; 7752 free (0 chunks); 440 used HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashTableContext: 8192 total in 1 blocks; 7624 free (0 chunks); 568 used HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used HashBatchFiles: 550485856 total in 65546 blocks; 1048056 free (14 chunks); 549437800 used HashTableContext: 8192 total in 1 blocks; 7624 free (5 chunks); 568 used HashBatchContext: 4279352 total in 130 blocks; 7936 free (0 chunks); 4271416 used 2019-04-21 23:40:10.060 UTC [8896] WARNING: ExecHashIncreaseNumBatches: increasing number of batches from 32768 to 65536 2019-04-21 23:40:10.060 UTC [8896] LOG: ExecHashIncreaseNumBatches ======= context stats start ======= HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashTableContext: 32768 total in 3 blocks; 17304 free (9 chunks); 15464 used HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashTableContext: 8192 total in 1 blocks; 7320 free (0 chunks); 872 used HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashTableContext: 8192 total in 1 blocks; 7320 free (0 chunks); 872 used HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashTableContext: 8192 total in 1 blocks; 7752 free (0 chunks); 440 used HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashTableContext: 8192 total in 1 blocks; 7624 free (0 chunks); 568 used HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used HashBatchFiles: 550485856 total in 65546 blocks; 1048056 free (14 chunks); 549437800 used HashTableContext: 8192 total in 1 blocks; 7624 free (5 chunks); 568 used HashBatchContext: 4279352 total in 130 blocks; 7936 free (0 chunks); 4271416 used Hashjoin 0x1732b88: increasing nbatch to 65536 because space = 4128825 Hashjoin 0x1732b88: freed 20 of 10809 tuples, space now 4121380 2019-04-21 23:40:12.686 UTC [8896] LOG: ExecHashIncreaseNumBatches ======= context stats end ======= HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashTableContext: 32768 total in 3 blocks; 17304 free (9 chunks); 15464 used HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashTableContext: 8192 total in 1 blocks; 7320 free 
(0 chunks); 872 used HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashTableContext: 8192 total in 1 blocks; 7320 free (0 chunks); 872 used HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashTableContext: 8192 total in 1 blocks; 7752 free (0 chunks); 440 used HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashTableContext: 8192 total in 1 blocks; 7624 free (0 chunks); 568 used HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used HashBatchFiles: 1100988256 total in 131083 blocks; 2096592 free (14 chunks); 1098891664 used HashTableContext: 8192 total in 1 blocks; 7624 free (5 chunks); 568 used HashBatchContext: 4246496 total in 129 blocks; 7936 free (0 chunks); 4238560 used 2019-04-21 23:40:12.703 UTC [8896] WARNING: ExecHashIncreaseNumBatches: increasing number of batches from 65536 to 131072 2019-04-21 23:40:12.703 UTC [8896] LOG: ExecHashIncreaseNumBatches ======= context stats start ======= HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashTableContext: 32768 total in 3 blocks; 17304 free (9 chunks); 15464 used HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashTableContext: 8192 total in 1 blocks; 7320 free (0 chunks); 872 used HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashTableContext: 8192 total in 1 blocks; 7320 free (0 chunks); 872 used HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashTableContext: 8192 total in 1 blocks; 7752 free (0 chunks); 440 used HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashTableContext: 8192 total in 1 blocks; 7624 free (0 chunks); 568 used HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used HashBatchFiles: 1100988256 total in 131083 blocks; 2096592 free (14 chunks); 1098891664 used HashTableContext: 8192 total in 1 blocks; 7624 free (5 chunks); 568 used HashBatchContext: 4279352 total in 130 blocks; 7936 free (0 chunks); 4271416 used Hashjoin 0x1732b88: increasing nbatch to 131072 because space = 4129020 Hashjoin 0x1732b88: freed 2 of 10809 tuples, space now 4128291 2019-04-21 23:40:20.571 UTC [8896] LOG: ExecHashIncreaseNumBatches ======= context stats end ======= HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashTableContext: 32768 total in 3 blocks; 17304 free (9 chunks); 15464 used HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashTableContext: 8192 total in 1 blocks; 7320 free (0 chunks); 872 used HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashTableContext: 8192 total in 1 blocks; 7320 free (0 chunks); 872 used HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashTableContext: 
8192 total in 1 blocks; 7752 free (0 chunks); 440 used HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashTableContext: 8192 total in 1 blocks; 7624 free (0 chunks); 568 used HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used HashBatchFiles: 2201993056 total in 262156 blocks; 4193704 free (14 chunks); 2197799352 used HashTableContext: 8192 total in 1 blocks; 7624 free (5 chunks); 568 used HashBatchContext: 4279352 total in 130 blocks; 7936 free (0 chunks); 4271416 used 2019-04-21 23:40:20.602 UTC [8896] WARNING: ExecHashIncreaseNumBatches: increasing number of batches from 131072 to 262144 2019-04-21 23:40:20.602 UTC [8896] LOG: ExecHashIncreaseNumBatches ======= context stats start ======= HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashTableContext: 32768 total in 3 blocks; 17304 free (9 chunks); 15464 used HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashTableContext: 8192 total in 1 blocks; 7320 free (0 chunks); 872 used HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashTableContext: 8192 total in 1 blocks; 7320 free (0 chunks); 872 used HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashTableContext: 8192 total in 1 blocks; 7752 free (0 chunks); 440 used HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashTableContext: 8192 total in 1 blocks; 7624 free (0 chunks); 568 used HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used HashBatchFiles: 2201993056 total in 262156 blocks; 4193704 free (14 chunks); 2197799352 used HashTableContext: 8192 total in 1 blocks; 7624 free (5 chunks); 568 used HashBatchContext: 4279352 total in 130 blocks; 7936 free (0 chunks); 4271416 used Hashjoin 0x1732b88: increasing nbatch to 262144 because space = 4129055 HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashTableContext: 32768 total in 3 blocks; 17304 free (9 chunks); 15464 used HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashTableContext: 8192 total in 1 blocks; 7320 free (0 chunks); 872 used HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashTableContext: 8192 total in 1 blocks; 7320 free (0 chunks); 872 used HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashTableContext: 8192 total in 1 blocks; 7752 free (0 chunks); 440 used HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashTableContext: 8192 total in 1 blocks; 7624 free (0 chunks); 568 used HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used HashBatchFiles: 2468537520 total in 293910 blocks; 2669512 free (14 chunks); 2465868008 used HashTableContext: 8192 total in 1 blocks; 7624 free (5 chunks); 568 used HashBatchContext: 4279352 total in 130 blocks; 
7936 free (0 chunks); 4271416 used
2019-04-21 23:40:23.118 UTC [8896] DETAIL: Failed on request of size 8272 in memory context "HashBatchFiles".
-Gunther
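That grep tells the story: the HashBatchFiles context roughly doubles every time nbatch doubles, which is exactly what one 8 kB buffer per open batch file would predict. Here is a quick back-of-the-envelope sketch of the upper bound; the per-batch cost of one inner and one outer BufFile, each buffering a single 8 kB block, is my assumption about the stock code, not something the log itself states:

#include <stdio.h>

/*
 * Upper bound on hash-join batch-file buffer memory: up to one inner
 * and one outer BufFile per batch, each holding an 8 kB buffer.
 * (Assumption; actual usage depends on how many files are open.)
 */
int main(void)
{
    const long buf = 8192;
    long nbatch;

    for (nbatch = 16; nbatch <= 262144; nbatch *= 2)
        printf("nbatch = %6ld -> up to %10ld bytes (%7.1f MB) in buffers\n",
               nbatch, 2 * nbatch * buf, 2 * nbatch * buf / (1024.0 * 1024.0));
    return 0;
}

At the final 262144-way split that bound is 4 GB, and the logged HashBatchFiles total of 2,468,537,520 bytes sits comfortably inside it: the hash table itself never left its ~4 MB budget while the batch-file buffers grew into the gigabytes.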
$ mv src/include/executor/hashjoin.h.orig src/include/executor/hashjoin.h
$ mv src/backend/executor/nodeHash.c.orig src/backend/executor/nodeHash.c
$ mv src/backend/executor/nodeHashjoin.c.orig src/backend/executor/nodeHashjoin.c
$ patch -p1 <../limit-hash-nbatches-v2.patch
patching file src/backend/executor/nodeHash.c
Hunk #1 succeeded at 570 (offset -3 lines).
Hunk #2 succeeded at 917 (offset -3 lines).
Hunk #3 succeeded at 930 (offset -3 lines).
Hunk #4 succeeded at 1037 (offset -3 lines).
Hunk #5 succeeded at 1658 (offset -4 lines).
$ make
$ make install
$ pg_ctl -c restart
and go ...
Lots of CPU use again, but very limited memory use so far.
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
11054 postgres  20   0 1302772  90316  58004 R  94.4  1.1   4:38.05 postgres: postgres integrator [local] EXPLAIN
11055 postgres  20   0 1280532  68076  57168 R  97.7  0.9   2:03.54 postgres: parallel worker for PID 11054
11056 postgres  20   0 1280532  67964  57124 S   0.0  0.9   2:08.28 postgres: parallel worker for PID 11054
That's a pretty decent sign so far. A slight increase, but still relatively steady:
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
11054 postgres  20   0 1379704 167140  58004 R  95.0  2.1   5:56.28 postgres: postgres integrator [local] EXPLAIN
11055 postgres  20   0 1280532  68076  57168 S  25.6  0.9   2:36.59 postgres: parallel worker for PID 11054
11056 postgres  20   0 1280532  67964  57124 R  61.8  0.9   2:29.65 postgres: parallel worker for PID 11054
aaand break out ...
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
11119 postgres  20   0 1271660   1.0g   1.0g D   0.0 13.4   0:03.10 postgres: parallel worker for PID 11054
11054 postgres  20   0 1380940   1.0g 950508 D   0.0 13.4   6:56.09 postgres: postgres integrator [local] EXPLAIN
11118 postgres  20   0 1271660 884540 882724 D   0.0 11.2   0:02.84 postgres: parallel worker for PID 11054
and crash:
foo=# explain analyze select * from reports.v_BusinessOperation;
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!>
What happened? Ouch, no space left on the root device. Too much logging? Maybe the core dumps ... The log file content is simple:
2019-04-22 00:07:56.104 UTC [11048] LOG: database system was shut down at 2019-04-22 00:07:55 UTC
2019-04-22 00:07:56.108 UTC [11046] LOG: database system is ready to accept connections
Hashjoin 0x2122458: initial nbatch = 16, nbuckets = 8192
Hashjoin 0x2122458: increasing nbatch to 32 because space = 4128933
Hashjoin 0x2122458: freed 148 of 10584 tuples, space now 4071106
Hashjoin 0x2122458: increasing nbatch to 64 because space = 4128826
Hashjoin 0x2122458: freed 544 of 10584 tuples, space now 3916296
Hashjoin 0x2122458: increasing nbatch to 128 because space = 4128846
Hashjoin 0x2122458: freed 10419 of 10585 tuples, space now 65570
Hashjoin 0x2122458: increasing nbatch to 256 because space = 4128829
Hashjoin 0x2122458: freed 10308 of 10734 tuples, space now 161815
Hashjoin 0x2122458: increasing nbatch to 512 because space = 4128908
Hashjoin 0x2122458: freed 398 of 10379 tuples, space now 3977787
Hashjoin 0x3ac9918: initial nbatch = 1, nbuckets = 1024
Hashjoin 0x3ac91a8: initial nbatch = 1, nbuckets = 1024
Hashjoin 0x3ac93c8: initial nbatch = 1, nbuckets = 1024
Hashjoin 0x1f41018: initial nbatch = 1, nbuckets = 1024
Hashjoin 0x1f41048: initial nbatch = 1, nbuckets = 1024
2019-04-22 00:16:55.273 UTC [11046] LOG: background worker "parallel worker" (PID 11119) was terminated by signal 11: Segmentation fault
2019-04-22 00:16:55.273 UTC [11046] DETAIL: Failed process was running: explain analyze select * from reports.v_BusinessOperation;
2019-04-22 00:16:55.273 UTC [11046] LOG: terminating any other active server processes
2019-04-22 00:16:55.274 UTC [11058] WARNING: terminating connection because of crash of another server process
2019-04-22 00:16:55.274 UTC [11058] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2019-04-22 00:16:55.274 UTC [11058] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2019-04-22 00:16:55.277 UTC [11052] LOG: could not write temporary statistics file "pg_stat/db_16903.tmp": No space left on device
2019-04-22 00:16:55.278 UTC [11052] LOG: could not close temporary statistics file "pg_stat/db_0.tmp": No space left on device
2019-04-22 00:16:55.278 UTC [11052] LOG: could not close temporary statistics file "pg_stat/global.tmp": No space left on device
2019-04-22 00:16:55.315 UTC [11046] LOG: all server processes terminated; reinitializing
2019-04-22 00:16:55.425 UTC [11123] LOG: database system was interrupted; last known up at 2019-04-22 00:12:56 UTC
2019-04-22 00:16:55.426 UTC [11124] FATAL: the database system is in recovery mode
2019-04-22 00:16:55.545 UTC [11123] LOG: database system was not properly shut down; automatic recovery in progress
2019-04-22 00:16:55.549 UTC [11123] LOG: redo starts at 3D2/C44FDCF8
OK, it is all because it dumped 3 core files. Glad that I captured the top lines of the backend and its 2 workers:
-rw------- 1 postgres postgres 1075843072 Apr 22 00:16 core.11054 -- backend
-rw------- 1 postgres postgres 894640128 Apr 22 00:16 core.11118 -- worker 1
-rw------- 1 postgres postgres 1079103488 Apr 22 00:16 core.11119 -- worker 2
And the meltdown starts with the "parallel worker" (PID 11119) receiving SIGSEGV.
So let's get gdb to the task to see what's up:
$ gdb -c data/core.11119 postgresql-11.2/src/backend/postgres
...
Reading symbols from postgresql-11.2/src/backend/postgres...done.
BFD: Warning: /var/lib/pgsql/data/core.11119 is truncated: expected core file size >= 1127112704, found: 1079103488.
[New LWP 11119]
Cannot access memory at address 0x7ff8d25dc108
Cannot access memory at address 0x7ff8d25dc100
Failed to read a valid object file image from memory.
Core was generated by `postgres: parallel worker for'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00000000006bd792 in ExecParallelHashJoinNewBatch (
    hjstate=<error reading variable: Cannot access memory at address 0x7ffd45fa9c38>)
    at nodeHashjoin.c:1127
1127    {
(gdb) bt 8
#0  0x00000000006bd792 in ExecParallelHashJoinNewBatch (
    hjstate=<error reading variable: Cannot access memory at address 0x7ffd45fa9c38>)
    at nodeHashjoin.c:1127
Backtrace stopped: Cannot access memory at address 0x7ffd45fa9c88
(gdb) info frame
Stack level 0, frame at 0x7ffd45fa9c90:
 rip = 0x6bd792 in ExecParallelHashJoinNewBatch (nodeHashjoin.c:1127); saved rip = <not saved>
 Outermost frame: Cannot access memory at address 0x7ffd45fa9c88
 source language c.
 Arglist at 0x7ffd45fa9c80, args: hjstate=<error reading variable: Cannot access memory at address 0x7ffd45fa9c38>
 Locals at 0x7ffd45fa9c80, Previous frame's sp is 0x7ffd45fa9c90
Cannot access memory at address 0x7ffd45fa9c80
(gdb) list
1122            SharedTuplestoreAccessor *inner_tuples;
1123            Barrier    *batch_barrier =
1124                    &hashtable->batches[batchno].shared->batch_barrier;
1125
1126            switch (BarrierAttach(batch_barrier))
1127            {
1128                    case PHJ_BATCH_ELECTING:
1129
1130                            /* One backend allocates the hash table. */
1131                            if (BarrierArriveAndWait(batch_barrier,
unfortunately this core file is truncated because of the file system running out of space. Let's see the others.
$ gdb -c data/core.11118 postgresql-11.2/src/backend/postgres
...
Reading symbols from postgresql-11.2/src/backend/postgres...done.
BFD: Warning: /var/lib/pgsql/data/core.11118 is truncated: expected core file size >= 1127112704, found: 894640128.
[New LWP 11118]
Cannot access memory at address 0x7ff8d25dc108
Cannot access memory at address 0x7ff8d25dc100
Failed to read a valid object file image from memory.
Core was generated by `postgres: parallel worker for'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00000000006bd792 in ExecParallelHashJoinNewBatch (
    hjstate=<error reading variable: Cannot access memory at address 0x7ffd45fa9c38>)
    at nodeHashjoin.c:1127
1127    {
(gdb) bt 5
#0  0x00000000006bd792 in ExecParallelHashJoinNewBatch (
    hjstate=<error reading variable: Cannot access memory at address 0x7ffd45fa9c38>)
    at nodeHashjoin.c:1127
Backtrace stopped: Cannot access memory at address 0x7ffd45fa9c88
(gdb) list
1122            SharedTuplestoreAccessor *inner_tuples;
1123            Barrier    *batch_barrier =
1124                    &hashtable->batches[batchno].shared->batch_barrier;
1125
1126            switch (BarrierAttach(batch_barrier))
1127            {
1128                    case PHJ_BATCH_ELECTING:
1129
1130                            /* One backend allocates the hash table. */
1131                            if (BarrierArriveAndWait(batch_barrier,
Strange, that one must have died very similarly: same place, and also truncated.
$ gdb -c data/core.11054 postgresql-11.2/src/backend/postgres
...
Reading symbols from postgresql-11.2/src/backend/postgres...done.
BFD: Warning: /var/lib/pgsql/data/core.11054 is truncated: expected core file size >= 1238786048, found: 1075843072.
[New LWP 11054]
Cannot access memory at address 0x7ff8d25dc108
Cannot access memory at address 0x7ff8d25dc100
Failed to read a valid object file image from memory.
Core was generated by `postgres: postgres integrator'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00000000006bd792 in ExecParallelHashJoinNewBatch (
    hjstate=<error reading variable: Cannot access memory at address 0x7ffd45fa9498>)
    at nodeHashjoin.c:1127
1127    {
(gdb)
I don't understand why all of them are at the same location. Doesn't make any sense to me.
But I'll leave it at that right now.
-Gunther
On Sun, Apr 21, 2019 at 11:40:22AM -0500, Justin Pryzby wrote:
>On Sun, Apr 21, 2019 at 10:36:43AM -0400, Tom Lane wrote:
>> Jeff Janes <jeff.janes@gmail.com> writes:
>> > The growEnabled stuff only prevents infinite loops. It doesn't prevent
>> > extreme silliness.
>>
>> > If a single 32 bit hash value has enough tuples by itself to not fit in
>> > work_mem, then it will keep splitting until that value is in a batch by
>> > itself before shutting off
>>
>> I suspect, however, that we might be better off just taking the existence
>> of the I/O buffers into account somehow while deciding whether it's worth
>> growing further. That is, I'm imagining adding a second independent
>> reason for shutting off growEnabled, along the lines of "increasing
>> nbatch any further will require an unreasonable amount of buffer memory".
>> The question then becomes how to define "unreasonable".
>
>On Sun, Apr 21, 2019 at 06:15:25PM +0200, Tomas Vondra wrote:
>> I think the question the code needs to be asking is "If we double the
>> number of batches, does the amount of memory we need drop?" And the
>> memory needs to account both for the buffers and per-batch data.
>>
>> I don't think we can just stop increasing the number of batches when the
>> memory for BufFile exceeds work_mem, because that entirely ignores the
>> fact that by doing that we force the system to keep the per-batch stuff
>> in memory (and that can be almost arbitrary amount).
>...
>> Of course, this just stops enforcing work_mem at some point, but it at
>> least attempts to minimize the amount of memory used.
>
>This patch defines reasonable as "additional BatchFiles will not themselves
>exceed work_mem; OR, exceeded work_mem already but additional BatchFiles are
>going to save us RAM"...
>

OK.

>I think the first condition is insensitive and not too important to get right,
>it only allows work_mem to be exceeded by 2x, which maybe already happens for
>multiple reasons, related to this thread and otherwise. It'd be fine to slap
>on a factor of /2 or /4 or /8 there too.
>

TBH I'm not quite sure I understand all the conditions in the patch - it
seems unnecessarily complicated. And I don't think it actually minimizes
the amount of memory used for hash table + buffers, because it keeps the
same spaceAllowed (which pushes nbatches up). At some point it actually
makes sense to bump spaceAllowed and make larger batches instead of adding
more batches, and the patch does not seem to do that.

Also, the patch does this:

    if (hashtable->nbatch*sizeof(PGAlignedBlock) < hashtable->spaceAllowed)
    {
        ExecHashIncreaseNumBatches(hashtable);
    }
    else if (hashtable->spaceUsed/2 >= hashtable->spaceAllowed)
    {
        /* Exceeded spaceAllowed by 2x, so we'll save RAM by allowing nbatches to increase */
        /* I think this branch would be hit almost same as below branch */
        ExecHashIncreaseNumBatches(hashtable);
    }

... but the reasoning for the second branch seems wrong, because
(spaceUsed/2 >= spaceAllowed) is not enough to guarantee that we actually
save memory by doubling the number of batches. To do that, we need to make
sure that

    (spaceUsed/2 >= hashtable->nbatch * sizeof(PGAlignedBlock))

But that may not be true - it certainly is not guaranteed by not getting
into the first branch.
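To see where that crossover sits with the numbers from this thread, here is a tiny standalone sketch of the accounting. It is not the PostgreSQL code; it assumes each batch costs one 8 kB buffer (sizeof(PGAlignedBlock)) and that a split halves spaceUsed, the ideal case that the skewed data in this thread does not deliver:

#include <stdbool.h>
#include <stdio.h>

#define BLOCK_SZ 8192L              /* stands in for sizeof(PGAlignedBlock) */

/*
 * Total memory before a split: hash table + one buffer per batch.
 * After doubling (assuming an ideal split that halves spaceUsed):
 * half the hash table + twice as many buffers.  Doubling only saves
 * memory while spaceUsed/2 > nbatch * BLOCK_SZ.
 */
static bool doubling_saves_memory(long spaceUsed, long nbatch)
{
    long before = spaceUsed + nbatch * BLOCK_SZ;
    long after = spaceUsed / 2 + 2 * nbatch * BLOCK_SZ;

    return after < before;
}

int main(void)
{
    long spaceUsed = 4128933;       /* ~4MB, the "space = ..." value in the logs */
    long nbatch;

    for (nbatch = 16; nbatch <= 262144; nbatch *= 2)
        printf("nbatch = %6ld: doubling %s\n", nbatch,
               doubling_saves_memory(spaceUsed, nbatch) ? "saves memory"
                                                        : "wastes memory");
    return 0;
}

With a ~4 MB hash table the flip happens between 128 and 256 batches; every split beyond that point adds more buffer memory than it could possibly free even in the ideal case, yet the log above shows the join marching on to 262144 batches.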
Consider an ideal example with uniform distribution: create table small (id bigint, val text); create table large (id bigint, val text); insert into large select 1000000000 * random(), md5(i::text) from generate_series(1, 700000000) s(i); insert into small select 1000000000 * random(), md5(i::text) from generate_series(1, 10000) s(i); vacuum analyze large; vacuum analyze small; update pg_class set (relpages, reltuples) = (1000000, 1) where relname = 'large'; update pg_class set (relpages, reltuples) = (1, 1000000) where relname = 'small'; set work_mem = '1MB'; explain analyze select * from small join large using (id); A log after each call to ExecHashIncreaseNumBatches says this: nbatch=2 spaceUsed=463200 spaceAllowed=1048576 BufFile=16384 nbatch=4 spaceUsed=463120 spaceAllowed=1048576 BufFile=32768 nbatch=8 spaceUsed=457120 spaceAllowed=1048576 BufFile=65536 nbatch=16 spaceUsed=458320 spaceAllowed=1048576 BufFile=131072 nbatch=32 spaceUsed=457120 spaceAllowed=1048576 BufFile=262144 nbatch=64 spaceUsed=459200 spaceAllowed=1048576 BufFile=524288 nbatch=128 spaceUsed=455600 spaceAllowed=1048576 BufFile=1048576 nbatch=256 spaceUsed=525120 spaceAllowed=1048576 BufFile=2097152 nbatch=256 spaceUsed=2097200 spaceAllowed=1048576 BufFile=2097152 nbatch=512 spaceUsed=2097200 spaceAllowed=1048576 BufFile=4194304 nbatch=1024 spaceUsed=2097200 spaceAllowed=1048576 BufFile=8388608 nbatch=2048 spaceUsed=2097200 spaceAllowed=1048576 BufFile=16777216 nbatch=4096 spaceUsed=2097200 spaceAllowed=1048576 BufFile=33554432 nbatch=8192 spaceUsed=2097200 spaceAllowed=1048576 BufFile=67108864 nbatch=16384 spaceUsed=2097200 spaceAllowed=1048576 BufFile=134217728 So we've succeeded in keeping spaceUsed below 2*spaceAllowed (which seems rather confusing, BTW), but we've allocated 128MB for BufFile. So about 130MB in total. With 16k batches. What I think might work better is the attached v2 of the patch, with a single top-level condition, comparing the combined memory usage (spaceUsed + BufFile) against spaceAllowed. But it also tweaks spaceAllowed once the size needed for BufFile gets over work_mem/3. And it behaves like this: nbatch=2 spaceUsed=458640 spaceAllowed=1048576 BufFile=16384 nbatch=4 spaceUsed=455040 spaceAllowed=1048576 BufFile=32768 nbatch=8 spaceUsed=440160 spaceAllowed=1048576 BufFile=65536 nbatch=16 spaceUsed=426560 spaceAllowed=1048576 BufFile=131072 nbatch=32 spaceUsed=393200 spaceAllowed=1048576 BufFile=262144 nbatch=64 spaceUsed=329120 spaceAllowed=1572864 BufFile=524288 nbatch=128 spaceUsed=455600 spaceAllowed=3145728 BufFile=1048576 nbatch=256 spaceUsed=987440 spaceAllowed=6291456 BufFile=2097152 nbatch=512 spaceUsed=2040560 spaceAllowed=12582912 BufFile=4194304 nbatch=1024 spaceUsed=4114640 spaceAllowed=25165824 BufFile=8388608 nbatch=2048 spaceUsed=8302880 spaceAllowed=50331648 BufFile=16777216 So we end up with just 2k batches, using ~24MB of memory in total. That's because the spaceAllowed limit was bumped up instead of adding more and more batches. >The current patch doesn't unset growEnabled, since there's no point at which >the hashtable should grow without bound: if hash tables are *already* exceeding >work_mem by 2x as big, nbatches should be doubled. > Not sure. I guess it might be useful to re-evaluate the flag after a while - not necessarily by actually enabling it right away, but just checking if it'd move any tuples. Just disabling it once may be an issue when the input data is somehow correlated, which seems to be one of the issues with the data set discussed in this thread. 
regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
On Sun, Apr 21, 2019 at 11:40:22AM -0500, Justin Pryzby wrote:
>On Sun, Apr 21, 2019 at 10:36:43AM -0400, Tom Lane wrote:
>> Jeff Janes <jeff.janes@gmail.com> writes:
>> > The growEnabled stuff only prevents infinite loops. It doesn't prevent
>> > extreme silliness.
>>
>> > If a single 32 bit hash value has enough tuples by itself to not fit in
>> > work_mem, then it will keep splitting until that value is in a batch by
>> > itself before shutting off
>>
>> I suspect, however, that we might be better off just taking the existence
>> of the I/O buffers into account somehow while deciding whether it's worth
>> growing further. That is, I'm imagining adding a second independent
>> reason for shutting off growEnabled, along the lines of "increasing
>> nbatch any further will require an unreasonable amount of buffer memory".
>> The question then becomes how to define "unreasonable".
>
>On Sun, Apr 21, 2019 at 06:15:25PM +0200, Tomas Vondra wrote:
>> I think the question the code needs to be asking is "If we double the
>> number of batches, does the amount of memory we need drop?" And the
>> memory needs to account both for the buffers and per-batch data.
>>
>> I don't think we can just stop increasing the number of batches when the
>> memory for BufFile exceeds work_mem, because that entirely ignores the
>> fact that by doing that we force the system to keep the per-batch stuff
>> in memory (and that can be almost arbitrary amount).
>...
>> Of course, this just stops enforcing work_mem at some point, but it at
>> least attempts to minimize the amount of memory used.
>
>This patch defines reasonable as "additional BatchFiles will not themselves
>exceed work_mem; OR, exceeded work_mem already but additional BatchFiles are
>going to save us RAM"...
>
OK.
>I think the first condition is insensitive and not too important to get right,
>it only allows work_mem to be exceeded by 2x, which maybe already happens for
>multiple reasons, related to this thread and otherwise. It'd be fine to slap
>on a factor of /2 or /4 or /8 there too.
>
TBH I'm not quite sure I understand all the conditions in the patch - it
seems unnecessarily complicated. And I don't think it actually minimizes
the amount of memory used for hash table + buffers, because it keeps the
same spaceAllowed (which pushes nbatches up). At some point it actually
makes sense to bump spaceAllowed and make larger batches instead of adding
more batches, and the patch does not seem to do that.
Also, the patch does this:
if (hashtable->nbatch*sizeof(PGAlignedBlock) < hashtable->spaceAllowed)
{
ExecHashIncreaseNumBatches(hashtable);
}
else if (hashtable->spaceUsed/2 >= hashtable->spaceAllowed)
{
/* Exceeded spaceAllowed by 2x, so we'll save RAM by allowing nbatches to increase */
/* I think this branch would be hit almost the same as the branch below */
ExecHashIncreaseNumBatches(hashtable);
}
...
but the reasoning for the second branch seems wrong, because
(spaceUsed/2 >= spaceAllowed)
is not enough to guarantee that we actually save memory by doubling the
number of batches. To do that, we need to make sure that
(spaceUsed/2 >= hashtable->nbatch * sizeof(PGAlignedBlock))
But that may not be true - it certainly is not guaranteed by not getting
into the first branch.
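To spell that out - a small self-contained sketch of the saving condition
(the HashState struct is a stand-in for the few hash-join fields involved,
not PostgreSQL's actual executor types):

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define BLCKSZ 8192   /* stand-in for sizeof(PGAlignedBlock) */

/* stand-in for the hash-join fields the condition needs */
typedef struct
{
    size_t spaceUsed;   /* bytes currently in the in-memory hash table */
    size_t nbatch;      /* current number of batches */
} HashState;

/*
 * Doubling nbatch halves the hash table (roughly spaceUsed/2) but also
 * doubles the BufFile buffers (one block per batch).  It only saves
 * memory when the halved table still outweighs the added buffers.
 */
static bool
doubling_saves_memory(const HashState *h)
{
    return h->spaceUsed / 2 >= h->nbatch * BLCKSZ;
}

int
main(void)
{
    HashState h = {2097200, 8192};   /* values from the log below */
    printf("doubling helps: %s\n", doubling_saves_memory(&h) ? "yes" : "no");
    return 0;
}

With the values from the log below (nbatch=8192, spaceUsed=2097200) this
prints "no" - doubling buys nothing, yet the patch keeps doubling.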
Consider an ideal example with uniform distribution:
create table small (id bigint, val text);
create table large (id bigint, val text);
insert into large select 1000000000 * random(), md5(i::text)
from generate_series(1, 700000000) s(i);
insert into small select 1000000000 * random(), md5(i::text)
from generate_series(1, 10000) s(i);
vacuum analyze large;
vacuum analyze small;
update pg_class set (relpages, reltuples) = (1000000, 1)
where relname = 'large';
update pg_class set (relpages, reltuples) = (1, 1000000)
where relname = 'small';
set work_mem = '1MB';
explain analyze select * from small join large using (id);
A log after each call to ExecHashIncreaseNumBatches says this:
nbatch=2 spaceUsed=463200 spaceAllowed=1048576 BufFile=16384
nbatch=4 spaceUsed=463120 spaceAllowed=1048576 BufFile=32768
nbatch=8 spaceUsed=457120 spaceAllowed=1048576 BufFile=65536
nbatch=16 spaceUsed=458320 spaceAllowed=1048576 BufFile=131072
nbatch=32 spaceUsed=457120 spaceAllowed=1048576 BufFile=262144
nbatch=64 spaceUsed=459200 spaceAllowed=1048576 BufFile=524288
nbatch=128 spaceUsed=455600 spaceAllowed=1048576 BufFile=1048576
nbatch=256 spaceUsed=525120 spaceAllowed=1048576 BufFile=2097152
nbatch=256 spaceUsed=2097200 spaceAllowed=1048576 BufFile=2097152
nbatch=512 spaceUsed=2097200 spaceAllowed=1048576 BufFile=4194304
nbatch=1024 spaceUsed=2097200 spaceAllowed=1048576 BufFile=8388608
nbatch=2048 spaceUsed=2097200 spaceAllowed=1048576 BufFile=16777216
nbatch=4096 spaceUsed=2097200 spaceAllowed=1048576 BufFile=33554432
nbatch=8192 spaceUsed=2097200 spaceAllowed=1048576 BufFile=67108864
nbatch=16384 spaceUsed=2097200 spaceAllowed=1048576 BufFile=134217728
So we've succeeded in keeping spaceUsed below 2*spaceAllowed (which
seems rather confusing, BTW), but we've allocated 128MB for BufFile. So
about 130MB in total. With 16k batches.
What I think might work better is the attached v2 of the patch, with a
single top-level condition, comparing the combined memory usage
(spaceUsed + BufFile) against spaceAllowed. But it also tweaks
spaceAllowed once the size needed for BufFile gets over work_mem/3.
And it behaves like this:
nbatch=2 spaceUsed=458640 spaceAllowed=1048576 BufFile=16384
nbatch=4 spaceUsed=455040 spaceAllowed=1048576 BufFile=32768
nbatch=8 spaceUsed=440160 spaceAllowed=1048576 BufFile=65536
nbatch=16 spaceUsed=426560 spaceAllowed=1048576 BufFile=131072
nbatch=32 spaceUsed=393200 spaceAllowed=1048576 BufFile=262144
nbatch=64 spaceUsed=329120 spaceAllowed=1572864 BufFile=524288
nbatch=128 spaceUsed=455600 spaceAllowed=3145728 BufFile=1048576
nbatch=256 spaceUsed=987440 spaceAllowed=6291456 BufFile=2097152
nbatch=512 spaceUsed=2040560 spaceAllowed=12582912 BufFile=4194304
nbatch=1024 spaceUsed=4114640 spaceAllowed=25165824 BufFile=8388608
nbatch=2048 spaceUsed=8302880 spaceAllowed=50331648 BufFile=16777216
So we end up with just 2k batches, using ~24MB of memory in total.
That's because the spaceAllowed limit was bumped up instead of adding
more and more batches.
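Schematically, the shape of that v2 logic is something like the following
sketch (stand-in names and simplified growth factors - the patch's exact
conditions differ):

#include <stddef.h>
#include <stdio.h>

#define BLCKSZ 8192            /* one BufFile buffer per batch */

typedef struct
{
    size_t spaceUsed;          /* in-memory hash table bytes */
    size_t spaceAllowed;       /* budget, initially work_mem */
    size_t nbatch;
    size_t work_mem;
} HashState;

/* single top-level condition: hash table plus batch buffers vs. budget */
static int
over_budget(const HashState *h)
{
    return h->spaceUsed + h->nbatch * BLCKSZ > h->spaceAllowed;
}

/*
 * Once the buffers alone need more than work_mem/3, adding batches
 * mostly buys more buffers, so grow the budget (larger batches)
 * instead of splitting further.
 */
static void
react_to_overflow(HashState *h)
{
    if (h->nbatch * BLCKSZ > h->work_mem / 3)
        h->spaceAllowed *= 2;
    else
        h->nbatch *= 2;
}

int
main(void)
{
    /* illustrative values, in the spirit of the log above */
    HashState h = {455600, 1048576, 128, 1048576};

    if (over_budget(&h))
        react_to_overflow(&h);
    printf("nbatch=%zu spaceAllowed=%zu\n", h.nbatch, h.spaceAllowed);
    return 0;
}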
>The current patch doesn't unset growEnabled, since there's no point at which
>the hashtable should grow without bound: if hash tables are *already* exceeding
>work_mem by 2x as big, nbatches should be doubled.
>
Not sure. I guess it might be useful to re-evaluate the flag after a
while - not necessarily by actually enabling it right away, but just
checking if it'd move any tuples. Just disabling it once may be an issue
when the input data is somehow correlated, which seems to be one of the
issues with the data set discussed in this thread.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Apr 22, 2019 at 10:07:52AM +0200, Gaetano Mendola wrote:
> Batch splitting shouldn't be followed by a hash function change?

What would be the value? That can help with hash collisions, but that's not the issue with the data sets discussed in this thread. The issue reported originally is about underestimates, and the sample data set has a large number of duplicate values (a single value representing ~10% of the data set). Neither of those issues is about hash collisions. The data set I used to demonstrate how the algorithms work is pretty perfect, with uniform distribution and no hash collisions.

Furthermore, I don't think we can just change the hash function, for a couple of technical reasons.

Firstly, it's not like we totally redistribute the whole dataset from N old batches to (2*N) new ones. By using the same 32-bit hash value and considering one extra bit, the tuples either stay in the same batch (when the new bit is 0) or move to a single new batch (when it's 1). So each batch is split in two. By changing the hash function this would no longer be true, and we'd redistribute pretty much the whole data set.

The other issue is even more significant - we don't redistribute the tuples immediately. We only redistribute the current batch, but leave the other batches alone and handle them when we actually get to them. This is possible because the tuples never move backwards - when splitting batch K, the tuples either stay in K or move to 2K. Or something like that, I'm too lazy to recall the exact formula now.

And if I recall correctly, I think we can increment the number of batches while already performing the join, after some rows were already processed. That would probably be no longer true if we just switched the hash function, because it might move rows backwards (to the already processed region).

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
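As a toy model of that "one extra bit" split (simplified - the real
executor shifts past the bucket bits first, but the idea is the same):

#include <stdint.h>
#include <stdio.h>

/* low bits of the 32-bit hash pick the batch; nbatch is a power of two */
static uint32_t
batch_of(uint32_t hash, uint32_t nbatch)
{
    return hash & (nbatch - 1);
}

int
main(void)
{
    uint32_t hash = 0xDEADBEEFu;
    uint32_t before = batch_of(hash, 8);    /* batch with 8 batches  */
    uint32_t after  = batch_of(hash, 16);   /* batch after doubling  */

    /* 'after' is either 'before' (extra bit = 0) or 'before' + 8
     * (extra bit = 1): each batch splits in two, and tuples never
     * move backwards into an already-processed batch */
    printf("before=%u after=%u\n", (unsigned) before, (unsigned) after);
    return 0;
}

Changing the hash function instead would reshuffle all 32 bits and break
both of those properties at once.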
Gunther <raj@gusw.net> writes:
> and checked my log file and there was nothing before the call
> MemoryContextStats(TopPortalContext) so I don't understand where this
> printf stuff is ending up.
It's going to stdout, which is likely block-buffered whereas stderr
is line-buffered, so data from the latter will show up in your log
file much sooner. You might consider adding something to startup
to switch stdout to line buffering.
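For instance, that switch is a one-liner at process startup - a minimal
standalone sketch, not PostgreSQL source:

#include <stdio.h>

int
main(void)
{
    /* switch stdout from block- to line-buffering so printf output
       reaches the log at each newline, the way stderr already does */
    setvbuf(stdout, NULL, _IOLBF, 0);

    printf("flushed at the newline, not when the buffer fills\n");
    return 0;
}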
Jeff Janes <jeff.janes@gmail.com> writes:
> Is there a reason to not just elog the HJDEBUG stuff?

Yes --- it'd be expensive (a "no op" elog is far from free) and useless to ~99.999% of users.

Almost all the conditionally-compiled debug support in the PG executor is legacy leftovers from Berkeley days. If it were useful more often than once in a blue moon, we probably would have made it more easily reachable long ago. I'm a bit surprised we haven't just ripped it out, TBH. When I find myself needing extra debug output, it's almost never the case that any of that old code does what I need.

There might be a case for changing it all to print to stderr not stdout, so that it plays more nicely with elog/ereport output when you do have it turned on, though.

regards, tom lane
What I think might work better is the attached v2 of the patch, with a
single top-level condition, comparing the combined memory usage
(spaceUsed + BufFile) against spaceAllowed. But it also tweaks
spaceAllowed once the size needed for BufFile gets over work_mem/3.
Thanks for this, and I am trying this now.
So far it is promising.
I see the memory footprint contained under 1 GB. I see it go up, but also come down again. CPU, IO, all staying live.
foo=# set enable_nestloop to off;
SET
foo=# explain analyze select * from reports.v_BusinessOperation;
WARNING: ExecHashIncreaseNumBatches: nbatch=32 spaceAllowed=4194304
WARNING: ExecHashIncreaseNumBatches: nbatch=64 spaceAllowed=4194304
WARNING: ExecHashIncreaseNumBatches: nbatch=128 spaceAllowed=4194304
WARNING: ExecHashIncreaseNumBatches: nbatch=256 spaceAllowed=6291456
WARNING: ExecHashIncreaseNumBatches: nbatch=512 spaceAllowed=12582912
WARNING: ExecHashIncreaseNumBatches: nbatch=1024 spaceAllowed=25165824
WARNING: ExecHashIncreaseNumBatches: nbatch=2048 spaceAllowed=50331648
WARNING: ExecHashIncreaseNumBatches: nbatch=4096 spaceAllowed=100663296
WARNING: ExecHashIncreaseNumBatches: nbatch=8192 spaceAllowed=201326592
WARNING: ExecHashIncreaseNumBatches: nbatch=16384 spaceAllowed=402653184
WARNING: ExecHashIncreaseNumBatches: nbatch=32768 spaceAllowed=805306368
WARNING: ExecHashIncreaseNumBatches: nbatch=65536 spaceAllowed=1610612736
Aaaaaand, it's a winner!
Unique (cost=5551524.36..5554207.33 rows=34619 width=1197) (actual time=6150303.060..6895451.210 rows=435274 loops=1)
  ->  Sort (cost=5551524.36..5551610.91 rows=34619 width=1197) (actual time=6150303.058..6801372.192 rows=113478386 loops=1)
        Sort Key: documentinformationsubject.documentinternalid, documentinformationsubject.is_current, documentinformationsubject.documentid, documentinformationsubject.documenttypecode, documentinformationsubject.subjectroleinternalid, documentinformationsubject.subjectentityinternalid, documentinformationsubject.subjectentityid, documentinformationsubject.subjectentityidroot, documentinformationsubject.subjectentityname, documentinformationsubject.subjectentitytel, documentinformationsubject.subjectentityemail, documentinformationsubject.otherentityinternalid, documentinformationsubject.confidentialitycode, documentinformationsubject.actinternalid, documentinformationsubject.code_code, documentinformationsubject.code_displayname, q.code_code, q.code_displayname, an.extension, an.root, documentinformationsubject_2.subjectentitycode, documentinformationsubject_2.subjectentitycodesystem, documentinformationsubject_2.effectivetime_low, documentinformationsubject_2.effectivetime_high, documentinformationsubject_2.statuscode, documentinformationsubject_2.code_code, agencyid.extension, agencyname.trivialname, documentinformationsubject_1.subjectentitycode, documentinformationsubject_1.subjectentityinternalid
        Sort Method: external merge Disk: 40726720kB
        ->  Hash Right Join (cost=4255031.53..5530808.71 rows=34619 width=1197) (actual time=325240.679..1044194.775 rows=113478386 loops=1)
              Hash Cond: (((q.documentinternalid)::text = (documentinformationsubject.documentinternalid)::text) AND ((r.targetinternalid)::text = (documentinformationsubject.actinternalid)::text))
              ->  Hash Right Join (cost=1341541.37..2612134.36 rows=13 width=341) (actual time=81093.327..81093.446 rows=236 loops=1)
                    Hash Cond: (((documentinformationsubject_2.documentinternalid)::text = (q.documentinternalid)::text) AND ((documentinformationsubject_2.actinternalid)::text = (q.actinternalid)::text))
                    ->  Gather (cost=31291.54..1301884.52 rows=1 width=219) (actual time=41920.563..41929.780 rows=0 loops=1)
                          Workers Planned: 2
                          Workers Launched: 2
                          ->  Parallel Hash Left Join (cost=30291.54..1300884.42 rows=1 width=219) (actual time=41915.960..41915.960 rows=0 loops=3)
                                Hash Cond: ((documentinformationsubject_2.otherentityinternalid)::text = (agencyid.entityinternalid)::text)
                                ->  Parallel Hash Left Join (cost=28606.13..1299199.00 rows=1 width=204) (actual time=41862.767..41862.769 rows=0 loops=3)
                                      Hash Cond: ((documentinformationsubject_2.otherentityinternalid)::text = (agencyname.entityinternalid)::text)
                                      ->  Parallel Seq Scan on documentinformationsubject documentinformationsubject_2 (cost=0.00..1268800.85 rows=1 width=177) (actual time=40805.337..40805.337 rows=0 loops=3)
                                            Filter: ((participationtypecode)::text = 'AUT'::text)
                                            Rows Removed by Filter: 2815562
                                      ->  Parallel Hash (cost=24733.28..24733.28 rows=166628 width=64) (actual time=981.000..981.001 rows=133303 loops=3)
                                            Buckets: 65536 Batches: 16 Memory Usage: 3136kB
                                            ->  Parallel Seq Scan on bestname agencyname (cost=0.00..24733.28 rows=166628 width=64) (actual time=0.506..916.816 rows=133303 loops=3)
                                ->  Parallel Hash (cost=1434.07..1434.07 rows=20107 width=89) (actual time=52.350..52.350 rows=11393 loops=3)
                                      Buckets: 65536 Batches: 1 Memory Usage: 4680kB
                                      ->  Parallel Seq Scan on entity_id agencyid (cost=0.00..1434.07 rows=20107 width=89) (actual time=0.376..46.875 rows=11393 loops=3)
                    ->  Hash (cost=1310249.63..1310249.63 rows=13 width=233) (actual time=39172.682..39172.682 rows=236 loops=1)
                          Buckets: 1024 Batches: 1 Memory Usage: 70kB
                          ->  Hash Right Join (cost=829388.20..1310249.63 rows=13 width=233) (actual time=35084.850..39172.545 rows=236 loops=1)
                                Hash Cond: ((an.actinternalid)::text = (q.actinternalid)::text)
                                ->  Seq Scan on act_id an (cost=0.00..425941.04 rows=14645404 width=134) (actual time=0.908..7583.123 rows=14676871 loops=1)
                                ->  Hash (cost=829388.19..829388.19 rows=1 width=136) (actual time=29347.614..29347.614 rows=236 loops=1)
                                      Buckets: 1024 Batches: 1 Memory Usage: 63kB
                                      ->  Gather (cost=381928.46..829388.19 rows=1 width=136) (actual time=23902.428..29347.481 rows=236 loops=1)
                                            Workers Planned: 2
                                            Workers Launched: 2
                                            ->  Parallel Hash Join (cost=380928.46..828388.09 rows=1 width=136) (actual time=23915.790..29336.452 rows=79 loops=3)
                                                  Hash Cond: ((q.actinternalid)::text = (r.sourceinternalid)::text)
                                                  ->  Parallel Seq Scan on documentinformation q (cost=0.00..447271.93 rows=50050 width=99) (actual time=10055.238..15484.478 rows=87921 loops=3)
                                                        Filter: (((classcode)::text = 'CNTRCT'::text) AND ((moodcode)::text = 'EVN'::text) AND ((code_codesystem)::text = '2.16.840.1.113883.3.26.1.1'::text))
                                                        Rows Removed by Filter: 1540625
                                                  ->  Parallel Hash (cost=380928.44..380928.44 rows=1 width=74) (actual time=13825.726..13825.726 rows=79 loops=3)
                                                        Buckets: 1024 Batches: 1 Memory Usage: 112kB
                                                        ->  Parallel Seq Scan on actrelationship r (cost=0.00..380928.44 rows=1 width=74) (actual time=5289.948..13825.576 rows=79 loops=3)
                                                              Filter: ((typecode)::text = 'SUBJ'::text)
                                                              Rows Removed by Filter: 3433326
              ->  Hash (cost=2908913.87..2908913.87 rows=34619 width=930) (actual time=244145.322..244145.322 rows=113478127 loops=1)
                    Buckets: 8192 (originally 8192) Batches: 65536 (originally 16) Memory Usage: 1204250kB
                    ->  Gather Merge (cost=2892141.40..2908913.87 rows=34619 width=930) (actual time=75215.333..145622.427 rows=113478127 loops=1)
                          Workers Planned: 2
                          Workers Launched: 2
                          ->  Merge Left Join (cost=2891141.37..2903917.96 rows=14425 width=930) (actual time=75132.988..99411.448 rows=37826042 loops=3)
                                Merge Cond: (((documentinformationsubject.documentinternalid)::text = (documentinformationsubject_1.documentinternalid)::text) AND ((documentinformationsubject.documentid)::text = (documentinformationsubject_1.documentid)::text) AND ((documentinformationsubject.actinternalid)::text = (documentinformationsubject_1.actinternalid)::text))
                                ->  Sort (cost=1301590.26..1301626.32 rows=14425 width=882) (actual time=39801.337..40975.780 rows=231207 loops=3)
                                      Sort Key: documentinformationsubject.documentinternalid, documentinformationsubject.documentid, documentinformationsubject.actinternalid
                                      ...
                                      Sort Key: ..., documentinformationsubject_1.documentid, documentinformationsubject_1.actinternalid
                                      Sort Method: external merge Disk: 169768kB
                                      Worker 0: Sort Method: external merge Disk: 169768kB
                                      Worker 1: Sort Method: external merge Disk: 169768kB
                                      ->  Seq Scan on documentinformationsubject documentinformationsubject_1 (cost=0.00..1329868.64 rows=1010585 width=159) (actual time=23401.537..31758.042 rows=1031106 loops=3)
                                            Filter: ((participationtypecode)::text = 'PRD'::text)
                                            Rows Removed by Filter: 7415579
Planning Time: 40.559 ms
Execution Time: 6896581.566 ms
(70 rows)
For the first time this query has succeeded now. Memory was bounded. The time of nearly two hours is crazy, but things sometimes take that long. The important thing was not to get an out of memory error.
Thank you. Anything else you want to try, I can do it.
regards,
-Gunther
On Tue, Apr 23, 2019 at 04:37:50PM -0400, Gunther wrote:
> On 4/21/2019 23:09, Tomas Vondra wrote:
> >What I think might work better is the attached v2 of the patch, with a
> Thanks for this, and I am trying this now.
...
> Aaaaaand, it's a winner!
>
> Unique (cost=5551524.36..5554207.33 rows=34619 width=1197) (actual time=6150303.060..6895451.210 rows=435274 loops=1)
>   ->  Sort (cost=5551524.36..5551610.91 rows=34619 width=1197) (actual time=6150303.058..6801372.192 rows=113478386 loops=1)
>         Sort Method: external merge Disk: 40726720kB
>
> For the first time this query has succeeded now. Memory was bounded. The
> time of nearly two hours is crazy, but things sometimes take that long

It wrote 40GB tempfiles - perhaps you can increase work_mem now to improve the query time.

We didn't address it yet, but your issue was partially caused by a misestimate. It's almost certainly because these conditions are correlated, or maybe redundant.

> Merge Cond: (((documentinformationsubject.documentinternalid)::text = (documentinformationsubject_1.documentinternalid)::text) AND ((documentinformationsubject.documentid)::text = (documentinformationsubject_1.documentid)::text) AND ((documentinformationsubject.actinternalid)::text = (documentinformationsubject_1.actinternalid)::text))

If they're completely redundant and you can get the same result after dropping one or two of those conditions, then you should.

Alternately, if they're correlated but not redundant, you can use PG10 "dependency" statistics (CREATE STATISTICS) on the correlated columns (and ANALYZE).

On Tue, Apr 16, 2019 at 10:24:53PM -0400, Gunther wrote:
> Hash Right Join (cost=4203858.53..5475530.71 rows=34619 width=4) (actual time=309603.384..459480.863 rows=113478386 loops=1)
...
> -> Hash (cost=1310249.63..1310249.63 rows=13 width=111) (actual time=51077.049..51077.049 rows=236 loops=1)
...
> ->  Hash (cost=2861845.87..2861845.87 rows=34619 width=74) (actual time=199792.446..199792.446 rows=113478127 loops=1)
>       Buckets: 65536 (originally 65536) Batches: 131072 (originally 2) Memory Usage: 189207kB
>       ->  Gather Merge (cost=2845073.40..2861845.87 rows=34619 width=74) (actual time=107620.262..156256.432 rows=113478127 loops=1)
>             Workers Planned: 2
>             Workers Launched: 2
>             ->  Merge Left Join (cost=2844073.37..2856849.96 rows=14425 width=74) (actual time=107570.719..126113.792 rows=37826042 loops=3)
>                   Merge Cond: (((documentinformationsubject.documentinternalid)::text = (documentinformationsubject_1.documentinternalid)::text) AND ((documentinformationsubject.documentid)::text = (documentinformationsubject_1.documentid)::text) AND ((documentinformationsubject.actinternalid)::text = (documentinformationsubject_1.actinternalid)::text))
>                   ->  Sort (cost=1295969.26..1296005.32 rows=14425 width=111) (actual time=57700.723..58134.751 rows=231207 loops=3)
>                         Sort Key: documentinformationsubject.documentinternalid, documentinformationsubject.documentid, documentinformationsubject.actinternalid
>                         Sort Method: external merge Disk: 26936kB
>                         Worker 0: Sort Method: external merge Disk: 27152kB
>                         Worker 1: Sort Method: external merge Disk: 28248kB
>                         ->  Parallel Seq Scan on documentinformationsubject (cost=0.00..1294972.76 rows=14425 width=111) (actual time=24866.656..57424.420 rows=231207 loops=3)
>                               Filter: (((participationtypecode)::text = ANY ('{PPRF,PRF}'::text[])) AND ((classcode)::text = 'ACT'::text) AND ((moodcode)::text = 'DEF'::text) AND ((code_codesystem)::text = '2.16.840.1.113883.3.26.1.1'::text))
>                               Rows Removed by Filter: 2584355
>                   ->  Materialize (cost=1548104.12..1553157.04 rows=1010585 width=111) (actual time=49869.984..54191.701 rows=38060250 loops=3)
>                         ->  Sort (cost=1548104.12..1550630.58 rows=1010585 width=111) (actual time=49869.980..50832.205 rows=1031106 loops=3)
>                               Sort Key: documentinformationsubject_1.documentinternalid, documentinformationsubject_1.documentid, documentinformationsubject_1.actinternalid
>                               Sort Method: external merge Disk: 122192kB
>                               Worker 0: Sort Method: external merge Disk: 122192kB
>                               Worker 1: Sort Method: external merge Disk: 122192kB
>                               ->  Seq Scan on documentinformationsubject documentinformationsubject_1 (cost=0.00..1329868.64 rows=1010585 width=111) (actual time=20366.166..47751.267 rows=1031106 loops=3)
>                                     Filter: ((participationtypecode)::text = 'PRD'::text)
>                                     Rows Removed by Filter: 7415579
On Tue, Apr 23, 2019 at 03:43:48PM -0500, Justin Pryzby wrote:
>On Tue, Apr 23, 2019 at 04:37:50PM -0400, Gunther wrote:
>> On 4/21/2019 23:09, Tomas Vondra wrote:
>> >What I think might work better is the attached v2 of the patch, with a
>> Thanks for this, and I am trying this now.
>...
>> Aaaaaand, it's a winner!
>>
>> Unique (cost=5551524.36..5554207.33 rows=34619 width=1197) (actual time=6150303.060..6895451.210 rows=435274 loops=1)
>>   ->  Sort (cost=5551524.36..5551610.91 rows=34619 width=1197) (actual time=6150303.058..6801372.192 rows=113478386 loops=1)
>>         Sort Method: external merge Disk: 40726720kB
>>
>> For the first time this query has succeeded now. Memory was bounded. The
>> time of nearly two hours is crazy, but things sometimes take that long
>
>It wrote 40GB tempfiles - perhaps you can increase work_mem now to improve the
>query time.
>

That's unlikely to reduce the amount of data written to temporary files, it just means there will be fewer larger files - in total it's still going to be ~40GB. And it's not guaranteed it'll improve performance, because work_mem=4MB might fit into CPU caches and larger values almost certainly won't. I don't think there's much to gain, really.

>We didn't address it yet, but your issue was partially caused by a misestimate.
>It's almost certainly because these conditions are correlated, or maybe
>redundant.
>

Right. Chances are that with a better estimate the optimizer would pick merge join instead. I wonder if that would be significantly faster.

>> Merge Cond: (((documentinformationsubject.documentinternalid)::text =
>> (documentinformationsubject_1.documentinternalid)::text) AND
>> ((documentinformationsubject.documentid)::text =
>> (documentinformationsubject_1.documentid)::text) AND
>> ((documentinformationsubject.actinternalid)::text =
>> (documentinformationsubject_1.actinternalid)::text))
>
>If they're completely redundant and you can get the same result after
>dropping one or two of those conditions, then you should.
>
>Alternately, if they're correlated but not redundant, you can use PG10
>"dependency" statistics (CREATE STATISTICS) on the correlated columns
>(and ANALYZE).
>

That's not going to help, because we don't use functional dependencies in join estimation yet.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Apr 23, 2019 at 04:37:50PM -0400, Gunther wrote:
> On 4/21/2019 23:09, Tomas Vondra wrote:
>
> What I think might work better is the attached v2 of the patch, with a
> single top-level condition, comparing the combined memory usage
> (spaceUsed + BufFile) against spaceAllowed. But it also tweaks
> spaceAllowed once the size needed for BufFile gets over work_mem/3.
>
> Thanks for this, and I am trying this now.
>
> So far it is promising.
>
> I see the memory footprint contained under 1 GB. I see it go up, but also
> come down again. CPU, IO, all staying live.
>
> foo=# set enable_nestloop to off;
> SET
> foo=# explain analyze select * from reports.v_BusinessOperation;
> WARNING: ExecHashIncreaseNumBatches: nbatch=32 spaceAllowed=4194304
> WARNING: ExecHashIncreaseNumBatches: nbatch=64 spaceAllowed=4194304
> WARNING: ExecHashIncreaseNumBatches: nbatch=128 spaceAllowed=4194304
> WARNING: ExecHashIncreaseNumBatches: nbatch=256 spaceAllowed=6291456
> WARNING: ExecHashIncreaseNumBatches: nbatch=512 spaceAllowed=12582912
> WARNING: ExecHashIncreaseNumBatches: nbatch=1024 spaceAllowed=25165824
> WARNING: ExecHashIncreaseNumBatches: nbatch=2048 spaceAllowed=50331648
> WARNING: ExecHashIncreaseNumBatches: nbatch=4096 spaceAllowed=100663296
> WARNING: ExecHashIncreaseNumBatches: nbatch=8192 spaceAllowed=201326592
> WARNING: ExecHashIncreaseNumBatches: nbatch=16384 spaceAllowed=402653184
> WARNING: ExecHashIncreaseNumBatches: nbatch=32768 spaceAllowed=805306368
> WARNING: ExecHashIncreaseNumBatches: nbatch=65536 spaceAllowed=1610612736
>
> Aaaaaand, it's a winner!
>

Good ;-)

> Unique (cost=5551524.36..5554207.33 rows=34619 width=1197) (actual time=6150303.060..6895451.210 rows=435274 loops=1)
>   ->  Sort (cost=5551524.36..5551610.91 rows=34619 width=1197) (actual time=6150303.058..6801372.192 rows=113478386 loops=1)
>         Sort Key: ...
>         Sort Method: external merge Disk: 40726720kB
>         ->  Hash Right Join (cost=4255031.53..5530808.71 rows=34619 width=1197) (actual time=325240.679..1044194.775 rows=113478386 loops=1)
>               Hash Cond: ...
> ...
> Planning Time: 40.559 ms
> Execution Time: 6896581.566 ms
> (70 rows)
>
> For the first time this query has succeeded now. Memory was bounded. The
> time of nearly two hours is crazy, but things sometimes take that long. The
> important thing was not to get an out of memory error.
>

TBH I don't think there's much we can do to improve this further - it's a rather desperate effort to keep the memory usage as low as possible, without any real guarantees.

Also, the hash join only takes about 1000 seconds out of the 6900 total. So even if we got it much faster, the query would still take almost two hours, give or take.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
It wrote 40GB tempfiles - perhaps you can increase work_mem now to improve the query time.
I now upped my shared_buffers back from 1 to 2GB and work_mem from 4 to 16MB. I need to set vm.overcommit_ratio from 50 to 75 (percent, with vm.overcommit_memory = 2 as it is).
We didn't address it yet, but your issue was partially caused by a misestimate. It's almost certainly because these conditions are correlated, or maybe redundant.
That may be so, but mis-estimates happen. And I can still massively improve this query logically, I am sure. In fact it sticks out like a sore thumb: logically it makes no sense to churn over 100 million rows here. But the point is that hopefully PostgreSQL runs stable in such outlier situations, comes back and presents you with 2 hours of work time, 40 GB temp space, or whatever, and then we users can figure out how to make it work better. The frustrating thing is to get an out of memory error without knowing what we can possibly do about it.
From my previous attempt with this tmp_r and tmp_q table, I also know that the Sort/Unique step is taking a lot of extra time. I can cut that out too by addressing the causes of the "repeated result" rows. But again, that is all secondary optimizations.
I understand. You are saying by reducing the amount of columns in the join condition, somehow you might be able to reduce the size of the hash temporary table?

Merge Cond: (((documentinformationsubject.documentinternalid)::text = (documentinformationsubject_1.documentinternalid)::text) AND ((documentinformationsubject.documentid)::text = (documentinformationsubject_1.documentid)::text) AND ((documentinformationsubject.actinternalid)::text = (documentinformationsubject_1.actinternalid)::text))

If they're completely redundant and you can get the same result after dropping one or two of those conditions, then you should.
Alternately, if they're correlated but not redundant, you can use PG10 "dependency" statistics (CREATE STATISTICS) on the correlated columns (and ANALYZE).
I think documentId and documentInternalId are 1:1; they are both primary / alternate keys. So I could go with only one of them, but since I end up needing both elsewhere inside the query I like to throw them all into the natural join key, so that I don't have to deal with the duplicate result columns.
Now running:
integrator=# set enable_nestloop to off;
SET
integrator=# explain analyze select * from reports.v_BusinessOperation;
WARNING: ExecHashIncreaseNumBatches: nbatch=8 spaceAllowed=16777216
WARNING: ExecHashIncreaseNumBatches: nbatch=16 spaceAllowed=16777216
WARNING: ExecHashIncreaseNumBatches: nbatch=32 spaceAllowed=16777216
WARNING: ExecHashIncreaseNumBatches: nbatch=64 spaceAllowed=16777216
WARNING: ExecHashIncreaseNumBatches: nbatch=128 spaceAllowed=16777216
WARNING: ExecHashIncreaseNumBatches: nbatch=256 spaceAllowed=16777216
WARNING: ExecHashIncreaseNumBatches: nbatch=512 spaceAllowed=16777216
WARNING: ExecHashIncreaseNumBatches: nbatch=1024 spaceAllowed=25165824
WARNING: ExecHashIncreaseNumBatches: nbatch=2048 spaceAllowed=50331648
WARNING: ExecHashIncreaseNumBatches: nbatch=4096 spaceAllowed=100663296
WARNING: ExecHashIncreaseNumBatches: nbatch=8192 spaceAllowed=201326592
WARNING: ExecHashIncreaseNumBatches: nbatch=16384 spaceAllowed=402653184
WARNING: ExecHashIncreaseNumBatches: nbatch=32768 spaceAllowed=805306368
WARNING: ExecHashIncreaseNumBatches: nbatch=65536 spaceAllowed=1610612736
I am waiting now, probably for that Sort/Unique to finish. I think that the vast majority of the time is spent in this sort:
Unique (cost=5551524.36..5554207.33 rows=34619 width=1197) (actual time=6150303.060..6895451.210 rows=435274 loops=1)
  ->  Sort (cost=5551524.36..5551610.91 rows=34619 width=1197) (actual time=6150303.058..6801372.192 rows=113478386 loops=1)
        Sort Key: documentinformationsubject.documentinternalid, documentinformationsubject.is_current, documentinformationsubject.documentid, documentinformationsubject.documenttypecode, documentinformationsubject.subjectroleinternalid, documentinformationsubject.subjectentityinternalid, documentinformationsubject.subjectentityid, documentinformationsubject.subjectentityidroot, documentinformationsubject.subjectentityname, documentinformationsubject.subjectentitytel, documentinformationsubject.subjectentityemail, documentinformationsubject.otherentityinternalid, documentinformationsubject.confidentialitycode, documentinformationsubject.actinternalid, documentinformationsubject.code_code, documentinformationsubject.code_displayname, q.code_code, q.code_displayname, an.extension, an.root, documentinformationsubject_2.subjectentitycode, documentinformationsubject_2.subjectentitycodesystem, documentinformationsubject_2.effectivetime_low, documentinformationsubject_2.effectivetime_high, documentinformationsubject_2.statuscode, documentinformationsubject_2.code_code, agencyid.extension, agencyname.trivialname, documentinformationsubject_1.subjectentitycode, documentinformationsubject_1.subjectentityinternalid
        Sort Method: external merge Disk: 40726720kB
        ->  Hash Right Join (cost=4255031.53..5530808.71 rows=34619 width=1197) (actual time=325240.679..1044194.775 rows=113478386 loops=1)
isn't it?
Unique/Sort actual time 6,150,303.060 ms = 6,150 s <~ 2 h.
Hash Right Join actual time 325,240.679 ms.
So really all time is wasted in that sort, no need for you guys to worry about anything else with these 2 hours. Tomas just stated the same thing.
Right. Chances are that with a better estimate the optimizer would pick merge join instead. I wonder if that would be significantly faster.

The prospect of a merge join is interesting here to consider: with the Sort/Unique step taking so long, it seems the Merge Join might also take a lot of time? I see my disks are churning for the most time in this way:
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           7.50    0.00    2.50   89.50    0.00    0.50

Device:    rrqm/s  wrqm/s    r/s    w/s  rMB/s  wMB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
nvme1n1      0.00    0.00 253.00 131.00  30.15  32.20   332.50     2.01   8.40    8.41    8.37   2.59  99.60
nvme1n1p24   0.00    0.00 253.00 131.00  30.15  32.20   332.50     2.01   8.40    8.41    8.37   2.59  99.60
I.e. 400 IOPS at 60 MB/s half of it read, half of it write. During the previous steps, the hash join presumably, throughput was a lot higher, like 2000 IOPS with 120 MB/s read or write.
But even if the Merge Join would have taken about the same or a little more time than the Hash Join, I wonder if one could not use that to collapse the Sort/Unique step into it? It seems that after the Sort/Merge has completed, one should be able to read unique records without any further sorting. In that case the Merge would be a great advantage.
What I like about the situation now is that with that 4x bigger work_mem, the overall memory situation remains the same. I.e., we are scraping just below 1GB for this process and we see oscillation, growth and shrinkage occurring. So I consider this case closed for me. That doesn't mean I wouldn't be available if you guys want to try anything else about it.
OK, now here is the result with the 16 MB work_mem:
Unique (cost=5462874.86..5465557.83 rows=34619 width=1197) (actual time=6283539.282..7003311.451 rows=435274 loops=1)
  ->  Sort (cost=5462874.86..5462961.41 rows=34619 width=1197) (actual time=6283539.280..6908879.456 rows=113478386 loops=1)
        Sort Key: documentinformationsubject.documentinternalid, documentinformationsubject.is_current, documentinformationsubject.documentid, documentinformationsubject.documenttypecode, documentinformationsubject.subjectroleinternalid, documentinformationsubject.subjectentityinternalid, documentinformationsubject.subjectentityid, documentinformationsubject.subjectentityidroot, documentinformationsubject.subjectentityname, documentinformationsubject.subjectentitytel, documentinformationsubject.subjectentityemail, documentinformationsubject.otherentityinternalid, documentinformationsubject.confidentialitycode, documentinformationsubject.actinternalid, documentinformationsubject.code_code, documentinformationsubject.code_displayname, q.code_code, q.code_displayname, an.extension, an.root, documentinformationsubject_2.subjectentitycode, documentinformationsubject_2.subjectentitycodesystem, documentinformationsubject_2.effectivetime_low, documentinformationsubject_2.effectivetime_high, documentinformationsubject_2.statuscode, documentinformationsubject_2.code_code, agencyid.extension, agencyname.trivialname, documentinformationsubject_1.subjectentitycode, documentinformationsubject_1.subjectentityinternalid
        Sort Method: external merge Disk: 40726872kB
        ->  Hash Right Join (cost=4168174.03..5442159.21 rows=34619 width=1197) (actual time=337057.290..1695675.896 rows=113478386 loops=1)
              Hash Cond: (((q.documentinternalid)::text = (documentinformationsubject.documentinternalid)::text) AND ((r.targetinternalid)::text = (documentinformationsubject.actinternalid)::text))
              ->  Hash Right Join (cost=1339751.37..2608552.36 rows=13 width=341) (actual time=84109.143..84109.238 rows=236 loops=1)
                    Hash Cond: (((documentinformationsubject_2.documentinternalid)::text = (q.documentinternalid)::text) AND ((documentinformationsubject_2.actinternalid)::text = (q.actinternalid)::text))
                    ->  Gather (cost=29501.54..1298302.52 rows=1 width=219) (actual time=43932.534..43936.888 rows=0 loops=1)
                          Workers Planned: 2
                          Workers Launched: 2
                          ->  Parallel Hash Left Join (cost=28501.54..1297302.42 rows=1 width=219) (actual time=43925.304..43925.304 rows=0 loops=3)
                          ...
                    ->  Hash (cost=1310249.63..1310249.63 rows=13 width=233) (actual time=40176.581..40176.581 rows=236 loops=1)
                          Buckets: 1024 Batches: 1 Memory Usage: 70kB
                          ->  Hash Right Join (cost=829388.20..1310249.63 rows=13 width=233) (actual time=35925.031..40176.447 rows=236 loops=1)
                                Hash Cond: ((an.actinternalid)::text = (q.actinternalid)::text)
                                ->  Seq Scan on act_id an (cost=0.00..425941.04 rows=14645404 width=134) (actual time=1.609..7687.986 rows=14676871 loops=1)
                                ->  Hash (cost=829388.19..829388.19 rows=1 width=136) (actual time=30106.123..30106.123 rows=236 loops=1)
                                      Buckets: 1024 Batches: 1 Memory Usage: 63kB
                                      ->  Gather (cost=381928.46..829388.19 rows=1 width=136) (actual time=24786.510..30105.983 rows=236 loops=1)
                                      ...
              ->  Hash (cost=2823846.37..2823846.37 rows=34619 width=930) (actual time=252946.367..252946.367 rows=113478127 loops=1)
                    Buckets: 32768 (originally 32768) Batches: 65536 (originally 4) Memory Usage: 1204250kB
                    ->  Gather Merge (cost=2807073.90..2823846.37 rows=34619 width=930) (actual time=83891.069..153380.040 rows=113478127 loops=1)
                          Workers Planned: 2
                          Workers Launched: 2
                          ->  Merge Left Join (cost=2806073.87..2818850.46 rows=14425 width=930) (actual time=83861.921..108022.671 rows=37826042 loops=3)
                                Merge Cond: (((documentinformationsubject.documentinternalid)::text = (documentinformationsubject_1.documentinternalid)::text) AND ((documentinformationsubject.documentid)::text = (documentinformationsubject_1.documentid)::text) AND ((documentinformationsubject.actinternalid)::text = (documentinformationsubject_1.actinternalid)::text))
                                ->  Sort (cost=1295969.26..1296005.32 rows=14425 width=882) (actual time=44814.114..45535.398 rows=231207 loops=3)
                                      Sort Key: documentinformationsubject.documentinternalid, documentinformationsubject.docum...
                                ...
Planning Time: 2.953 ms
Execution Time: 7004340.091 ms
(70 rows)
There isn't really any big news here. But what matters is that it works.
thanks & regards,
-Gunther Schadow
On Tue, Apr 23, 2019 at 07:09:00PM -0400, Gunther wrote:
> On 4/23/2019 16:43, Justin Pryzby wrote:
>
> It wrote 40GB tempfiles - perhaps you can increase work_mem now to improve the
> query time.
>
> I now upped my shared_buffers back from 1 to 2GB and work_mem from 4 to
> 16MB. I need to set vm.overcommit_ratio from 50 to 75 (percent, with
> vm.overcommit_memory = 2 as it is).
>
> We didn't address it yet, but your issue was partially caused by a misestimate.
> It's almost certainly because these conditions are correlated, or maybe
> redundant.
>
> That may be so, but mis-estimates happen. And I can still massively
> improve this query logically, I am sure. In fact it sticks out like a sore
> thumb: logically it makes no sense to churn over 100 million rows here.
> But the point is that hopefully PostgreSQL runs stable in such outlier
> situations, comes back and presents you with 2 hours of work time, 40 GB
> temp space, or whatever, and then we users can figure out how to make it
> work better. The frustrating thing is to get an out of memory error
> without knowing what we can possibly do about it.
>

Sure. And I think the memory balancing algorithm implemented in the v2 patch is a step in that direction. I think we can do better in terms of memory consumption (essentially keeping it closer to work_mem) but it's unlikely to be any faster.

In a way this is similar to underestimates in hash aggregate, except that in that case we don't have any spill-to-disk fallback at all.

> From my previous attempt with this tmp_r and tmp_q table, I also know that
> the Sort/Unique step is taking a lot of extra time. I can cut that out too
> by addressing the causes of the "repeated result" rows. But again, that is
> all secondary optimizations.
>
> Merge Cond: (((documentinformationsubject.documentinternalid)::text = (documentinformationsubject_1.documentinternalid)::text) AND ((documentinformationsubject.documentid)::text = (documentinformationsubject_1.documentid)::text) AND ((documentinformationsubject.actinternalid)::text = (documentinformationsubject_1.actinternalid)::text))
>
> If they're completely redundant and you can get the same result after dropping
> one or two of those conditions, then you should.
>
> I understand. You are saying by reducing the amount of columns in the join
> condition, somehow you might be able to reduce the size of the hash
> temporary table?
>

No. When estimating the join result size with multiple join clauses, the optimizer essentially has to compute

1: P((x1 = y1) && (x2 = y2) && (x3 = y3))

so it assumes statistical independence of those conditions and splits that into

2: P(x1 = y1) * P(x2 = y2) * P(x3 = y3)

But when those conditions are dependent - for example when (x1=y1) means that ((x2=y2) && (x3=y3)) - this results in a significant underestimate. E.g. let's assume that each of those conditions matches 1/100 rows, but that essentially x1=x2=x3 and y1=y2=y3. Then (1) is 1/100 but (2) ends up being 1/1000000, so 10000x off.

Chances are this is what's happening with the inner side of the hash join, which is estimated to return 14425 but ends up returning 37826042.
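As a quick numeric check of that example (a throwaway sketch, not optimizer
code):

#include <stdio.h>

int
main(void)
{
    double per_clause  = 1.0 / 100;                            /* each clause matches 1/100 */
    double independent = per_clause * per_clause * per_clause; /* what independence assumes */
    double dependent   = per_clause;                           /* actual, when x1=x2=x3, y1=y2=y3 */

    printf("assumed %g, actual %g, off by %.0fx\n",
           independent, dependent, dependent / independent);
    return 0;
}

This prints "assumed 1e-06, actual 0.01, off by 10000x" - the same factor
described above.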
There's one trick you might try, though - using indexes on composite types:

create table t1 (a int, b int);
create table t2 (a int, b int);

insert into t1 select mod(i,1000), mod(i,1000)
  from generate_series(1,100000) s(i);
insert into t2 select mod(i,1000), mod(i,1000)
  from generate_series(1,100000) s(i);

analyze t1;
analyze t2;

explain analyze select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b);

                              QUERY PLAN
--------------------------------------------------------------------
 Merge Join (cost=19495.72..21095.56 rows=9999 width=16) (actual time=183.043..10360.276 rows=10000000 loops=1)
   Merge Cond: ((t1.a = t2.a) AND (t1.b = t2.b))
   ...

create type composite_id as (a int, b int);

create index on t1 (((a,b)::composite_id));
create index on t2 (((a,b)::composite_id));

analyze t1;
analyze t2;

explain analyze select * from t1 join t2 on ((t1.a,t1.b)::composite_id = (t2.a,t2.b)::composite_id);

                              QUERY PLAN
--------------------------------------------------------------------------
 Merge Join (cost=0.83..161674.40 rows=9999460 width=16) (actual time=0.020..12726.767 rows=10000000 loops=1)
   Merge Cond: (ROW(t1.a, t1.b)::composite_id = ROW(t2.a, t2.b)::composite_id)

Obviously, that's not exactly free - you have to pay the price for the index creation, maintenance and storage.

> ...
> Unique/Sort actual time 6,150,303.060 ms = 6,150 s <~ 2 h.
> Hash Right Join actual time 325,240.679 ms.
>
> So really all time is wasted in that sort, no need for you guys to worry
> about anything else with these 2 hours. Tomas just stated the same thing.
>
> Right. Chances are that with a better estimate the optimizer would pick
> merge join instead. I wonder if that would be significantly faster.
>
> The prospect of a merge join is interesting here to consider: with the
> Sort/Unique step taking so long, it seems the Merge Join might also take a
> lot of time? I see my disks are churning for the most time in this way:
>
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            7.50    0.00    2.50   89.50    0.00    0.50
>
> Device:    rrqm/s  wrqm/s    r/s    w/s  rMB/s  wMB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
> nvme1n1      0.00    0.00 253.00 131.00  30.15  32.20   332.50     2.01   8.40    8.41    8.37   2.59  99.60
> nvme1n1p24   0.00    0.00 253.00 131.00  30.15  32.20   332.50     2.01   8.40    8.41    8.37   2.59  99.60
>
> I.e. 400 IOPS at 60 MB/s half of it read, half of it write. During the
> previous steps, the hash join presumably, throughput was a lot higher,
> like 2000 IOPS with 120 MB/s read or write.
>
> But even if the Merge Join would have taken about the same or a little
> more time than the Hash Join, I wonder if one could not use that to
> collapse the Sort/Unique step into it? It seems that after the
> Sort/Merge has completed, one should be able to read unique records without
> any further sorting. In that case the Merge would be a great advantage.
>

Probably not, because there are far more columns in the Unique step. We might have done something with "incremental sort" but we don't have that capability yet.

> What I like about the situation now is that with that 4x bigger work_mem,
> the overall memory situation remains the same. I.e., we are scraping just
> below 1GB for this process and we see oscillation, growth and shrinkage
> occurring. So I consider this case closed for me. That doesn't mean I
> wouldn't be available if you guys want to try anything else about it.
>
> OK, now here is the result with the 16 MB work_mem:
>
> Unique (cost=5462874.86..5465557.83 rows=34619 width=1197) (actual time=6283539.282..7003311.451 rows=435274 loops=1)
>   ->  Sort (cost=5462874.86..5462961.41 rows=34619 width=1197) (actual time=6283539.280..6908879.456 rows=113478386 loops=1)
> ...
> Planning Time: 2.953 ms
> Execution Time: 7004340.091 ms
> (70 rows)
>
> There isn't really any big news here. But what matters is that it works.
>

Yeah. Once the hash join outgrows the work_mem, the fallback logic starts ignoring that in the effort to keep the memory usage minimal.

I still think the idea with an "overflow batch" is worth considering, because it'd allow us to keep the memory usage within work_mem. And after getting familiar with the hash join code again (haven't messed with it since 9.5 or so) I think it should not be all that difficult. I'll give it a try over the weekend if I get bored for a while.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Apr 24, 2019 at 02:36:33AM +0200, Tomas Vondra wrote:
>
> ...
>
>I still think the idea with an "overflow batch" is worth considering,
>because it'd allow us to keep the memory usage within work_mem. And
>after getting familiar with the hash join code again (haven't messed
>with it since 9.5 or so) I think it should not be all that difficult.
>I'll give it a try over the weekend if I get bored for a while.
>

OK, so I took a stab at this, and overall it seems to be workable. The patches I have are nowhere near committable, but I think the approach works fairly well - the memory is kept in check, and the performance is comparable to the "balancing" approach tested before.

To explain it a bit, the idea is that we can compute how many BufFile structures we can keep in memory - we can't use more than work_mem/2 for that, because then we'd mostly eliminate space for the actual data. For example with 4MB, we know we can keep 128 batches - we need 128 for the outer and inner side, so 256 in total, and 256*8kB = 2MB.

And then, we just increase the number of batches but instead of adding the BufFile entries, we split batches into slices that we can keep in memory (say, the 128 batches). And we keep BufFiles for the current one and an "overflow file" for the other slices. After processing a slice, we simply switch to the next one, and use the overflow file as a temp file for the first batch - we redistribute it into the other batches in the slice and another overflow file.

That's what the v3 patch (named 'single overflow file') does. It does work, but unfortunately it significantly inflates the amount of data written to temporary files. Assume we need e.g. 1024 batches, but only 128 fit into memory. That means we'll need 8 slices, and during the first pass we'll handle 1/8 of the data and write 7/8 to the overflow file. Then after processing the slice and switching to the next one, we repeat this dance - 1/8 gets processed, 6/8 written to another overflow file. So essentially we "forward" about

7/8 + 6/8 + 5/8 + ... + 1/8 = 28/8 = 3.5

of the data between slices, and we need to re-shuffle data in each slice, which amounts to an additional 1x of the data. That's pretty significant overhead, as will be clear from the measurements I'll present shortly.

But luckily, there's a simple solution to this - instead of writing the data into a single overflow file, we can create one overflow file for each slice. That will leave us with the ~1x of additional writes when distributing data into batches in the current slice, but it eliminates the main source of write amplification - avalanche-like forwarding of data between slices.

This relaxes the memory limit a bit again, because we can't really keep the number of overflow files constrained by work_mem, but we should only need a few of them (much less than when adding one file per batch right away). For example with 128 in-memory batches, this reduces the amount of necessary memory 128x.

And this is what v4 (per-slice overflow file) does, pretty much.

Two more comments, regarding memory accounting in previous patches. It was a bit broken, because we actually need 2x the number of BufFiles. We needed nbatch files for the outer side and nbatch files for the inner side, but we only considered one of those - both when deciding when to increase the number of batches / increase spaceAllowed, and when reporting the memory usage.
So with a large number of batches the reported amount of used memory was roughly 1/2 of the actual value :-/

The memory accounting was a bit bogus for another reason - spaceUsed simply tracks the amount of memory for hash table contents. But at the end we were simply adding the current space for BufFile stuff, ignoring the fact that that's likely much larger than when the spacePeak value got stored. For example we might have kept early spaceUsed when it was almost work_mem, and then added the final large BufFile allocation.

I've fixed both issues in the patches attached to this message. It does not make a huge difference in practice, but it makes it easier to compare values between patches.

Now, some test results - I've repeated the simple test with a uniform data set, which is pretty much ideal for hash joins (no unexpectedly large batches that can't be split, etc.). I've done this with 1M, 5M, 10M, 25M and 50M rows in the large table (which gets picked for the "hash" side), and measured how much memory gets used, how many batches, how long it takes and how much data gets written to temp files. See the hashjoin-test.sh script for more details.

So, here are the results with work_mem = 4MB (so the number of in-memory batches for the last two entries is 128). The columns are:

* nbatch - the final number of batches
* memory - memory usage, as reported by explain analyze
* time - duration of the query (without explain analyze) in seconds
* size - size of the large table
* temp - amount of data written to temp files
* amplif - write amplification (temp / size)

1M rows
===================================================================
                 nbatch  memory  time  size (MB)  temp (MB)  amplif
-------------------------------------------------------------------
master              256    7681   3.3        730        899    1.23
rebalance           256    7711   3.3        730        884    1.21
single file        1024    4161   7.2        730       3168    4.34
per-slice file     1024    4161   4.7        730       1653    2.26

5M rows
===================================================================
                 nbatch  memory  time  size (MB)  temp (MB)  amplif
-------------------------------------------------------------------
master             2048   36353    22       3652       5276    1.44
rebalance           512   16515    18       3652       4169    1.14
single file        4096    4353   156       3652      53897   14.76
per-slice file     4096    4353    28       3652       8106    2.21

10M rows
===================================================================
                 nbatch  memory  time  size (MB)  temp (MB)  amplif
-------------------------------------------------------------------
master             4096   69121    61       7303      10556    1.45
rebalance           512   24326    46       7303       7405    1.01
single file        8192    4636   762       7303     211234   28.92
per-slice file     8192    4636    65       7303      16278    2.23

25M rows
===================================================================
                 nbatch  memory  time  size (MB)  temp (MB)  amplif
-------------------------------------------------------------------
master             8192  134657   190       7303      24279    1.33
rebalance          1024   36611   158       7303      20024    1.10
single file       16384    6011  4054       7303    1046174   57.32
per-slice file    16384    6011   207       7303      39073    2.14

50M rows
===================================================================
                 nbatch  memory  time  size (MB)  temp (MB)  amplif
-------------------------------------------------------------------
master            16384  265729   531      36500      48519    1.33
rebalance          2048   53241   447      36500      48077    1.32
single file           -       -     -      36500          -       -
per-slice file    32768    8125   451      36500      78662    2.16

From those numbers it's pretty clear that the per-slice overflow file does by far the best job in enforcing work_mem and minimizing the amount of data spilled to temp files.
It does write a bit more data than both master and the simple rebalancing, but that's the cost for enforcing work_mem more strictly. It's generally a bit slower than those two approaches, although on the largest scale it's actually a bit faster than master. I think that's pretty acceptable, considering this is meant to address extreme underestimates where we currently just eat memory. The case with single overflow file performs rather poorly - I haven't even collected data from the largest scale, but considering it spilled 1TB of temp files with a dataset half the size, that's not an issue. (Note that this does not mean it needs 1TB of temp space, those writes are spread over time and the files are created/closed as we go. The system only has ~100GB of free disk space.) Gunther, could you try the v2 and v4 patches on your data set? That would be an interesting data point, I think. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
Hi all, I am connecting to a discussion back from April this year. My data has grown and now I am running into new out of memory situations. Meanwhile the world turned from 11.2 to 11.5 which I just installed only to find the same out of memory error.
Have any of the things discussed and proposed, especially this last one by Tomas Vondra, been applied to the 11 releases? Should I try these older patches from April?
regards,
-Gunther
For what it is worth, this is what I am getting:
TopMemoryContext: 67424 total in 5 blocks; 7184 free (7 chunks); 60240 used pgstat TabStatusArray lookup hash table: 8192 total in 1 blocks; 416 free (0 chunks); 7776 used TopTransactionContext: 8192 total in 1 blocks; 7720 free (1 chunks); 472 used Operator lookup cache: 24576 total in 2 blocks; 10760 free (3 chunks); 13816 used TableSpace cache: 8192 total in 1 blocks; 2096 free (0 chunks); 6096 used Type information cache: 24352 total in 2 blocks; 2624 free (0 chunks); 21728 used RowDescriptionContext: 8192 total in 1 blocks; 6896 free (0 chunks); 1296 used MessageContext: 8388608 total in 11 blocks; 3094872 free (4 chunks); 5293736 used JoinRelHashTable: 16384 total in 2 blocks; 5576 free (1 chunks); 10808 used Operator class cache: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used smgr relation table: 32768 total in 3 blocks; 12720 free (8 chunks); 20048 used TransactionAbortContext: 32768 total in 1 blocks; 32512 free (0 chunks); 256 used Portal hash: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used TopPortalContext: 8192 total in 1 blocks; 7664 free (0 chunks); 528 used PortalContext: 1024 total in 1 blocks; 624 free (0 chunks); 400 used: ExecutorState: 202528536 total in 19 blocks; 433464 free (12 chunks); 202095072 used HashTableContext: 8192 total in 1 blocks; 7656 free (0 chunks); 536 used HashBatchContext: 10615104 total in 261 blocks; 7936 free (0 chunks); 10607168 used HashTableContext: 8192 total in 1 blocks; 7688 free (1 chunks); 504 used HashBatchContext: 13079304 total in 336 blocks; 7936 free (0 chunks); 13071368 used TupleSort main: 49208 total in 3 blocks; 8552 free (7 chunks); 40656 used Caller tuples: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used 
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used Subplan HashTable Temp Context: 1024 total in 1 blocks; 768 free (0 chunks); 256 used Subplan HashTable Context: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used Subplan HashTable Temp Context: 1024 total in 1 blocks; 768 free (0 chunks); 256 used Subplan HashTable Context: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used Subplan HashTable Temp Context: 1024 total in 1 blocks; 768 free (0 chunks); 256 used Subplan HashTable Context: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used Subplan HashTable Temp Context: 1024 total in 1 blocks; 768 free (0 chunks); 256 used Subplan HashTable Context: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used Subplan HashTable Temp Context: 1024 total in 1 blocks; 768 free (0 chunks); 256 used Subplan HashTable Context: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 
used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7360 free (0 chunks); 832 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 1107296256 total in 142 blocks; 6328 free (101 chunks); 1107289928 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used 1 more child contexts containing 8192 total in 1 blocks; 7936 free (0 chunks); 256 used Relcache by OID: 16384 total in 2 blocks; 2472 free (2 chunks); 13912 used CacheMemoryContext: 1113488 total in 14 blocks; 16776 free (0 chunks); 1096712 used index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: docsubjh_sjrcode_ndx index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: docsubjh_sjrclass_ndx index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: docsubjh_scopeiid_ndx index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: docsubjh_dociid_ndx index info: 4096 total in 3 blocks; 2064 free (2 chunks); 2032 used: role_telecom_idx index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: role_addr_fkidx index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: role_id_fkidx index info: 2048 total in 2 blocks; 696 free (1 chunks); 1352 used: role_id_idx index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: role_name_fkidx index info: 4096 total in 3 blocks; 2064 free (2 chunks); 2032 used: entity_telecom_idx index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: entity_id_fkidx index info: 2048 total in 2 blocks; 696 free (1 chunks); 1352 used: entity_id_idx index info: 2048 total in 2 blocks; 624 free (1 chunks); 1424 used: entity_det_code_idx index info: 4096 total in 3 blocks; 2016 free (2 chunks); 2080 used: entity_code_nodash_idx index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: entity_pkey index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: connect_rule_pkey index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: role_context_idx index info: 2048 total in 2 blocks; 640 free (2 chunks); 1408 used: role_partitions index info: 2048 total in 2 blocks; 640 free (2 chunks); 1408 used: role_scoper_idx index info: 2048 total in 2 blocks; 640 free (2 chunks); 1408 used: role_player_idx index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: role__pkey index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_toast_2619_index index info: 2048 total in 2 blocks; 592 free (1 chunks); 
1456 used: pg_constraint_conrelid_contypid_conname_index index info: 2048 total in 2 blocks; 624 free (1 chunks); 1424 used: participation_act_idx index info: 2048 total in 2 blocks; 624 free (1 chunks); 1424 used: participation_role_idx index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: participation_pkey index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_statistic_ext_relid_index index info: 2048 total in 2 blocks; 624 free (1 chunks); 1424 used: doc_ndx_internaiddoctype index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_toast_2618_index index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_index_indrelid_index relation rules: 827392 total in 104 blocks; 2400 free (1 chunks); 824992 used: v_documentsubjecthistory index info: 2048 total in 2 blocks; 648 free (2 chunks); 1400 used: pg_db_role_setting_databaseid_rol_index index info: 2048 total in 2 blocks; 624 free (2 chunks); 1424 used: pg_opclass_am_name_nsp_index index info: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used: pg_foreign_data_wrapper_name_index index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_enum_oid_index index info: 2048 total in 2 blocks; 680 free (2 chunks); 1368 used: pg_class_relname_nsp_index index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_foreign_server_oid_index index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_publication_pubname_index index info: 2048 total in 2 blocks; 592 free (3 chunks); 1456 used: pg_statistic_relid_att_inh_index index info: 2048 total in 2 blocks; 680 free (2 chunks); 1368 used: pg_cast_source_target_index index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_language_name_index index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_transform_oid_index index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_collation_oid_index index info: 3072 total in 2 blocks; 1136 free (2 chunks); 1936 used: pg_amop_fam_strat_index index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_index_indexrelid_index index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_ts_template_tmplname_index index info: 2048 total in 2 blocks; 704 free (3 chunks); 1344 used: pg_ts_config_map_index index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_opclass_oid_index index info: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used: pg_foreign_data_wrapper_oid_index index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_event_trigger_evtname_index index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_statistic_ext_name_index index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_publication_oid_index index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_ts_dict_oid_index index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_event_trigger_oid_index index info: 3072 total in 2 blocks; 1216 free (3 chunks); 1856 used: pg_conversion_default_index index info: 3072 total in 2 blocks; 1136 free (2 chunks); 1936 used: pg_operator_oprname_l_r_n_index index info: 2048 total in 2 blocks; 680 free (2 chunks); 1368 used: pg_trigger_tgrelid_tgname_index index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_enum_typid_label_index index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_ts_config_oid_index index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_user_mapping_oid_index index info: 
2048 total in 2 blocks; 704 free (3 chunks); 1344 used: pg_opfamily_am_name_nsp_index index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_foreign_table_relid_index index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_type_oid_index index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_aggregate_fnoid_index index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_constraint_oid_index index info: 2048 total in 2 blocks; 680 free (2 chunks); 1368 used: pg_rewrite_rel_rulename_index index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_ts_parser_prsname_index index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_ts_config_cfgname_index index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_ts_parser_oid_index index info: 2048 total in 2 blocks; 728 free (1 chunks); 1320 used: pg_publication_rel_prrelid_prpubid_index index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_operator_oid_index index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_namespace_nspname_index index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_ts_template_oid_index index info: 2048 total in 2 blocks; 624 free (2 chunks); 1424 used: pg_amop_opr_fam_index index info: 2048 total in 2 blocks; 672 free (3 chunks); 1376 used: pg_default_acl_role_nsp_obj_index index info: 2048 total in 2 blocks; 704 free (3 chunks); 1344 used: pg_collation_name_enc_nsp_index index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_publication_rel_oid_index index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_range_rngtypid_index index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_ts_dict_dictname_index index info: 2048 total in 2 blocks; 680 free (2 chunks); 1368 used: pg_type_typname_nsp_index index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_opfamily_oid_index index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_statistic_ext_oid_index index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_class_oid_index index info: 2048 total in 2 blocks; 624 free (2 chunks); 1424 used: pg_proc_proname_args_nsp_index index info: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used: pg_partitioned_table_partrelid_index index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_transform_type_lang_index index info: 2048 total in 2 blocks; 680 free (2 chunks); 1368 used: pg_attribute_relid_attnum_index index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_proc_oid_index index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_language_oid_index index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_namespace_oid_index index info: 3072 total in 2 blocks; 1136 free (2 chunks); 1936 used: pg_amproc_fam_proc_index index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_foreign_server_name_index index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_attribute_relid_attnam_index index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_conversion_oid_index index info: 2048 total in 2 blocks; 728 free (1 chunks); 1320 used: pg_user_mapping_user_server_index index info: 2048 total in 2 blocks; 728 free (1 chunks); 1320 used: pg_subscription_rel_srrelid_srsubid_index index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_sequence_seqrelid_index index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 
used: pg_conversion_name_nsp_index index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_authid_oid_index index info: 2048 total in 2 blocks; 728 free (1 chunks); 1320 used: pg_auth_members_member_role_index 10 more child contexts containing 17408 total in 17 blocks; 6080 free (10 chunks); 11328 used WAL record construction: 49768 total in 2 blocks; 6368 free (0 chunks); 43400 used PrivateRefCount: 8192 total in 1 blocks; 2624 free (0 chunks); 5568 used MdSmgr: 8192 total in 1 blocks; 6408 free (0 chunks); 1784 used LOCALLOCK hash: 16384 total in 2 blocks; 4600 free (2 chunks); 11784 used Timezones: 104120 total in 2 blocks; 2624 free (0 chunks); 101496 used ErrorContext: 8192 total in 1 blocks; 7936 free (4 chunks); 256 used Grand total: 1345345736 bytes in 1209 blocks; 4529600 free (270 chunks); 1340816136 used
On 4/28/2019 10:19, Tomas Vondra wrote:
On Wed, Apr 24, 2019 at 02:36:33AM +0200, Tomas Vondra wrote:
...
I still think the idea with an "overflow batch" is worth considering,
because it'd allow us to keep the memory usage within work_mem. And
after getting familiar with the hash join code again (haven't messed
with it since 9.5 or so) I think it should not be all that difficult.
I'll give it a try over the weekend if I get bored for a while.
OK, so I took a stab at this, and overall it seems to be workable. The
patches I have are nowhere near committable, but I think the approach
works fairly well - the memory is kept in check, and the performance is
comparable to the "balancing" approach tested before.
To explain it a bit, the idea is that we can compute how many BufFile
structures we can keep in memory - we can't use more than work_mem/2 for
that, because then we'd mostly eliminate space for the actual data. For
example with 4MB, we know we can keep 128 batches - we need 128 for the
outer side and 128 for the inner side, so 256 BufFiles in total, and
256 * 8kB = 2MB.
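Just to make that arithmetic easy to play with, here is a minimal
standalone sketch (my own illustration, not code from the patches; the
8kB per BufFile and the work_mem/2 cap are the assumptions stated above):

#include <stdio.h>

/* Sketch only: how many batches fit if BufFiles (8kB each, one per
 * batch and side) may use at most half of work_mem. */
static int
max_inmemory_batches(long work_mem_kb)
{
    long budget_kb = work_mem_kb / 2;  /* at most work_mem/2 for BufFiles */
    long per_batch_kb = 2 * 8;         /* outer + inner BufFile, 8kB each */

    return (int) (budget_kb / per_batch_kb);
}

int
main(void)
{
    /* with work_mem = 4MB this prints 128, matching the example above */
    printf("%d\n", max_inmemory_batches(4 * 1024));
    return 0;
}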
And then, we just increase the number of batches, but instead of adding
the BufFile entries, we split the batches into slices that we can keep in
memory (say, 128 batches at a time). And we keep BufFiles only for the
current slice, plus an "overflow file" for the other slices. After processing a slice,
we simply switch to the next one, and use the overflow file as a temp
file for the first batch - we redistribute it into the other batches in
the slice and another overflow file.
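To visualize the routing, here is a tiny self-contained sketch; the names
and the modulo-based batch assignment are invented for illustration and do
not match the actual patch:

#include <stdio.h>
#include <stdint.h>

#define NBATCH            1024
#define BATCHES_PER_SLICE 128

/* Sketch only: route a tuple by hash either to one of the current
 * slice's in-memory batch files, or to the overflow file that holds
 * everything belonging to later slices. */
static void
route_tuple(uint32_t hashvalue, int curr_slice)
{
    int batchno = (int) (hashvalue % NBATCH);
    int sliceno = batchno / BATCHES_PER_SLICE;

    if (sliceno == curr_slice)
        printf("hash %u -> in-memory batch file %d\n",
               (unsigned) hashvalue, batchno % BATCHES_PER_SLICE);
    else
        printf("hash %u -> overflow file (belongs to slice %d)\n",
               (unsigned) hashvalue, sliceno);
}

int
main(void)
{
    route_tuple(42, 0);   /* batch 42, slice 0: in-memory file */
    route_tuple(777, 0);  /* batch 777, slice 6: overflow      */
    return 0;
}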
That's what the v3 patch (named 'single overflow file') does. It does
work, but unfortunately it significantly inflates the amount of data
written to temporary files. Assume we need e.g. 1024 batches, but only
128 fit into memory. That means we'll need 8 slices, and during the
first pass we'll handle 1/8 of the data and write 7/8 to the overflow
file. Then after processing the slice and switching to the next one, we
repeat this dance - 1/8 gets processed, 6/8 written to another overflow
file. So essentially we "forward" about
7/8 + 6/8 + 5/8 + ... + 1/8 = 28/8 = 3.5
times the amount of data between slices, and we need to re-shuffle the data in each slice,
which amounts to additional 1x data. That's pretty significant overhead,
as will be clear from the measurements I'll present shortly.
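The same estimate can be double-checked in a few lines (again my own
back-of-the-envelope check, assuming equal-sized slices): a tuple destined
for slice s gets written to an overflow file s times plus once into its
batch file, so the expected amplification here is 3.5 + 1 = 4.5x:

#include <stdio.h>

int
main(void)
{
    int    nslices = 8;   /* 1024 batches, 128 in memory at a time */
    double writes  = 0.0;

    /* tuple for slice s: s overflow writes + 1 batch-file write,
     * and each slice holds 1/nslices of the data */
    for (int s = 0; s < nslices; s++)
        writes += (s + 1) / (double) nslices;

    printf("expected write amplification: %.2fx\n", writes); /* 4.50x */
    return 0;
}

That is in the same ballpark as the 4.34x measured for the 1024-batch case
in the results below.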
But luckily, there's a simple solution to this - instead of writing the
data into a single overflow file, we can create one overflow file for
each slice. That will leave us with the ~1x of additional writes when
distributing data into batches in the current slice, but it eliminates
the main source of write amplification - avalanche-like forwarding of
data between slices.
This relaxes the memory limit a bit again, because we can't really keep
the number of overflow files constrained by work_mem, but we should only
need a few of them (far fewer than when adding one file per batch right
away). For example with 128 in-memory batches, this reduces the amount
of necessary memory 128x.
And this is what v4 (per-slice overflow file) does, pretty much.
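As a rough consistency check - my own arithmetic, not from the patch, and
assuming a single overflow file per slice (it might be one per side) - the
BufFile overhead for the 50M-row per-slice case below works out like this:

#include <stdio.h>

int
main(void)
{
    int nbatch = 32768;  /* the 50M-row per-slice case below */
    int inmem  = 128;    /* batches kept in memory           */
    int nslice = nbatch / inmem;

    /* if we kept one file per batch and side open at this nbatch */
    long allfiles_kb = 2L * nbatch * 8;           /* 524288 kB */
    /* current slice's files plus one overflow file per slice    */
    long perslice_kb = (2L * inmem + nslice) * 8; /*   4096 kB */

    printf("all batch files: %ld kB in BufFiles\n", allfiles_kb);
    printf("per-slice:       %ld kB in BufFiles\n", perslice_kb);
    return 0;
}

That is roughly consistent with the ~8MB the per-slice variant reports in
the 50M test below (4MB of work_mem plus the files).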
Two more comments, regarding memory accounting in previous patches. It
was a bit broken, because we actually need 2x the number of BufFiles. We
needed nbatch files for outer side and nbatch files for inner side, but
we only considered one of those - both when deciding when to increase
the number of batches / increase spaceAllowed, and when reporting the
memory usage. So with a large number of batches the reported amount of
used memory was roughly 1/2 of the actual value :-/
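Expressed as a trivial computation (illustrative only, using the
8kB-per-BufFile figure from above):

#include <stdio.h>

int
main(void)
{
    int  nbatch = 16384;                  /* e.g. the 50M-row master case */
    long one_side   = (long) nbatch * 8;  /* what was being accounted     */
    long both_sides = 2 * one_side;       /* what is actually allocated   */

    printf("counted: %ld kB, actual: %ld kB\n", one_side, both_sides);
    return 0;
}

The 262144 kB of files plus the hash table is right around the 265729 kB
reported for master in the 50M-row case below.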
The memory accounting was a bit bogus for another reason - spaceUsed
simply tracks the amount of memory for hash table contents. But at the
end we were simply adding the current space for BufFile stuff, ignoring
the fact that that's likely much larger than when the spacePeak value
got stored. For example we might have kept early spaceUsed when it was
almost work_mem, and then added the final large BufFile allocation.
I've fixed both issues in the patches attached to this message. It does
not make a huge difference in practice, but it makes it easier to
compare values between patches.
Now, some test results - I've repeated the simple test with uniform data
set, which is pretty much ideal for hash joins (no unexpectedly large
batches that can't be split, etc.). I've done this with 1M, 5M, 10M, 25M
and 50M rows in the large table (which gets picked for the "hash" side),
and measured how much memory gets used, how many batches, how long it
takes and how much data gets written to temp files.
See the hashjoin-test.sh script for more details.
So, here are the results with work_mem = 4MB (so the number of in-memory
batches for the last two entries is 128). The columns are:
* nbatch - the final number of batches
* memory - memory usage, as reported by explain analyze
* time - duration of the query (without explain analyze) in seconds
* size - size of the large table
* temp - amount of data written to temp files
* amplif - write amplification (temp / size)
1M rows
===================================================================
nbatch memory time size (MB) temp (MB) amplif
-------------------------------------------------------------------
master 256 7681 3.3 730 899 1.23
rebalance 256 7711 3.3 730 884 1.21
single file 1024 4161 7.2 730 3168 4.34
per-slice file 1024 4161 4.7 730 1653 2.26
5M rows
===================================================================
nbatch memory time size (MB) temp (MB) amplif
-------------------------------------------------------------------
master 2048 36353 22 3652 5276 1.44
rebalance 512 16515 18 3652 4169 1.14
single file 4096 4353 156 3652 53897 14.76
per-slice file 4096 4353 28 3652 8106 2.21
10M rows
===================================================================
nbatch memory time size (MB) temp (MB) amplif
-------------------------------------------------------------------
master 4096 69121 61 7303 10556 1.45
rebalance 512 24326 46 7303 7405 1.01
single file 8192 4636 762 7303 211234 28.92
per-slice file 8192 4636 65 7303 16278 2.23
25M rows
===================================================================
nbatch memory time size (MB) temp (MB) amplif
-------------------------------------------------------------------
master 8192 134657 190 7303 24279 1.33
rebalance 1024 36611 158 7303 20024 1.10
single file 16384 6011 4054 7303 1046174 57.32
per-slice file 16384 6011 207 7303 39073 2.14
50M rows
===================================================================
nbatch memory time size (MB) temp (MB) amplif
-------------------------------------------------------------------
master 16384 265729 531 36500 48519 1.33
rebalance 2048 53241 447 36500 48077 1.32
single file - - - 36500 - -
per-slice file 32768 8125 451 36500 78662 2.16
From those numbers it's pretty clear that per-slice overflow file does
by far the best job in enforcing work_mem and minimizing the amount of
data spilled to temp files. It does write a bit more data than both
master and the simple rebalancing, but that's the cost for enforcing
work_mem more strictly. It's generally a bit slower than those two
approaches, although on the largest scale it's actually a bit faster
than master. I think that's pretty acceptable, considering this is meant
to address extreme underestimates where we currently just eat memory.
The case with a single overflow file performs rather poorly - I haven't
even collected data from the largest scale, but considering it spilled
1TB of temp files with a dataset half the size, that's not an issue.
(Note that this does not mean it needs 1TB of temp space, those writes
are spread over time and the files are created/closed as we go. The
system only has ~100GB of free disk space.)
Gunther, could you try the v2 and v4 patches on your data set? That
would be an interesting data point, I think.
regards
OK, I went back through that old thread, and I noticed an early opinion by a certain Peter <pmc at citylink> who said that I should provision some swap space. Since I had plenty of disk and no other option, I tried that. And it did some magic. Here is the steady state now:
top - 14:07:32 up 103 days, 9:57, 5 users, load average: 1.33, 1.05, 0.54
Tasks: 329 total, 2 running, 117 sleeping, 0 stopped, 0 zombie
%Cpu(s): 31.0 us, 11.4 sy, 0.0 ni, 35.3 id, 22.3 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 7910376 total, 120524 free, 2174940 used, 5614912 buff/cache
KiB Swap: 16777212 total, 16777212 free, 0 used. 3239724 avail Mem

  PID USER     PR NI    VIRT  RES  SHR S %CPU %MEM   TIME+ COMMAND
 5068 postgres 20  0 4352496 4.0g 2.0g R 76.4 52.6 3:01.39 postgres: postgres integrator [local] INSERT
  435 root     20  0       0    0    0 S  4.0  0.0 10:52.38 [kswapd0]
and the nice thing is, the backend server process appears to be bounded at 4GB, so there isn't really a "memory leak". And also, the swap space isn't actually being used. This may have to do with these vm.overcommit sysctl settings:
- vm.overcommit_memory = 2 -- values are:
  - 0 -- estimate free memory
  - 1 -- always assume there is enough memory
  - 2 -- no over-commit; allocate only inside the following two parameters
- vm.overcommit_kbytes = 0 -- how many kB above swap can be over-committed; EITHER this OR:
- vm.overcommit_ratio = 50 -- percent of main memory that can be committed over swap,
  - with 0 swap, that percent of memory can be committed
  - i.e., with this, of 8 GB, 4 GB are reserved for buffer cache
  - not a good idea probably
  - at least we should allow 75% committed, i.e., 6 GB of 8 GB, leaving:
    - 2 GB of buffer cache
    - 2 GB of shared buffers
    - 4 GB of all other memory
I have vm.overcommit_memory = 2, _kbytes = 0, _ratio = 50. So this means with _ratio = 50 I can commit 50% of memory, i.e. 4 GB, and this is exactly what the server process wants. So with little impact on the available buffer cache I am in a fairly good position now. The swap (which in my case I set to 2 x main memory = 16 GB) serves as a buffer to smooth out this peak usage without ever actually paging.
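For what it is worth, the kernel documents the commit limit under vm.overcommit_memory = 2 as swap plus overcommit_ratio percent of RAM (or overcommit_kbytes in place of the percentage). A tiny sketch with the numbers from the top output above - the variable names are mine:

#include <stdio.h>

int
main(void)
{
    long ram_kb  = 7910376;   /* KiB Mem total, from top above */
    long swap_kb = 16777212;  /* KiB Swap total                */
    int  ratio   = 50;        /* vm.overcommit_ratio           */

    long limit_kb = swap_kb + ram_kb * ratio / 100;

    printf("CommitLimit: %ld kB (~%.1f GB)\n",
           limit_kb, limit_kb / (1024.0 * 1024.0));
    return 0;
}

So with the 16 GB of swap added, the commit limit comes to about 19.8 GB, comfortably above what the backend asks for - which matches the observation that the swap is never actually used.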
I suppose even without swap I could have set vm.overcommit_ratio = 75, and I notice now that I had already noted as much (the bullet points above are my own notes).
Anyway, for now, I am good. Thank you very much.
regards,
-Gunther
Gunther <raj@gusw.net> writes:
> Hi all, I am connecting to a discussion back from April this year. My
> data has grown and now I am running into new out of memory situations.

It doesn't look like this has much of anything to do with the hash-table
discussion. The big hog is an ExprContext:

> ExprContext: 1107296256 total in 142 blocks; 6328 free (101 chunks);
> 1107289928 used

So there's something leaking in there, but this isn't enough info
to guess what.

            regards, tom lane
Thanks Tom, yes I'd say it's using a lot of memory, but I wouldn't call
it a "leak" as it doesn't grow during the 30 min or so that this query
runs. It explodes to 4GB and then stays flat until done.

Yes, and this time the query is super complicated, with many joins and
tables involved. The query plan has 100 lines. Not easy to share for a
reproduction, and I have my issue under control by adding some swap
just in case. The swap space was never actually used.

thanks,
-Gunther

On 8/23/2019 10:20, Tom Lane wrote:
> Gunther <raj@gusw.net> writes:
>> Hi all, I am connecting to a discussion back from April this year. My
>> data has grown and now I am running into new out of memory situations.
> It doesn't look like this has much of anything to do with the hash-table
> discussion. The big hog is an ExprContext:
>
>> ExprContext: 1107296256 total in 142 blocks; 6328 free (101 chunks);
>> 1107289928 used
> So there's something leaking in there, but this isn't enough info
> to guess what.
>
> regards, tom lane
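(A simple way to watch that explode-then-plateau behavior from the
outside is to sample the backend's memory while the query runs - a
sketch, with a made-up backend pid:

  PID=12345                      # hypothetical backend pid
  while kill -0 "$PID" 2>/dev/null; do
      # resident set size and swap usage, sampled every 5 seconds
      grep -E '^Vm(RSS|Swap)' "/proc/$PID/status"
      sleep 5
  done

Keep in mind VmRSS also counts the shared_buffers pages the backend has
touched.)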
On Sat, Aug 24, 2019 at 11:40:09AM -0400, Gunther wrote:
> Thanks Tom, yes I'd say it's using a lot of memory, but wouldn't call
> it "leak" as it doesn't grow during the 30 min or so that this query
> runs. It explodes to 4GB and then stays flat until done.

Well, the memory context stats you've shared show this, however:

  total: 1345345736 bytes in 1209 blocks; 4529600 free (270 chunks);
  1340816136 used

That's only ~1.3GB, and ~1.1GB of that is the expression context. So
when you say 4GB, when does that happen, and can you share stats
showing the state at that point?

> Yes, and this time the query is super complicated with many joins and
> tables involved. The query plan has 100 lines. Not easy to share for
> reproduce and I have my issue under control by adding some swap just
> in case. The swap space was never actually used.

Still, without the query plan we can hardly make any guesses about
what might be the issue.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
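(To capture the memory context stats at a chosen moment, rather than
only when an allocation actually fails, the usual trick is to attach a
debugger to the backend and call MemoryContextStats - a sketch, with a
made-up pid:

  gdb -p 12345
  (gdb) call MemoryContextStats(TopMemoryContext)
  (gdb) detach
  (gdb) quit

The per-context breakdown is written to the backend's stderr, so it
shows up in the server log.)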
On Fri, Aug 23, 2019 at 09:17:38AM -0400, Gunther wrote:
> Hi all, I am connecting to a discussion back from April this year. My
> data has grown and now I am running into new out of memory situations.
> Meanwhile the world turned from 11.2 to 11.5, which I just installed
> only to find the same out of memory error.

As Tom already said, this seems like a quite independent issue. Next
time it'd be better to report it in a new thread rather than mixing it
up with the old discussion.

> Have any of the things discussed and proposed, especially this last
> one by Tomas Vondra, been applied to the 11 releases? Should I try
> these older patches from April?

Unfortunately, no. We're still discussing what the right fix would be
(it's rather tricky, and the patches I shared were way too experimental
for that). But I'm pretty sure that whatever we end up doing will be
way too invasive to backpatch, i.e. the older branches will likely have
this issue until EOL.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sun, Apr 14, 2019 at 3:51 PM Gunther <raj@gusw.net> wrote:
>
> For weeks now, I am banging my head at an "out of memory" situation.
> There is only one query I am running on an 8 GB system; whatever I
> try, I get knocked out on this out of memory. It is extremely
> impenetrable to understand and fix this error. I guess I could add a
> swap file, and then I would have to take the penalty of swapping. But
> how can I actually address an out of memory condition if the system
> doesn't tell me where it is happening?
>
> We can't really see anything too worrisome. There is always lots of
> memory used by cache, which could have been mobilized. The only
> possible explanation I can think of is that in the moment of the
> crash the memory utilization suddenly skyrocketed in less than a
> second, so that the 2 second vmstat interval wouldn't show it???
> Nah.
>
> I have already much reduced work_mem, which has helped in some other
> cases before. Now I am going to reduce shared_buffers, but that seems
> counter-intuitive because we are sitting on all that cache memory
> unused!
>
> Might this be a bug? It feels like a bug. It feels like those out of
> memory issues should be handled more gracefully (garbage collection
> attempt?) and that somehow there should be more information so the
> person can do anything about it.

I kind of agree that nothing according to vmstat suggests you have a
problem. One thing you left out is the precise mechanics of the
failure; is the database getting nuked by the oom killer? Do you have
the logs?

*) What are the values of shared_buffers, work_mem and
   maintenance_work_mem?
*) Is this a 32 bit build? (I'm guessing no, but worth asking.)
*) I see that you've disabled swap. Maybe it should be enabled?
*) Can you get the query to run through? An 'explain analyze' might
   point to gross misses in the plan; say, sort memory overuse.
*) If you're still getting failures, maybe we need to look at the
   sampling frequency of memory usage.
*) iowait is super high.
*) I see optimization potential in this query; explain analyze would
   help here too.

merlin
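(The three settings asked about above can be pulled in one query, for
example:

  SELECT name, setting, unit
    FROM pg_settings
   WHERE name IN ('shared_buffers', 'work_mem', 'maintenance_work_mem');

Nothing is assumed here beyond a working connection to the instance.)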
On Tue, Oct 8, 2019 at 12:44 PM Merlin Moncure <mmoncure@gmail.com> wrote:
> On Sun, Apr 14, 2019 at 3:51 PM Gunther <raj@gusw.net> wrote:
> >
> > For weeks now, I am banging my head at an "out of memory" situation.
> > There is only one query I am running on an 8 GB system; whatever I
> > try, I get knocked out on this out of memory. [...]
>
> I kind of agree that nothing according to vmstat suggests you have a
> problem. One thing you left out is the precise mechanics of the
> failure; is the database getting nuked by the oom killer? Do you have
> the logs?

oops, I missed quite a bit of context upthread. sorry for repeat noise.

merlin