Re: Out of Memory errors are frustrating as heck! - Mailing list pgsql-performance

From Gunther
Subject Re: Out of Memory errors are frustrating as heck!
Date
Msg-id 2682858c-7916-11fe-d7e8-7c649afbbc74@gusw.net
Whole thread Raw
In response to Re: Out of Memory errors are frustrating as heck!  (Jeff Janes <jeff.janes@gmail.com>)
Responses Re: Out of Memory errors are frustrating as heck!
Re: Out of Memory errors are frustrating as heck!
Re: Out of Memory errors are frustrating as heck!
List pgsql-performance

OK Guys, you are very kind to continue taking an interest in this matter.

I will try what I can to help squish the bug.

Tomas Vondra just added a good idea that explains why I get the out of memory with still having so much cache available:

# sysctl vm.overcommit_memory
vm.overcommit_memory = 2
# sysctl vm.overcommit_ratio
vm.overcommit_ratio = 50

as he predicted.

# cat /proc/meminfo |grep Commit
CommitLimit:     3955192 kB
Committed_AS:    2937352 kB

So I thing that explains why it turns into an out of memory error. We don't worry or wonder about that any more. I will change that parameter in the future to allow for some spikes. But it's not going to resolve the underlying memory leak issue.

Now I run explain analyze SELECT ... without the INSERT. 

integrator=#  \set VERBOSITY verbose
integrator=#
integrator=# \pset pager off
Pager usage is off.
integrator=# \pset format unaligned
Output format is unaligned.
integrator=# \set VERBOSITY verbose
integrator=#
integrator=# SET ENABLE_NESTLOOP TO OFF;
SET
integrator=# explain analyze SELECT * FROM reports.v_BusinessOperation;
ERROR:  53200: out of memory
DETAIL:  Failed on request of size 32800 in memory context "HashBatchContext".
LOCATION:  MemoryContextAlloc, mcxt.c:798

And since that failed already, I guess we don't need to worry about the temporary table insert.

About adding LIMIT, I don't think it makes sense in the outer query, since the error is probably happening earlier. I did put a LIMIT 100 on one of the tables we join to, and it helped. But that doesn't really tell us anything I think.

Then yes, I can try the backtrace with the NLs enabled. It will just take a long long time and unfortunately it is extremely likely that I lose the console and then will be unable to get back to it. OK, screen(1) resolves that problem too. Will do, after I reported the above.

But now you have already produced more ideas ...

  Maybe it is memory for trigger or constraint checking, although I don't
  know why that would appear instantly.  What triggers or constraints do you
  have on businessoperation? 

Yeah, that would be my guess too. If I had to guess, something likely gets
confused and allocates memory in es_query_ctx instead of the per-tuple
context (es_per_tuple_exprcontext).

Triggers, constraints and expr evaluation all seem like a plausible
candidates. It's going to be hard to nail the exact place, though

I think triggers and constraints is ruled out, because the problem happens without the INSERT.

That leaves us with expression evaluation. And OK, now you really wanna see the query, although it should be in the plan too. But for what it is worth:

SELECT DISTINCT       documentInternalId, is_current,documentId,documentTypeCode,subjectRoleInternalId,subjectEntityInternalId,subjectEntityId,subjectEntityIdRoot,subjectEntityName,subjectEntityTel,subjectEntityEmail,otherEntityInternalId,confidentialityCode,actInternalId, code_code as operationCode, code_displayName AS operationName,operationQualifierCode,operationQualifierName,approvalNumber,approvalNumberSystem,approvalStateCode,approvalStateCodeSystem,approvalEffectiveTimeLow,approvalEffectiveTimeHigh,approvalStatusCode,licenseCode,agencyId,agencyName,productItemCode,productInternalId FROM reports.DocumentInformationSubject  LEFT OUTER JOIN (SELECT documentInternalId, documentId, actInternalId,                  subjectEntityCode as productItemCode,		  subjectEntityInternalId as productInternalId	     FROM reports.DocumentInformationSubject	    WHERE participationTypeCode = 'PRD') prd   USING(documentInternalId, documentId, actInternalId) LEFT OUTER JOIN (   SELECT documentInternalId,          q.code_code AS operationQualifierCode, 	   q.code_displayName AS operationQualifierName, 	   r.targetInternalId AS actInternalId,   actInternalId AS approvalInternalId, 	   an.extension AS approvalNumber, 	   an.root AS approvalNumberSystem,   qs.subjectEntityCode AS approvalStateCode,   qs.subjectEntityCodeSystem AS approvalStateCodeSystem, 	   qs.effectivetime_low AS approvalEffectiveTimeLow, 	   qs.effectivetime_high AS approvalEffectiveTimeHigh, 	   qs.statusCode AS approvalStatusCode,   qs.code_code AS licenseCode,   agencyId.extension AS agencyId,   agencyName.trivialName AS agencyName     FROM reports.DocumentInformation q     LEFT OUTER JOIN (SELECT * FROM reports.DocumentInformationSubject WHERE participationTypeCode = 'AUT') qs       USING(documentInternalId, actInternalId)     INNER JOIN integrator.ActRelationship r        ON(    r.sourceInternalId = actInternalId  	   AND r.typeCode = 'SUBJ')     LEFT OUTER JOIN integrator.Act_id an USING(actInternalId)     LEFT OUTER JOIN integrator.Entity_id agencyId ON(agencyId.entityInternalId = otherEntityInternalId)     LEFT OUTER JOIN reports.BestName agencyName ON(agencyName.entityInternalId = otherEntityInternalId)    WHERE q.classCode = 'CNTRCT'      AND q.moodCode = 'EVN'      AND q.code_codeSystem = '2.16.840.1.113883.3.26.1.1'         ) q   USING(documentInternalId, actInternalId)WHERE classCode = 'ACT'   AND moodCode = 'DEF'   AND code_codeSystem = '2.16.840.1.113883.3.26.1.1'   AND participationTypeCode IN ('PPRF','PRF');

You see that the expressions are all just equal operations, some IN, nothing outlandish.

Now I will try what Tom Lane suggested. Here you go. And I have it stopped at this state, so if you want me to inspect anything else, I can do it.

With screen(1) I can be sure I won't lose my stuff when my internet goes down. Nice.

I have one screen session with 3 windows:

  1. psql
  2. gdb
  3. misc (vmstat, etc.)

Now I have let this run for a good long time while setting up my screen stuff. And then:

ps -x 

look for the postgres job with the EXPLAIN ... that's $PID, then:

gdb -p $PID

Then first I do

cont

but then it stops at SIGUSR1, because of the parallel workers signalling each other. 

handle SIGUSR1 nostop

suppresses that stopping. Then I break CTRL-C, and set the breakpoint where Tom Lane said:

b AllocSetAlloc

once it stops there I do

Breakpoint 1, AllocSetAlloc (context=0x1168230, size=8) at aset.c:715
715     {
(gdb) p context->name
$4 = 0x96ce5b "ExecutorState"

So I should even be able to set a conditional breakpoint.

(gdb) delete
Delete all breakpoints? (y or n) y
(gdb) b AllocSetAlloc if  strcmp(context->name, "ExecutorState") == 0
Breakpoint 2 at 0x848ed0: file aset.c, line 715.
(gdb) cont
(gdb) cont
Continuing.

Breakpoint 2, AllocSetAlloc (context=0x1168230, size=10) at aset.c:715
715     {
(gdb) cont
Continuing.

Program received signal SIGUSR1, User defined signal 1.

Breakpoint 2, AllocSetAlloc (context=0x1168230, size=152) at aset.c:715
715     {
(gdb) cont
Continuing.

Program received signal SIGUSR1, User defined signal 1.

Breakpoint 2, AllocSetAlloc (context=0x1168230, size=201) at aset.c:715
715     {
(gdb) cont
Continuing.

Breakpoint 2, AllocSetAlloc (context=0x1168230, size=8272) at aset.c:715
715     {
(gdb) p context->name
$8 = 0x96ce5b "ExecutorState"

Nice. Now the question is, am I at the place where memory gets squeezed? And I think yes. With top

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
31752 postgres  20   0 2772964   1.2g 329640 t   0.0 16.5   8:46.59 postgres: postgres integrator [local] EXPLAIN

I guess I should run this for a little longer. So I disable my breakpoints

(gdb) info breakpoints
Num     Type           Disp Enb Address            What
2       breakpoint     keep y   0x0000000000848ed0 in AllocSetAlloc at aset.c:715       stop only if  strcmp(context->name, "ExecutorState") == 0       breakpoint already hit 6 times
(gdb) disable 2
(gdb) cont
Continuing.

while watching top:

31752 postgres  20   0 2777060   1.3g 329920 D  33.2 17.9   8:52.07 postgres: postgres integrator [local] EXPLAIN

31752 postgres  20   0 2777060   1.4g 329920 D  33.2 17.9   8:52.07 postgres: postgres integrator [local] EXPLAIN

31752 postgres  20   0 2777060   1.5g 329920 D  33.2 17.9   8:52.07 postgres: postgres integrator [local] EXPLAIN

it went up pretty quick from 1.2 GB to 1.5 GB, but then it stopped growing fast, so now back to gdb and break:

^C
Program received signal SIGINT, Interrupt.
0x00007f048f336d71 in read () from /lib64/libpthread.so.0
(gdb) enable 2
(gdb) cont
Continuing.

Breakpoint 2, AllocSetAlloc (context=0x1168230, size=385) at aset.c:715
715     {

Now I give you a bt so we have something to look at:

#0  AllocSetAlloc (context=0x1168230, size=385) at aset.c:715
#1  0x000000000084e6cd in palloc (size=385) at mcxt.c:938
#2  0x000000000061019c in ExecHashJoinGetSavedTuple (file=file@entry=0x8bbc528, hashvalue=hashvalue@entry=0x7fff2e4ca76c,   tupleSlot=0x10856b8, hjstate=0x11688e0) at nodeHashjoin.c:1277
#3  0x0000000000610c83 in ExecHashJoinNewBatch (hjstate=0x11688e0) at nodeHashjoin.c:1042
#4  ExecHashJoinImpl (parallel=false, pstate=0x11688e0) at nodeHashjoin.c:539
#5  ExecHashJoin (pstate=0x11688e0) at nodeHashjoin.c:565
#6  0x00000000005fde68 in ExecProcNodeInstr (node=0x11688e0) at execProcnode.c:461
#7  0x000000000061ce4e in ExecProcNode (node=0x11688e0) at ../../../src/include/executor/executor.h:247
#8  ExecSort (pstate=0x11687d0) at nodeSort.c:107
#9  0x00000000005fde68 in ExecProcNodeInstr (node=0x11687d0) at execProcnode.c:461
#10 0x000000000061d2c4 in ExecProcNode (node=0x11687d0) at ../../../src/include/executor/executor.h:247
#11 ExecUnique (pstate=0x11685e0) at nodeUnique.c:73
#12 0x00000000005fde68 in ExecProcNodeInstr (node=0x11685e0) at execProcnode.c:461
#13 0x00000000005f75ba in ExecProcNode (node=0x11685e0) at ../../../src/include/executor/executor.h:247
#14 ExecutePlan (execute_once=<optimized out>, dest=0xcc60e0 <donothingDR>, direction=<optimized out>, numberTuples=0,   sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x11685e0, estate=0x1168340)   at execMain.c:1723
#15 standard_ExecutorRun (queryDesc=0x119b6d8, direction=<optimized out>, count=0, execute_once=<optimized out>) at execMain.c:364
#16 0x000000000059c6f8 in ExplainOnePlan (plannedstmt=plannedstmt@entry=0x1199a68, into=into@entry=0x0, es=es@entry=0x1141d48,   queryString=<optimized out>, params=0x0, queryEnv=queryEnv@entry=0x0, planduration=0x7fff2e4ca990) at explain.c:535
#17 0x000000000059c9ef in ExplainOneQuery (query=<optimized out>, cursorOptions=<optimized out>, into=0x0, es=0x1141d48,   queryString=0xf4af30 "explain analyze\nSELECT DISTINCT\n        documentInternalId, is_current,\ndocumentId,\ndocumentTypeCode,\nsubjectRoleInternalId,\nsubjectEntityInternalId,\nsubjectEntityId,\nsubjectEntityIdRoot,\nsubjectEntit"..., params=0x0, queryEnv=0x0)   at explain.c:371
#18 0x000000000059ce37 in ExplainQuery (pstate=pstate@entry=0xf74608, stmt=stmt@entry=0x11ef240,   queryString=queryString@entry=0xf4af30 "explain analyze\nSELECT DISTINCT\n        documentInternalId, is_current,\ndocumentId,\ndocumentTypeCode,\nsubjectRoleInternalId,\nsubjectEntityInternalId,\nsubjectEntityId,\nsubjectEntityIdRoot,\nsubjectEntit"...,   params=params@entry=0x0, queryEnv=queryEnv@entry=0x0, dest=dest@entry=0xf74578) at explain.c:254
#19 0x000000000072ca5d in standard_ProcessUtility (pstmt=0x11ef390,   queryString=0xf4af30 "explain analyze\nSELECT DISTINCT\n        documentInternalId, is_current,\ndocumentId,\ndocumentTypeCode,\nsubjectRoleInternalId,\nsubjectEntityInternalId,\nsubjectEntityId,\nsubjectEntityIdRoot,\nsubjectEntit"...,   context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0xf74578, completionTag=0x7fff2e4cab20 "") at utility.c:675
#20 0x000000000072a052 in PortalRunUtility (portal=0xfb06b0, pstmt=0x11ef390, isTopLevel=<optimized out>,   setHoldSnapshot=<optimized out>, dest=<optimized out>, completionTag=0x7fff2e4cab20 "") at pquery.c:1178
#21 0x000000000072add2 in FillPortalStore (portal=portal@entry=0xfb06b0, isTopLevel=isTopLevel@entry=true) at pquery.c:1038
#22 0x000000000072b855 in PortalRun (portal=portal@entry=0xfb06b0, count=count@entry=9223372036854775807,   isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0xf4c570, altdest=altdest@entry=0xf4c570,   completionTag=0x7fff2e4cad30 "") at pquery.c:768
#23 0x00000000007276e8 in exec_simple_query (   query_string=0xf4af30 "explain analyze\nSELECT DISTINCT\n        documentInternalId, is_current,\ndocumentId,\ndocumentTypeCode,\nsubjectRoleInternalId,\nsubjectEntityInternalId,\nsubjectEntityId,\nsubjectEntityIdRoot,\nsubjectEntit"...) at postgres.c:1145
#24 0x0000000000729534 in PostgresMain (argc=<optimized out>, argv=argv@entry=0xf76ce8, dbname=<optimized out>,   username=<optimized out>) at postgres.c:4182
#25 0x00000000006be215 in BackendRun (port=0xf6efe0) at postmaster.c:4361
#26 BackendStartup (port=0xf6efe0) at postmaster.c:4033
#27 ServerLoop () at postmaster.c:1706
#28 0x00000000006bf122 in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0xf45320) at postmaster.c:1379
#29 0x00000000004822dc in main (argc=3, argv=0xf45320) at main.c:228

But who knows if that's it. I continue and watch top again...

31752 postgres  20   0 3112352   1.8g 329920 D  32.2 23.7   9:43.75 postgres: postgres integrator [local] EXPLAIN

it went quickly to 1.6, then after some time to 1.7, then 1.8, and I stop again:

^C
Program received signal SIGINT, Interrupt.
0x00007f048f336d71 in read () from /lib64/libpthread.so.0
(gdb) enable 2
(gdb) cont
Continuing.

Breakpoint 2, AllocSetAlloc (context=0x1168230, size=375) at aset.c:715
715     {
bt
#0  AllocSetAlloc (context=0x1168230, size=375) at aset.c:715
#1  0x000000000084e6cd in palloc (size=375) at mcxt.c:938
#2  0x000000000061019c in ExecHashJoinGetSavedTuple (file=file@entry=0x21df688, hashvalue=hashvalue@entry=0x7fff2e4ca76c,   tupleSlot=0x10856b8, hjstate=0x11688e0) at nodeHashjoin.c:1277
#3  0x0000000000610c83 in ExecHashJoinNewBatch (hjstate=0x11688e0) at nodeHashjoin.c:1042
#4  ExecHashJoinImpl (parallel=false, pstate=0x11688e0) at nodeHashjoin.c:539
#5  ExecHashJoin (pstate=0x11688e0) at nodeHashjoin.c:565
#6  0x00000000005fde68 in ExecProcNodeInstr (node=0x11688e0) at execProcnode.c:461
#7  0x000000000061ce4e in ExecProcNode (node=0x11688e0) at ../../../src/include/executor/executor.h:247
#8  ExecSort (pstate=0x11687d0) at nodeSort.c:107
#9  0x00000000005fde68 in ExecProcNodeInstr (node=0x11687d0) at execProcnode.c:461
#10 0x000000000061d2c4 in ExecProcNode (node=0x11687d0) at ../../../src/include/executor/executor.h:247
#11 ExecUnique (pstate=0x11685e0) at nodeUnique.c:73
#12 0x00000000005fde68 in ExecProcNodeInstr (node=0x11685e0) at execProcnode.c:461
#13 0x00000000005f75ba in ExecProcNode (node=0x11685e0) at ../../../src/include/executor/executor.h:247
#14 ExecutePlan (execute_once=<optimized out>, dest=0xcc60e0 <donothingDR>, direction=<optimized out>, numberTuples=0,   sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x11685e0, estate=0x1168340)   at execMain.c:1723
#15 standard_ExecutorRun (queryDesc=0x119b6d8, direction=<optimized out>, count=0, execute_once=<optimized out>) at execMain.c:364
#16 0x000000000059c6f8 in ExplainOnePlan (plannedstmt=plannedstmt@entry=0x1199a68, into=into@entry=0x0, es=es@entry=0x1141d48,   queryString=<optimized out>, params=0x0, queryEnv=queryEnv@entry=0x0, planduration=0x7fff2e4ca990) at explain.c:535
#17 0x000000000059c9ef in ExplainOneQuery (query=<optimized out>, cursorOptions=<optimized out>, into=0x0, es=0x1141d48,   queryString=0xf4af30 "explain analyze\nSELECT DISTINCT\n        documentInternalId, is_current,\ndocumentId,\ndocumentTypeCode,\nsubjectRoleInternalId,\nsubjectEntityInternalId,\nsubjectEntityId,\nsubjectEntityIdRoot,\nsubjectEntit"..., params=0x0, queryEnv=0x0)   at explain.c:371
#18 0x000000000059ce37 in ExplainQuery (pstate=pstate@entry=0xf74608, stmt=stmt@entry=0x11ef240,   queryString=queryString@entry=0xf4af30 "explain analyze\nSELECT DISTINCT\n        documentInternalId, is_current,\ndocumentId,\ndocumentTypeCode,\nsubjectRoleInternalId,\nsubjectEntityInternalId,\nsubjectEntityId,\nsubjectEntityIdRoot,\nsubjectEntit"...,   params=params@entry=0x0, queryEnv=queryEnv@entry=0x0, dest=dest@entry=0xf74578) at explain.c:254
#19 0x000000000072ca5d in standard_ProcessUtility (pstmt=0x11ef390,   queryString=0xf4af30 "explain analyze\nSELECT DISTINCT\n        documentInternalId, is_current,\ndocumentId,\ndocumentTypeCode,\nsubjectRoleInternalId,\nsubjectEntityInternalId,\nsubjectEntityId,\nsubjectEntityIdRoot,\nsubjectEntit"...,   context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0xf74578, completionTag=0x7fff2e4cab20 "") at utility.c:675
#20 0x000000000072a052 in PortalRunUtility (portal=0xfb06b0, pstmt=0x11ef390, isTopLevel=<optimized out>,   setHoldSnapshot=<optimized out>, dest=<optimized out>, completionTag=0x7fff2e4cab20 "") at pquery.c:1178
#21 0x000000000072add2 in FillPortalStore (portal=portal@entry=0xfb06b0, isTopLevel=isTopLevel@entry=true) at pquery.c:1038
#22 0x000000000072b855 in PortalRun (portal=portal@entry=0xfb06b0, count=count@entry=9223372036854775807,   isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0xf4c570, altdest=altdest@entry=0xf4c570,   completionTag=0x7fff2e4cad30 "") at pquery.c:768
#23 0x00000000007276e8 in exec_simple_query (   query_string=0xf4af30 "explain analyze\nSELECT DISTINCT\n        documentInternalId, is_current,\ndocumentId,\ndocumentTypeCode,\nsubjectRoleInternalId,\nsubjectEntityInternalId,\nsubjectEntityId,\nsubjectEntityIdRoot,\nsubjectEntit"...) at postgres.c:1145
#24 0x0000000000729534 in PostgresMain (argc=<optimized out>, argv=argv@entry=0xf76ce8, dbname=<optimized out>,   username=<optimized out>) at postgres.c:4182
#25 0x00000000006be215 in BackendRun (port=0xf6efe0) at postmaster.c:4361
#26 BackendStartup (port=0xf6efe0) at postmaster.c:4033
#27 ServerLoop () at postmaster.c:1706
#28 0x00000000006bf122 in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0xf45320) at postmaster.c:1379
#29 0x00000000004822dc in main (argc=3, argv=0xf45320) at main.c:228

Good, now I leave this all sitting like that for you to ask me what else you might want to see.

We are now close to the edge of the cliff.

-Gunther

pgsql-performance by date:

Previous
From: Tomas Vondra
Date:
Subject: Re: Block / Page Size Optimization
Next
From: Jeff Janes
Date:
Subject: Re: Out of Memory errors are frustrating as heck!