Re: Out of Memory errors are frustrating as heck! - Mailing list pgsql-performance
From: Gunther
Subject: Re: Out of Memory errors are frustrating as heck!
Msg-id: 2682858c-7916-11fe-d7e8-7c649afbbc74@gusw.net
In response to: Re: Out of Memory errors are frustrating as heck! (Jeff Janes <jeff.janes@gmail.com>)
List: pgsql-performance
OK Guys, you are very kind to continue taking an interest in this matter.
I will try what I can to help squish the bug.
Tomas Vondra just added a good idea that explains why I get the out-of-memory error while still having so much cache available:
# sysctl vm.overcommit_memory
vm.overcommit_memory = 2
# sysctl vm.overcommit_ratio
vm.overcommit_ratio = 50
as he predicted.
# cat /proc/meminfo | grep Commit
CommitLimit:     3955192 kB
Committed_AS:    2937352 kB
So I think that explains why it turns into an out of memory error. We don't need to worry or wonder about that any more. I will change that parameter in the future to allow for some spikes, but it's not going to resolve the underlying memory leak issue.
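For what it's worth, the numbers above line up with the kernel's documented formula for vm.overcommit_memory = 2: CommitLimit = swap + RAM * overcommit_ratio / 100. A quick sanity check (the RAM figure here is inferred from the CommitLimit shown, not measured, so treat it as illustrative):

```python
def commit_limit_kb(mem_total_kb, swap_total_kb, overcommit_ratio):
    # Kernel's commit cap under vm.overcommit_memory=2, ignoring
    # hugepage reservations: swap + RAM * ratio / 100.
    return swap_total_kb + mem_total_kb * overcommit_ratio // 100

# ~7.5 GiB RAM, no swap, and the default ratio of 50 reproduces the
# CommitLimit of 3955192 kB from /proc/meminfo above.
print(commit_limit_kb(7910384, 0, 50))  # 3955192
```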
Now I run explain analyze SELECT ... without the INSERT.
integrator=# \set VERBOSITY verbose
integrator=# \pset pager off
Pager usage is off.
integrator=# \pset format unaligned
Output format is unaligned.
integrator=# SET ENABLE_NESTLOOP TO OFF;
SET
integrator=# explain analyze SELECT * FROM reports.v_BusinessOperation;
ERROR:  53200: out of memory
DETAIL:  Failed on request of size 32800 in memory context "HashBatchContext".
LOCATION:  MemoryContextAlloc, mcxt.c:798
And since that failed already, I guess we don't need to worry about the temporary table insert.
About adding LIMIT: I don't think it makes sense in the outer query, since the error is probably happening earlier. I did put a LIMIT 100 on one of the tables we join to, and it helped. But that doesn't really tell us anything, I think.
Then yes, I can try the backtrace with the nested loops enabled. It will just take a long, long time, and unfortunately it is extremely likely that I would lose the console and then be unable to get back to it. OK, screen(1) resolves that problem too. Will do, after reporting the above.
But now you have already produced more ideas ...
> Maybe it is memory for trigger or constraint checking, although I don't
> know why that would appear instantly. What triggers or constraints do you
> have on businessoperation?
>
> Yeah, that would be my guess too. If I had to guess, something likely gets
> confused and allocates memory in es_query_ctx instead of the per-tuple
> context (es_per_tuple_exprcontext).
>
> Triggers, constraints and expr evaluation all seem like plausible
> candidates. It's going to be hard to nail the exact place, though.
I think triggers and constraints is ruled out, because the problem happens without the INSERT.
That leaves us with expression evaluation. And OK, now you really wanna see the query, although it should be in the plan too. But for what it is worth:
SELECT DISTINCT
       documentInternalId, is_current, documentId, documentTypeCode,
       subjectRoleInternalId, subjectEntityInternalId, subjectEntityId,
       subjectEntityIdRoot, subjectEntityName, subjectEntityTel,
       subjectEntityEmail, otherEntityInternalId, confidentialityCode,
       actInternalId,
       code_code AS operationCode,
       code_displayName AS operationName,
       operationQualifierCode, operationQualifierName,
       approvalNumber, approvalNumberSystem, approvalStateCode,
       approvalStateCodeSystem, approvalEffectiveTimeLow,
       approvalEffectiveTimeHigh, approvalStatusCode, licenseCode,
       agencyId, agencyName, productItemCode, productInternalId
  FROM reports.DocumentInformationSubject
  LEFT OUTER JOIN (SELECT documentInternalId, documentId, actInternalId,
                          subjectEntityCode AS productItemCode,
                          subjectEntityInternalId AS productInternalId
                     FROM reports.DocumentInformationSubject
                    WHERE participationTypeCode = 'PRD') prd
    USING(documentInternalId, documentId, actInternalId)
  LEFT OUTER JOIN (
    SELECT documentInternalId,
           q.code_code AS operationQualifierCode,
           q.code_displayName AS operationQualifierName,
           r.targetInternalId AS actInternalId,
           actInternalId AS approvalInternalId,
           an.extension AS approvalNumber,
           an.root AS approvalNumberSystem,
           qs.subjectEntityCode AS approvalStateCode,
           qs.subjectEntityCodeSystem AS approvalStateCodeSystem,
           qs.effectivetime_low AS approvalEffectiveTimeLow,
           qs.effectivetime_high AS approvalEffectiveTimeHigh,
           qs.statusCode AS approvalStatusCode,
           qs.code_code AS licenseCode,
           agencyId.extension AS agencyId,
           agencyName.trivialName AS agencyName
      FROM reports.DocumentInformation q
      LEFT OUTER JOIN (SELECT *
                         FROM reports.DocumentInformationSubject
                        WHERE participationTypeCode = 'AUT') qs
        USING(documentInternalId, actInternalId)
      INNER JOIN integrator.ActRelationship r
        ON(r.sourceInternalId = actInternalId AND r.typeCode = 'SUBJ')
      LEFT OUTER JOIN integrator.Act_id an USING(actInternalId)
      LEFT OUTER JOIN integrator.Entity_id agencyId
        ON(agencyId.entityInternalId = otherEntityInternalId)
      LEFT OUTER JOIN reports.BestName agencyName
        ON(agencyName.entityInternalId = otherEntityInternalId)
     WHERE q.classCode = 'CNTRCT'
       AND q.moodCode = 'EVN'
       AND q.code_codeSystem = '2.16.840.1.113883.3.26.1.1'
    ) q USING(documentInternalId, actInternalId)
 WHERE classCode = 'ACT'
   AND moodCode = 'DEF'
   AND code_codeSystem = '2.16.840.1.113883.3.26.1.1'
   AND participationTypeCode IN ('PPRF','PRF');
You see that the expressions are all just equality comparisons, some IN, nothing outlandish.
Now I will try what Tom Lane suggested. Here you go. And I have it stopped at this state, so if you want me to inspect anything else, I can do it.
With screen(1) I can be sure I won't lose my stuff when my internet goes down. Nice.
I have one screen session with 3 windows:
- psql
- gdb
- misc (vmstat, etc.)
Now I have let this run for a good long time while setting up my screen stuff. And then:
ps -x
look for the postgres job with the EXPLAIN ... that's $PID, then:
gdb -p $PID
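Incidentally, picking the PID out of the ps listing can be scripted; here is a sketch (explain_pid is a hypothetical helper, keyed to the process title format that top shows below):

```shell
# explain_pid: read a "PID COMMAND" listing on stdin and print the PID
# of the backend whose process title ends in EXPLAIN (hypothetical
# helper; title format as shown by ps/top in this thread).
explain_pid() {
    awk '/postgres.*EXPLAIN$/ {print $1}'
}

# Usage against the live process table:
ps -eo pid=,args= | explain_pid
```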
Then first I do
cont
but then it stops at SIGUSR1, because of the parallel workers signalling each other.
handle SIGUSR1 nostop
suppresses that stopping. Then I interrupt with CTRL-C and set the breakpoint where Tom Lane said:
b AllocSetAlloc
once it stops there I do
Breakpoint 1, AllocSetAlloc (context=0x1168230, size=8) at aset.c:715
715     {
(gdb) p context->name
$4 = 0x96ce5b "ExecutorState"
So I should even be able to set a conditional breakpoint.
(gdb) delete
Delete all breakpoints? (y or n) y
(gdb) b AllocSetAlloc if strcmp(context->name, "ExecutorState") == 0
Breakpoint 2 at 0x848ed0: file aset.c, line 715.
(gdb) cont
(gdb) cont
Continuing.

Breakpoint 2, AllocSetAlloc (context=0x1168230, size=10) at aset.c:715
715     {
(gdb) cont
Continuing.

Program received signal SIGUSR1, User defined signal 1.

Breakpoint 2, AllocSetAlloc (context=0x1168230, size=152) at aset.c:715
715     {
(gdb) cont
Continuing.

Program received signal SIGUSR1, User defined signal 1.

Breakpoint 2, AllocSetAlloc (context=0x1168230, size=201) at aset.c:715
715     {
(gdb) cont
Continuing.

Breakpoint 2, AllocSetAlloc (context=0x1168230, size=8272) at aset.c:715
715     {
(gdb) p context->name
$8 = 0x96ce5b "ExecutorState"
Nice. Now the question is: am I at the place where memory gets squeezed? And I think yes. With top:
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
31752 postgres  20   0 2772964   1.2g 329640 t   0.0 16.5   8:46.59 postgres: postgres integrator [local] EXPLAIN
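Instead of staring at top, the resident set could also be polled from /proc; a sketch (poll_rss is a hypothetical helper, not something I actually ran):

```shell
# poll_rss: print the VmRSS line of /proc/PID/status N times, once per
# second, to watch a backend's resident set grow (hypothetical helper).
poll_rss() {
    pid=$1; n=$2; i=0
    while [ "$i" -lt "$n" ]; do
        grep VmRSS "/proc/$pid/status"
        i=$((i + 1))
        sleep 1
    done
}

# e.g. poll_rss 31752 60   # 31752 = the EXPLAIN backend from ps -x
```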
I guess I should run this for a little longer. So I disable my breakpoints
(gdb) info breakpoints
Num     Type           Disp Enb Address            What
2       breakpoint     keep y   0x0000000000848ed0 in AllocSetAlloc at aset.c:715
        stop only if strcmp(context->name, "ExecutorState") == 0
        breakpoint already hit 6 times
(gdb) disable 2
(gdb) cont
Continuing.
while watching top:
31752 postgres  20   0 2777060   1.3g 329920 D  33.2 17.9   8:52.07 postgres: postgres integrator [local] EXPLAIN
31752 postgres  20   0 2777060   1.4g 329920 D  33.2 17.9   8:52.07 postgres: postgres integrator [local] EXPLAIN
31752 postgres  20   0 2777060   1.5g 329920 D  33.2 17.9   8:52.07 postgres: postgres integrator [local] EXPLAIN
it went up pretty quickly from 1.2 GB to 1.5 GB, but then the growth slowed, so now back to gdb and break:
^C
Program received signal SIGINT, Interrupt.
0x00007f048f336d71 in read () from /lib64/libpthread.so.0
(gdb) enable 2
(gdb) cont
Continuing.

Breakpoint 2, AllocSetAlloc (context=0x1168230, size=385) at aset.c:715
715     {
Now I give you a bt so we have something to look at:
#0  AllocSetAlloc (context=0x1168230, size=385) at aset.c:715
#1  0x000000000084e6cd in palloc (size=385) at mcxt.c:938
#2  0x000000000061019c in ExecHashJoinGetSavedTuple (file=file@entry=0x8bbc528, hashvalue=hashvalue@entry=0x7fff2e4ca76c, tupleSlot=0x10856b8, hjstate=0x11688e0) at nodeHashjoin.c:1277
#3  0x0000000000610c83 in ExecHashJoinNewBatch (hjstate=0x11688e0) at nodeHashjoin.c:1042
#4  ExecHashJoinImpl (parallel=false, pstate=0x11688e0) at nodeHashjoin.c:539
#5  ExecHashJoin (pstate=0x11688e0) at nodeHashjoin.c:565
#6  0x00000000005fde68 in ExecProcNodeInstr (node=0x11688e0) at execProcnode.c:461
#7  0x000000000061ce4e in ExecProcNode (node=0x11688e0) at ../../../src/include/executor/executor.h:247
#8  ExecSort (pstate=0x11687d0) at nodeSort.c:107
#9  0x00000000005fde68 in ExecProcNodeInstr (node=0x11687d0) at execProcnode.c:461
#10 0x000000000061d2c4 in ExecProcNode (node=0x11687d0) at ../../../src/include/executor/executor.h:247
#11 ExecUnique (pstate=0x11685e0) at nodeUnique.c:73
#12 0x00000000005fde68 in ExecProcNodeInstr (node=0x11685e0) at execProcnode.c:461
#13 0x00000000005f75ba in ExecProcNode (node=0x11685e0) at ../../../src/include/executor/executor.h:247
#14 ExecutePlan (execute_once=<optimized out>, dest=0xcc60e0 <donothingDR>, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x11685e0, estate=0x1168340) at execMain.c:1723
#15 standard_ExecutorRun (queryDesc=0x119b6d8, direction=<optimized out>, count=0, execute_once=<optimized out>) at execMain.c:364
#16 0x000000000059c6f8 in ExplainOnePlan (plannedstmt=plannedstmt@entry=0x1199a68, into=into@entry=0x0, es=es@entry=0x1141d48, queryString=<optimized out>, params=0x0, queryEnv=queryEnv@entry=0x0, planduration=0x7fff2e4ca990) at explain.c:535
#17 0x000000000059c9ef in ExplainOneQuery (query=<optimized out>, cursorOptions=<optimized out>, into=0x0, es=0x1141d48, queryString=0xf4af30 "explain analyze\nSELECT DISTINCT\n    documentInternalId, is_current,\ndocumentId,\ndocumentTypeCode,\nsubjectRoleInternalId,\nsubjectEntityInternalId,\nsubjectEntityId,\nsubjectEntityIdRoot,\nsubjectEntit"..., params=0x0, queryEnv=0x0) at explain.c:371
#18 0x000000000059ce37 in ExplainQuery (pstate=pstate@entry=0xf74608, stmt=stmt@entry=0x11ef240, queryString=queryString@entry=0xf4af30 "explain analyze\nSELECT DISTINCT\n    documentInternalId, is_current,\ndocumentId,\ndocumentTypeCode,\nsubjectRoleInternalId,\nsubjectEntityInternalId,\nsubjectEntityId,\nsubjectEntityIdRoot,\nsubjectEntit"..., params=params@entry=0x0, queryEnv=queryEnv@entry=0x0, dest=dest@entry=0xf74578) at explain.c:254
#19 0x000000000072ca5d in standard_ProcessUtility (pstmt=0x11ef390, queryString=0xf4af30 "explain analyze\nSELECT DISTINCT\n    documentInternalId, is_current,\ndocumentId,\ndocumentTypeCode,\nsubjectRoleInternalId,\nsubjectEntityInternalId,\nsubjectEntityId,\nsubjectEntityIdRoot,\nsubjectEntit"..., context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0xf74578, completionTag=0x7fff2e4cab20 "") at utility.c:675
#20 0x000000000072a052 in PortalRunUtility (portal=0xfb06b0, pstmt=0x11ef390, isTopLevel=<optimized out>, setHoldSnapshot=<optimized out>, dest=<optimized out>, completionTag=0x7fff2e4cab20 "") at pquery.c:1178
#21 0x000000000072add2 in FillPortalStore (portal=portal@entry=0xfb06b0, isTopLevel=isTopLevel@entry=true) at pquery.c:1038
#22 0x000000000072b855 in PortalRun (portal=portal@entry=0xfb06b0, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0xf4c570, altdest=altdest@entry=0xf4c570, completionTag=0x7fff2e4cad30 "") at pquery.c:768
#23 0x00000000007276e8 in exec_simple_query (query_string=0xf4af30 "explain analyze\nSELECT DISTINCT\n    documentInternalId, is_current,\ndocumentId,\ndocumentTypeCode,\nsubjectRoleInternalId,\nsubjectEntityInternalId,\nsubjectEntityId,\nsubjectEntityIdRoot,\nsubjectEntit"...) at postgres.c:1145
#24 0x0000000000729534 in PostgresMain (argc=<optimized out>, argv=argv@entry=0xf76ce8, dbname=<optimized out>, username=<optimized out>) at postgres.c:4182
#25 0x00000000006be215 in BackendRun (port=0xf6efe0) at postmaster.c:4361
#26 BackendStartup (port=0xf6efe0) at postmaster.c:4033
#27 ServerLoop () at postmaster.c:1706
#28 0x00000000006bf122 in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0xf45320) at postmaster.c:1379
#29 0x00000000004822dc in main (argc=3, argv=0xf45320) at main.c:228
But who knows if that's it. I continue and watch top again...
31752 postgres 20 0 3112352 1.8g 329920 D 32.2 23.7 9:43.75 postgres: postgres integrator [local] EXPLAIN
it went quickly to 1.6 GB, then after some time to 1.7 GB, then 1.8 GB, and I stop again:
^C
Program received signal SIGINT, Interrupt.
0x00007f048f336d71 in read () from /lib64/libpthread.so.0
(gdb) enable 2
(gdb) cont
Continuing.

Breakpoint 2, AllocSetAlloc (context=0x1168230, size=375) at aset.c:715
715     {
(gdb) bt
#0  AllocSetAlloc (context=0x1168230, size=375) at aset.c:715
#1  0x000000000084e6cd in palloc (size=375) at mcxt.c:938
#2  0x000000000061019c in ExecHashJoinGetSavedTuple (file=file@entry=0x21df688, hashvalue=hashvalue@entry=0x7fff2e4ca76c, tupleSlot=0x10856b8, hjstate=0x11688e0) at nodeHashjoin.c:1277
#3  0x0000000000610c83 in ExecHashJoinNewBatch (hjstate=0x11688e0) at nodeHashjoin.c:1042
#4  ExecHashJoinImpl (parallel=false, pstate=0x11688e0) at nodeHashjoin.c:539
#5  ExecHashJoin (pstate=0x11688e0) at nodeHashjoin.c:565
#6  0x00000000005fde68 in ExecProcNodeInstr (node=0x11688e0) at execProcnode.c:461
#7  0x000000000061ce4e in ExecProcNode (node=0x11688e0) at ../../../src/include/executor/executor.h:247
#8  ExecSort (pstate=0x11687d0) at nodeSort.c:107
#9  0x00000000005fde68 in ExecProcNodeInstr (node=0x11687d0) at execProcnode.c:461
#10 0x000000000061d2c4 in ExecProcNode (node=0x11687d0) at ../../../src/include/executor/executor.h:247
#11 ExecUnique (pstate=0x11685e0) at nodeUnique.c:73
#12 0x00000000005fde68 in ExecProcNodeInstr (node=0x11685e0) at execProcnode.c:461
#13 0x00000000005f75ba in ExecProcNode (node=0x11685e0) at ../../../src/include/executor/executor.h:247
#14 ExecutePlan (execute_once=<optimized out>, dest=0xcc60e0 <donothingDR>, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x11685e0, estate=0x1168340) at execMain.c:1723
#15 standard_ExecutorRun (queryDesc=0x119b6d8, direction=<optimized out>, count=0, execute_once=<optimized out>) at execMain.c:364
#16 0x000000000059c6f8 in ExplainOnePlan (plannedstmt=plannedstmt@entry=0x1199a68, into=into@entry=0x0, es=es@entry=0x1141d48, queryString=<optimized out>, params=0x0, queryEnv=queryEnv@entry=0x0, planduration=0x7fff2e4ca990) at explain.c:535
#17 0x000000000059c9ef in ExplainOneQuery (query=<optimized out>, cursorOptions=<optimized out>, into=0x0, es=0x1141d48, queryString=0xf4af30 "explain analyze\nSELECT DISTINCT\n    documentInternalId, is_current,\ndocumentId,\ndocumentTypeCode,\nsubjectRoleInternalId,\nsubjectEntityInternalId,\nsubjectEntityId,\nsubjectEntityIdRoot,\nsubjectEntit"..., params=0x0, queryEnv=0x0) at explain.c:371
#18 0x000000000059ce37 in ExplainQuery (pstate=pstate@entry=0xf74608, stmt=stmt@entry=0x11ef240, queryString=queryString@entry=0xf4af30 "explain analyze\nSELECT DISTINCT\n    documentInternalId, is_current,\ndocumentId,\ndocumentTypeCode,\nsubjectRoleInternalId,\nsubjectEntityInternalId,\nsubjectEntityId,\nsubjectEntityIdRoot,\nsubjectEntit"..., params=params@entry=0x0, queryEnv=queryEnv@entry=0x0, dest=dest@entry=0xf74578) at explain.c:254
#19 0x000000000072ca5d in standard_ProcessUtility (pstmt=0x11ef390, queryString=0xf4af30 "explain analyze\nSELECT DISTINCT\n    documentInternalId, is_current,\ndocumentId,\ndocumentTypeCode,\nsubjectRoleInternalId,\nsubjectEntityInternalId,\nsubjectEntityId,\nsubjectEntityIdRoot,\nsubjectEntit"..., context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0xf74578, completionTag=0x7fff2e4cab20 "") at utility.c:675
#20 0x000000000072a052 in PortalRunUtility (portal=0xfb06b0, pstmt=0x11ef390, isTopLevel=<optimized out>, setHoldSnapshot=<optimized out>, dest=<optimized out>, completionTag=0x7fff2e4cab20 "") at pquery.c:1178
#21 0x000000000072add2 in FillPortalStore (portal=portal@entry=0xfb06b0, isTopLevel=isTopLevel@entry=true) at pquery.c:1038
#22 0x000000000072b855 in PortalRun (portal=portal@entry=0xfb06b0, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0xf4c570, altdest=altdest@entry=0xf4c570, completionTag=0x7fff2e4cad30 "") at pquery.c:768
#23 0x00000000007276e8 in exec_simple_query (query_string=0xf4af30 "explain analyze\nSELECT DISTINCT\n    documentInternalId, is_current,\ndocumentId,\ndocumentTypeCode,\nsubjectRoleInternalId,\nsubjectEntityInternalId,\nsubjectEntityId,\nsubjectEntityIdRoot,\nsubjectEntit"...) at postgres.c:1145
#24 0x0000000000729534 in PostgresMain (argc=<optimized out>, argv=argv@entry=0xf76ce8, dbname=<optimized out>, username=<optimized out>) at postgres.c:4182
#25 0x00000000006be215 in BackendRun (port=0xf6efe0) at postmaster.c:4361
#26 BackendStartup (port=0xf6efe0) at postmaster.c:4033
#27 ServerLoop () at postmaster.c:1706
#28 0x00000000006bf122 in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0xf45320) at postmaster.c:1379
#29 0x00000000004822dc in main (argc=3, argv=0xf45320) at main.c:228
Good, now I leave this all sitting like that for you to ask me what else you might want to see.
We are now close to the edge of the cliff.
-Gunther