Re: Out of Memory errors are frustrating as heck! - Mailing list pgsql-performance

From: Gunther
Subject: Re: Out of Memory errors are frustrating as heck!
Msg-id: 2917100e-2849-2ce0-2774-8b75384cc2f9@gusw.net
In response to: Re: Out of Memory errors are frustrating as heck! (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List: pgsql-performance
Now to Justin's patch. First undo Tomas' patch and apply:
$ mv src/include/executor/hashjoin.h.orig src/include/executor/hashjoin.h
$ mv src/backend/executor/nodeHash.c.orig src/backend/executor/nodeHash.c
$ mv src/backend/executor/nodeHashjoin.c.orig src/backend/executor/nodeHashjoin.c
$ patch -p1 <../limit-hash-nbatches-v2.patch
patching file src/backend/executor/nodeHash.c
Hunk #1 succeeded at 570 (offset -3 lines).
Hunk #2 succeeded at 917 (offset -3 lines).
Hunk #3 succeeded at 930 (offset -3 lines).
Hunk #4 succeeded at 1037 (offset -3 lines).
Hunk #5 succeeded at 1658 (offset -4 lines).

$ make 
$ make install
$ pg_ctl -c restart
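(An aside: the -c makes pg_ctl allow the server processes to dump core. Since we expect crashes, it is worth verifying up front that cores can actually be written and have room; a quick sketch:

$ ulimit -c unlimited                  # lift the core file size limit for this shell
$ ulimit -c                            # verify: should print "unlimited"
$ cat /proc/sys/kernel/core_pattern    # where the kernel writes core files
$ df -h /var/lib/pgsql                 # multi-GB cores need multi-GB of free space
)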

and go ...

Lots of CPU again and very limited memory use so far.

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
11054 postgres  20   0 1302772  90316  58004 R  94.4  1.1   4:38.05 postgres: postgres integrator [local] EXPLAIN
11055 postgres  20   0 1280532  68076  57168 R  97.7  0.9   2:03.54 postgres: parallel worker for PID 11054
11056 postgres  20   0 1280532  67964  57124 S   0.0  0.9   2:08.28 postgres: parallel worker for PID 11054

That's a pretty decent sign so far. A slight increase ... but still relatively steady ...

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
11054 postgres  20   0 1379704 167140  58004 R  95.0  2.1   5:56.28 postgres: postgres integrator [local] EXPLAIN
11055 postgres  20   0 1280532  68076  57168 S  25.6  0.9   2:36.59 postgres: parallel worker for PID 11054
11056 postgres  20   0 1280532  67964  57124 R  61.8  0.9   2:29.65 postgres: parallel worker for PID 11054

aaand it breaks out ...

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
11119 postgres  20   0 1271660   1.0g   1.0g D   0.0 13.4   0:03.10 postgres: parallel worker for PID 11054
11054 postgres  20   0 1380940   1.0g 950508 D   0.0 13.4   6:56.09 postgres: postgres integrator [local] EXPLAIN
11118 postgres  20   0 1271660 884540 882724 D   0.0 11.2   0:02.84 postgres: parallel worker for PID 11054

and crash:

foo=# explain analyze select * from reports.v_BusinessOperation;
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!>

What happened? Ouch, no space left on the root device. Too much logging? Maybe the core dumps ... The log file content is simple:

2019-04-22 00:07:56.104 UTC [11048] LOG:  database system was shut down at 2019-04-22 00:07:55 UTC
2019-04-22 00:07:56.108 UTC [11046] LOG:  database system is ready to accept connections
Hashjoin 0x2122458: initial nbatch = 16, nbuckets = 8192
Hashjoin 0x2122458: increasing nbatch to 32 because space = 4128933
Hashjoin 0x2122458: freed 148 of 10584 tuples, space now 4071106
Hashjoin 0x2122458: increasing nbatch to 64 because space = 4128826
Hashjoin 0x2122458: freed 544 of 10584 tuples, space now 3916296
Hashjoin 0x2122458: increasing nbatch to 128 because space = 4128846
Hashjoin 0x2122458: freed 10419 of 10585 tuples, space now 65570
Hashjoin 0x2122458: increasing nbatch to 256 because space = 4128829
Hashjoin 0x2122458: freed 10308 of 10734 tuples, space now 161815
Hashjoin 0x2122458: increasing nbatch to 512 because space = 4128908
Hashjoin 0x2122458: freed 398 of 10379 tuples, space now 3977787
Hashjoin 0x3ac9918: initial nbatch = 1, nbuckets = 1024
Hashjoin 0x3ac91a8: initial nbatch = 1, nbuckets = 1024
Hashjoin 0x3ac93c8: initial nbatch = 1, nbuckets = 1024
Hashjoin 0x1f41018: initial nbatch = 1, nbuckets = 1024
Hashjoin 0x1f41048: initial nbatch = 1, nbuckets = 1024
2019-04-22 00:16:55.273 UTC [11046] LOG:  background worker "parallel worker" (PID 11119) was terminated by signal 11: Segmentation fault
2019-04-22 00:16:55.273 UTC [11046] DETAIL:  Failed process was running: explain analyze select * from reports.v_BusinessOperation;
2019-04-22 00:16:55.273 UTC [11046] LOG:  terminating any other active server processes
2019-04-22 00:16:55.274 UTC [11058] WARNING:  terminating connection because of crash of another server process
2019-04-22 00:16:55.274 UTC [11058] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2019-04-22 00:16:55.274 UTC [11058] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2019-04-22 00:16:55.277 UTC [11052] LOG:  could not write temporary statistics file "pg_stat/db_16903.tmp": No space left on device
2019-04-22 00:16:55.278 UTC [11052] LOG:  could not close temporary statistics file "pg_stat/db_0.tmp": No space left on device
2019-04-22 00:16:55.278 UTC [11052] LOG:  could not close temporary statistics file "pg_stat/global.tmp": No space left on device
2019-04-22 00:16:55.315 UTC [11046] LOG:  all server processes terminated; reinitializing
2019-04-22 00:16:55.425 UTC [11123] LOG:  database system was interrupted; last known up at 2019-04-22 00:12:56 UTC
2019-04-22 00:16:55.426 UTC [11124] FATAL:  the database system is in recovery mode
2019-04-22 00:16:55.545 UTC [11123] LOG:  database system was not properly shut down; automatic recovery in progress
2019-04-22 00:16:55.549 UTC [11123] LOG:  redo starts at 3D2/C44FDCF8
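Those Hashjoin lines are presumably the patch's instrumentation: each time the in-memory hash table grows to about 4 MB, nbatch is doubled and whatever tuples now belong to other batches are spilled and freed. 4128933 bytes is just shy of 4 MiB (4194304), which smells like a work_mem of 4MB; that is my assumption, but it is easy to confirm:

$ psql -c "SHOW work_mem;"

Note how little some of the doublings actually buy: the jump to 128 batches frees nearly everything (10419 of 10585 tuples), but by the jump to 512 only 398 of 10379 can go, so space stays pinned near 4 MB while nbatch keeps doubling.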

OK, it is all because it dumped three core files. Glad that I captured the top lines of the backend and its two workers:

-rw------- 1 postgres postgres 1075843072 Apr 22 00:16 core.11054 -- backend
-rw------- 1 postgres postgres  894640128 Apr 22 00:16 core.11118 -- worker 1
-rw------- 1 postgres postgres 1079103488 Apr 22 00:16 core.11119 -- worker 2
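To keep the next round of cores from being truncated (and from starving the stats writer of disk space at the same time), they could be pointed at a filesystem with more room. A sketch; the /big/cores path is made up:

$ df -h /                                            # confirm what filled up
$ sudo mkdir -p /big/cores && sudo chmod 1777 /big/cores
$ sudo sysctl -w kernel.core_pattern=/big/cores/core.%e.%p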

And the meltdown starts with "parallel worker" (PID 11119) receiving SIGSEGV.

So let's get gdb to the task to see what's up:

$ gdb -c data/core.11119 postgresql-11.2/src/backend/postgres
...
Reading symbols from postgresql-11.2/src/backend/postgres...done.
BFD: Warning: /var/lib/pgsql/data/core.11119 is truncated: expected core file size >= 1127112704, found: 1079103488.
[New LWP 11119]
Cannot access memory at address 0x7ff8d25dc108
Cannot access memory at address 0x7ff8d25dc100
Failed to read a valid object file image from memory.
Core was generated by `postgres: parallel worker for'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00000000006bd792 in ExecParallelHashJoinNewBatch (   hjstate=<error reading variable: Cannot access memory at address 0x7ffd45fa9c38>) at nodeHashjoin.c:1127
1127                            {
(gdb) bt 8
#0  0x00000000006bd792 in ExecParallelHashJoinNewBatch (   hjstate=<error reading variable: Cannot access memory at address 0x7ffd45fa9c38>) at nodeHashjoin.c:1127
Backtrace stopped: Cannot access memory at address 0x7ffd45fa9c88
(gdb) info frame
Stack level 0, frame at 0x7ffd45fa9c90:
 rip = 0x6bd792 in ExecParallelHashJoinNewBatch (nodeHashjoin.c:1127); saved rip = <not saved>
 Outermost frame: Cannot access memory at address 0x7ffd45fa9c88
 source language c.
 Arglist at 0x7ffd45fa9c80, args: hjstate=<error reading variable: Cannot access memory at address 0x7ffd45fa9c38>
 Locals at 0x7ffd45fa9c80, Previous frame's sp is 0x7ffd45fa9c90
Cannot access memory at address 0x7ffd45fa9c80
(gdb) list
1122                            SharedTuplestoreAccessor *inner_tuples;
1123                            Barrier    *batch_barrier =
1124                            &hashtable->batches[batchno].shared->batch_barrier;
1125
1126                            switch (BarrierAttach(batch_barrier))
1127                            {
1128                                    case PHJ_BATCH_ELECTING:
1129
1130                                            /* One backend allocates the hash table. */
1131                                            if (BarrierArriveAndWait(batch_barrier,

Unfortunately this core file is truncated because the file system ran out of space. Let's see the others.

$ gdb -c data/core.11118 postgresql-11.2/src/backend/postgres
...
Reading symbols from postgresql-11.2/src/backend/postgres...done.
BFD: Warning: /var/lib/pgsql/data/core.11118 is truncated: expected core file size >= 1127112704, found: 894640128.
[New LWP 11118]
Cannot access memory at address 0x7ff8d25dc108
Cannot access memory at address 0x7ff8d25dc100
Failed to read a valid object file image from memory.
Core was generated by `postgres: parallel worker for'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00000000006bd792 in ExecParallelHashJoinNewBatch (   hjstate=<error reading variable: Cannot access memory at address 0x7ffd45fa9c38>) at nodeHashjoin.c:1127
1127                            {
(gdb) bt 5
#0  0x00000000006bd792 in ExecParallelHashJoinNewBatch (   hjstate=<error reading variable: Cannot access memory at address 0x7ffd45fa9c38>) at nodeHashjoin.c:1127
Backtrace stopped: Cannot access memory at address 0x7ffd45fa9c88
(gdb) list
1122                            SharedTuplestoreAccessor *inner_tuples;
1123                            Barrier    *batch_barrier =
1124                            &hashtable->batches[batchno].shared->batch_barrier;
1125
1126                            switch (BarrierAttach(batch_barrier))
1127                            {
1128                                    case PHJ_BATCH_ELECTING:
1129
1130                                            /* One backend allocates the hash table. */
1131                                            if (BarrierArriveAndWait(batch_barrier,

Strange, that one must have died very similarly: same place, and its core is also truncated.

$ gdb -c data/core.11054 postgresql-11.2/src/backend/postgres
...
Reading symbols from postgresql-11.2/src/backend/postgres...done.
BFD: Warning: /var/lib/pgsql/data/core.11054 is truncated: expected core file size >= 1238786048, found: 1075843072.
[New LWP 11054]
Cannot access memory at address 0x7ff8d25dc108
Cannot access memory at address 0x7ff8d25dc100
Failed to read a valid object file image from memory.
Core was generated by `postgres: postgres integrator'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00000000006bd792 in ExecParallelHashJoinNewBatch (   hjstate=<error reading variable: Cannot access memory at address 0x7ffd45fa9498>) at nodeHashjoin.c:1127
1127                            {
(gdb)

I don't understand why all of them are at the same location. Doesn't make any sense to me.
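For what it's worth, frame #0 is derived from the saved registers, which sit in the note segment at the front of the core and survive truncation, so the identical fault address across all three processes is probably real; it is the argument values and the deeper frames that went missing with the stack pages. A few commands that may still work against a truncated core (a sketch, untested here):

(gdb) info registers rip rsp rbp   # registers come from the ELF notes, not the lost pages
(gdb) x/8i $rip - 16               # disassemble around the faulting instruction
(gdb) info proc mappings           # which memory ranges the core actually captured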

But I'll leave it at that right now.

-Gunther
