Re: stress test for parallel workers - Mailing list pgsql-hackers

From Justin Pryzby
Subject Re: stress test for parallel workers
Date
Msg-id 20190723174222.GO22387@telsasoft.com
Whole thread Raw
In response to Re: stress test for parallel workers  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: stress test for parallel workers
List pgsql-hackers
On Tue, Jul 23, 2019 at 01:28:47PM -0400, Tom Lane wrote:
> ... you'd think an OOM kill would show up in the kernel log.
> (Not necessarily in dmesg, though.  Did you try syslog?)

Nothing in /var/log/messages (nor dmesg ring).

I enabled abrtd while trying to reproduce it last week.  Since you asked I
looked again in messages, and found it'd logged 10 hours ago about this:

(gdb) bt
#0  0x000000395be32495 in raise () from /lib64/libc.so.6
#1  0x000000395be33c75 in abort () from /lib64/libc.so.6
#2  0x000000000085ddff in errfinish (dummy=<value optimized out>) at elog.c:555
#3  0x00000000006f7e94 in CheckPointReplicationOrigin () at origin.c:588
#4  0x00000000004f6ef1 in CheckPointGuts (checkPointRedo=5507491783792, flags=128) at xlog.c:9150
#5  0x00000000004feff6 in CreateCheckPoint (flags=128) at xlog.c:8937
#6  0x00000000006d49e2 in CheckpointerMain () at checkpointer.c:491
#7  0x000000000050fe75 in AuxiliaryProcessMain (argc=2, argv=0x7ffe00d56b00) at bootstrap.c:451
#8  0x00000000006dcf54 in StartChildProcess (type=CheckpointerProcess) at postmaster.c:5337
#9  0x00000000006de78a in reaper (postgres_signal_arg=<value optimized out>) at postmaster.c:2867
#10 <signal handler called>
#11 0x000000395bee1603 in __select_nocancel () from /lib64/libc.so.6
#12 0x00000000006e1488 in ServerLoop (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1671
#13 PostmasterMain (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1380
#14 0x0000000000656420 in main (argc=3, argv=0x27ae410) at main.c:228

#2  0x000000000085ddff in errfinish (dummy=<value optimized out>) at elog.c:555
        edata = <value optimized out>
        elevel = 22
        oldcontext = 0x27e15d0
        econtext = 0x0
        __func__ = "errfinish"
#3  0x00000000006f7e94 in CheckPointReplicationOrigin () at origin.c:588
        save_errno = <value optimized out>
        tmppath = 0x9c4518 "pg_logical/replorigin_checkpoint.tmp"
        path = 0x9c4300 "pg_logical/replorigin_checkpoint"
        tmpfd = 64
        i = <value optimized out>
        magic = 307747550
        crc = 4294967295
        __func__ = "CheckPointReplicationOrigin"
#4  0x00000000004f6ef1 in CheckPointGuts (checkPointRedo=5507491783792, flags=128) at xlog.c:9150
No locals.
#5  0x00000000004feff6 in CreateCheckPoint (flags=128) at xlog.c:8937
        shutdown = false
        checkPoint = {redo = 5507491783792, ThisTimeLineID = 1, PrevTimeLineID = 1, fullPageWrites = true, nextXidEpoch
=0, nextXid = 2141308, nextOid = 496731439, nextMulti = 1, nextMultiOffset = 0, 
 
          oldestXid = 561, oldestXidDB = 1, oldestMulti = 1, oldestMultiDB = 1, time = 1563781930, oldestCommitTsXid =
0,newestCommitTsXid = 0, oldestActiveXid = 2141308}
 
        recptr = <value optimized out>
        _logSegNo = <value optimized out>
        Insert = <value optimized out>
        freespace = <value optimized out>
        PriorRedoPtr = <value optimized out>
        curInsert = <value optimized out>
        last_important_lsn = <value optimized out>
        vxids = 0x280afb8
        nvxids = 0
        __func__ = "CreateCheckPoint"
#6  0x00000000006d49e2 in CheckpointerMain () at checkpointer.c:491
        ckpt_performed = false
        do_restartpoint = <value optimized out>
        flags = 128
        do_checkpoint = <value optimized out>
        now = 1563781930
        elapsed_secs = <value optimized out>
        cur_timeout = <value optimized out>
        rc = <value optimized out>
        local_sigjmp_buf = {{__jmpbuf = {2, -1669940128760174522, 9083146, 0, 140728912407216, 140728912407224,
-1669940128812603322,1670605924426606662}, __mask_was_saved = 1, __saved_mask = {__val = {
 
                18446744066192964103, 0, 246358747096, 140728912407296, 140446084917816, 140446078556040, 9083146, 0,
246346239061,140446078556040, 140447207471460, 0, 140447207471424, 140446084917816, 0, 
 
                7864320}}}}
        checkpointer_context = 0x27e15d0
        __func__ = "CheckpointerMain"

Supposedly it's trying to do this:

|    ereport(PANIC,  
|            (errcode_for_file_access(),
|             errmsg("could not write to file \"%s\": %m",
|                    tmppath)));

And since there's consistently nothing in logs, I'm guessing there's a
legitimate write error (legitimate from PG perspective).  Storage here is ext4
plus zfs tablespace on top of LVM on top of vmware thin volume.

Justin



pgsql-hackers by date:

Previous
From: Alvaro Herrera
Date:
Subject: Re: [bug fix] Produce a crash dump before main() on Windows
Next
From: Jeevan Ladhe
Date:
Subject: Re: block-level incremental backup