Re: stress test for parallel workers - Mailing list pgsql-hackers
From | Justin Pryzby |
---|---|
Subject | Re: stress test for parallel workers |
Date | |
Msg-id | 20190723174222.GO22387@telsasoft.com Whole thread Raw |
In response to | Re: stress test for parallel workers (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: stress test for parallel workers
|
List | pgsql-hackers |
On Tue, Jul 23, 2019 at 01:28:47PM -0400, Tom Lane wrote: > ... you'd think an OOM kill would show up in the kernel log. > (Not necessarily in dmesg, though. Did you try syslog?) Nothing in /var/log/messages (nor dmesg ring). I enabled abrtd while trying to reproduce it last week. Since you asked I looked again in messages, and found it'd logged 10 hours ago about this: (gdb) bt #0 0x000000395be32495 in raise () from /lib64/libc.so.6 #1 0x000000395be33c75 in abort () from /lib64/libc.so.6 #2 0x000000000085ddff in errfinish (dummy=<value optimized out>) at elog.c:555 #3 0x00000000006f7e94 in CheckPointReplicationOrigin () at origin.c:588 #4 0x00000000004f6ef1 in CheckPointGuts (checkPointRedo=5507491783792, flags=128) at xlog.c:9150 #5 0x00000000004feff6 in CreateCheckPoint (flags=128) at xlog.c:8937 #6 0x00000000006d49e2 in CheckpointerMain () at checkpointer.c:491 #7 0x000000000050fe75 in AuxiliaryProcessMain (argc=2, argv=0x7ffe00d56b00) at bootstrap.c:451 #8 0x00000000006dcf54 in StartChildProcess (type=CheckpointerProcess) at postmaster.c:5337 #9 0x00000000006de78a in reaper (postgres_signal_arg=<value optimized out>) at postmaster.c:2867 #10 <signal handler called> #11 0x000000395bee1603 in __select_nocancel () from /lib64/libc.so.6 #12 0x00000000006e1488 in ServerLoop (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1671 #13 PostmasterMain (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1380 #14 0x0000000000656420 in main (argc=3, argv=0x27ae410) at main.c:228 #2 0x000000000085ddff in errfinish (dummy=<value optimized out>) at elog.c:555 edata = <value optimized out> elevel = 22 oldcontext = 0x27e15d0 econtext = 0x0 __func__ = "errfinish" #3 0x00000000006f7e94 in CheckPointReplicationOrigin () at origin.c:588 save_errno = <value optimized out> tmppath = 0x9c4518 "pg_logical/replorigin_checkpoint.tmp" path = 0x9c4300 "pg_logical/replorigin_checkpoint" tmpfd = 64 i = <value optimized out> magic = 307747550 crc = 4294967295 __func__ = "CheckPointReplicationOrigin" #4 0x00000000004f6ef1 in CheckPointGuts (checkPointRedo=5507491783792, flags=128) at xlog.c:9150 No locals. #5 0x00000000004feff6 in CreateCheckPoint (flags=128) at xlog.c:8937 shutdown = false checkPoint = {redo = 5507491783792, ThisTimeLineID = 1, PrevTimeLineID = 1, fullPageWrites = true, nextXidEpoch =0, nextXid = 2141308, nextOid = 496731439, nextMulti = 1, nextMultiOffset = 0, oldestXid = 561, oldestXidDB = 1, oldestMulti = 1, oldestMultiDB = 1, time = 1563781930, oldestCommitTsXid = 0,newestCommitTsXid = 0, oldestActiveXid = 2141308} recptr = <value optimized out> _logSegNo = <value optimized out> Insert = <value optimized out> freespace = <value optimized out> PriorRedoPtr = <value optimized out> curInsert = <value optimized out> last_important_lsn = <value optimized out> vxids = 0x280afb8 nvxids = 0 __func__ = "CreateCheckPoint" #6 0x00000000006d49e2 in CheckpointerMain () at checkpointer.c:491 ckpt_performed = false do_restartpoint = <value optimized out> flags = 128 do_checkpoint = <value optimized out> now = 1563781930 elapsed_secs = <value optimized out> cur_timeout = <value optimized out> rc = <value optimized out> local_sigjmp_buf = {{__jmpbuf = {2, -1669940128760174522, 9083146, 0, 140728912407216, 140728912407224, -1669940128812603322,1670605924426606662}, __mask_was_saved = 1, __saved_mask = {__val = { 18446744066192964103, 0, 246358747096, 140728912407296, 140446084917816, 140446078556040, 9083146, 0, 246346239061,140446078556040, 140447207471460, 0, 140447207471424, 140446084917816, 0, 7864320}}}} checkpointer_context = 0x27e15d0 __func__ = "CheckpointerMain" Supposedly it's trying to do this: | ereport(PANIC, | (errcode_for_file_access(), | errmsg("could not write to file \"%s\": %m", | tmppath))); And since there's consistently nothing in logs, I'm guessing there's a legitimate write error (legitimate from PG perspective). Storage here is ext4 plus zfs tablespace on top of LVM on top of vmware thin volume. Justin
pgsql-hackers by date: