Thread: stress test for parallel workers
Does anyone have a stress test for parallel workers ?

On a customer's new VM, I got this several times while (trying to) migrate their DB:

< 2019-07-23 10:33:51.552 CDT postgres >FATAL:  postmaster exited during a parallel transaction
< 2019-07-23 10:33:51.552 CDT postgres >STATEMENT:  CREATE UNIQUE INDEX unused0_huawei_umts_nodeb_locell_201907_unique_idx ON child.unused0_huawei_umts_nodeb_locell_201907 USING btree ...

There's nothing in dmesg nor in postgres logs.

At first I thought it's maybe because of a network disconnection, then I thought it's because we ran out of space (WAL), then they had a faulty blade.  After that, I'd tried and failed to reproduce it a number of times, but it's just recurred during what was intended to be their final restore.

I've set max_parallel_workers_per_gather=0, but I'm planning to try to diagnose the issue in another instance.  Ideally with a minimal test, since I'm apparently going to have to run under gdb to see how it's dying, or even what process is failing.

DMI: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/21/2015
CentOS release 6.9 (Final)
Linux alextelsasrv01 2.6.32-754.17.1.el6.x86_64 #1 SMP Tue Jul 2 12:42:48 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
version | PostgreSQL 11.4 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-23), 64-bit

Justin
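As a rough illustration of the kind of stress test being asked for here, a small libpq driver that repeatedly forces parallel plans keeps the postmaster busy launching and reaping workers.  This is only a sketch: the table name stress_tbl, the row count, and the cost settings are illustrative choices, not anything taken from the report above.

/*
 * Crude parallel-worker stress loop (sketch only).  Repeatedly runs a
 * query that is likely to use a parallel plan, so the postmaster is kept
 * busy launching and reaping background workers.
 *
 * Build roughly as: cc stress.c -o stress -I$(pg_config --includedir) -lpq
 */
#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
	PGconn	   *conn = PQconnectdb("");		/* use PG* environment variables */

	if (PQstatus(conn) != CONNECTION_OK)
	{
		fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
		return 1;
	}

	/* Encourage parallelism as much as possible for this session. */
	PQclear(PQexec(conn, "SET max_parallel_workers_per_gather = 4"));
	PQclear(PQexec(conn, "SET parallel_setup_cost = 0"));
	PQclear(PQexec(conn, "SET parallel_tuple_cost = 0"));
	PQclear(PQexec(conn, "SET min_parallel_table_scan_size = 0"));

	PQclear(PQexec(conn, "DROP TABLE IF EXISTS stress_tbl"));
	PQclear(PQexec(conn, "CREATE TABLE stress_tbl AS "
						 "SELECT g AS i, md5(g::text) AS t "
						 "FROM generate_series(1, 1000000) g"));

	for (int i = 0; i < 10000; i++)
	{
		PGresult   *res = PQexec(conn, "SELECT count(*) FROM stress_tbl");

		if (PQresultStatus(res) != PGRES_TUPLES_OK)
		{
			fprintf(stderr, "iteration %d: %s", i, PQerrorMessage(conn));
			PQclear(res);
			break;
		}
		PQclear(res);
	}

	PQfinish(conn);
	return 0;
}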
Justin Pryzby <pryzby@telsasoft.com> writes:
> Does anyone have a stress test for parallel workers ?
> On a customer's new VM, I got this several times while (trying to) migrate their DB:
> < 2019-07-23 10:33:51.552 CDT postgres >FATAL:  postmaster exited during a parallel transaction

We've been seeing this irregularly in the buildfarm, too.  I've been suspicious that it's from an OOM kill on the postmaster in the buildfarm cases, but ...

> There's nothing in dmesg nor in postgres logs.

... you'd think an OOM kill would show up in the kernel log.  (Not necessarily in dmesg, though.  Did you try syslog?)

> Ideally a minimal test, since I'm apparently going to
> have to run under gdb to see how it's dying, or even what process is failing.

Like it told you, it's the postmaster that's going away.  That's Not Supposed To Happen, of course, but unfortunately Linux' OOM kill heuristic preferentially targets the postmaster when its children are consuming too many resources.  If that is the problem, there's some info on working around it at

https://www.postgresql.org/docs/current/kernel-resources.html#LINUX-MEMORY-OVERCOMMIT

			regards, tom lane
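As an aside on the OOM-kill workaround Tom points to: the usual Linux technique is to start the parent process with a low OOM score and have each forked child make itself a more attractive victim by writing to /proc/self/oom_score_adj.  A minimal sketch of that idea, assuming the process may write its own score file (the value 500 and the child command are arbitrary; the linked documentation describes how PostgreSQL itself arranges this):

/*
 * Sketch only: raise a child process's OOM score after fork() so the
 * Linux OOM killer prefers it over the parent.  Positive values make
 * the process a more likely victim; error handling is deliberately thin.
 */
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

static void
make_me_more_killable(const char *score)	/* e.g. "500" */
{
	FILE	   *f = fopen("/proc/self/oom_score_adj", "w");

	if (f != NULL)
	{
		fputs(score, f);
		fclose(f);
	}
}

int
main(void)
{
	pid_t		pid = fork();

	if (pid == 0)
	{
		/* child: become the preferred OOM victim, then do the real work */
		make_me_more_killable("500");
		execlp("sleep", "sleep", "60", (char *) NULL);
		_exit(127);
	}
	/* parent (the "postmaster" here) keeps its original, lower score */
	waitpid(pid, NULL, 0);
	return 0;
}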
On Tue, Jul 23, 2019 at 01:28:47PM -0400, Tom Lane wrote:
> ... you'd think an OOM kill would show up in the kernel log.
> (Not necessarily in dmesg, though.  Did you try syslog?)

Nothing in /var/log/messages (nor the dmesg ring).  I enabled abrtd while trying to reproduce it last week.  Since you asked I looked again in messages, and found it'd logged 10 hours ago about this:

(gdb) bt
#0  0x000000395be32495 in raise () from /lib64/libc.so.6
#1  0x000000395be33c75 in abort () from /lib64/libc.so.6
#2  0x000000000085ddff in errfinish (dummy=<value optimized out>) at elog.c:555
#3  0x00000000006f7e94 in CheckPointReplicationOrigin () at origin.c:588
#4  0x00000000004f6ef1 in CheckPointGuts (checkPointRedo=5507491783792, flags=128) at xlog.c:9150
#5  0x00000000004feff6 in CreateCheckPoint (flags=128) at xlog.c:8937
#6  0x00000000006d49e2 in CheckpointerMain () at checkpointer.c:491
#7  0x000000000050fe75 in AuxiliaryProcessMain (argc=2, argv=0x7ffe00d56b00) at bootstrap.c:451
#8  0x00000000006dcf54 in StartChildProcess (type=CheckpointerProcess) at postmaster.c:5337
#9  0x00000000006de78a in reaper (postgres_signal_arg=<value optimized out>) at postmaster.c:2867
#10 <signal handler called>
#11 0x000000395bee1603 in __select_nocancel () from /lib64/libc.so.6
#12 0x00000000006e1488 in ServerLoop (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1671
#13 PostmasterMain (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1380
#14 0x0000000000656420 in main (argc=3, argv=0x27ae410) at main.c:228

#2  0x000000000085ddff in errfinish (dummy=<value optimized out>) at elog.c:555
        edata = <value optimized out>
        elevel = 22
        oldcontext = 0x27e15d0
        econtext = 0x0
        __func__ = "errfinish"
#3  0x00000000006f7e94 in CheckPointReplicationOrigin () at origin.c:588
        save_errno = <value optimized out>
        tmppath = 0x9c4518 "pg_logical/replorigin_checkpoint.tmp"
        path = 0x9c4300 "pg_logical/replorigin_checkpoint"
        tmpfd = 64
        i = <value optimized out>
        magic = 307747550
        crc = 4294967295
        __func__ = "CheckPointReplicationOrigin"
#4  0x00000000004f6ef1 in CheckPointGuts (checkPointRedo=5507491783792, flags=128) at xlog.c:9150
No locals.
#5  0x00000000004feff6 in CreateCheckPoint (flags=128) at xlog.c:8937
        shutdown = false
        checkPoint = {redo = 5507491783792, ThisTimeLineID = 1, PrevTimeLineID = 1, fullPageWrites = true, nextXidEpoch = 0, nextXid = 2141308, nextOid = 496731439, nextMulti = 1, nextMultiOffset = 0, oldestXid = 561, oldestXidDB = 1, oldestMulti = 1, oldestMultiDB = 1, time = 1563781930, oldestCommitTsXid = 0, newestCommitTsXid = 0, oldestActiveXid = 2141308}
        recptr = <value optimized out>
        _logSegNo = <value optimized out>
        Insert = <value optimized out>
        freespace = <value optimized out>
        PriorRedoPtr = <value optimized out>
        curInsert = <value optimized out>
        last_important_lsn = <value optimized out>
        vxids = 0x280afb8
        nvxids = 0
        __func__ = "CreateCheckPoint"
#6  0x00000000006d49e2 in CheckpointerMain () at checkpointer.c:491
        ckpt_performed = false
        do_restartpoint = <value optimized out>
        flags = 128
        do_checkpoint = <value optimized out>
        now = 1563781930
        elapsed_secs = <value optimized out>
        cur_timeout = <value optimized out>
        rc = <value optimized out>
        local_sigjmp_buf = {{__jmpbuf = {2, -1669940128760174522, 9083146, 0, 140728912407216, 140728912407224, -1669940128812603322, 1670605924426606662}, __mask_was_saved = 1, __saved_mask = {__val = {18446744066192964103, 0, 246358747096, 140728912407296, 140446084917816, 140446078556040, 9083146, 0, 246346239061, 140446078556040, 140447207471460, 0, 140447207471424, 140446084917816, 0, 7864320}}}}
        checkpointer_context = 0x27e15d0
        __func__ = "CheckpointerMain"

Supposedly it's trying to do this:

|                ereport(PANIC,
|                                (errcode_for_file_access(),
|                                 errmsg("could not write to file \"%s\": %m",
|                                                tmppath)));

And since there's consistently nothing in logs, I'm guessing there's a legitimate write error (legitimate from PG perspective).  Storage here is ext4 plus zfs tablespace on top of LVM on top of vmware thin volume.

Justin
On Wed, Jul 24, 2019 at 4:27 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> < 2019-07-23 10:33:51.552 CDT postgres >FATAL:  postmaster exited during a parallel transaction
> < 2019-07-23 10:33:51.552 CDT postgres >STATEMENT:  CREATE UNIQUE INDEX unused0_huawei_umts_nodeb_locell_201907_unique_idx ON child.unused0_huawei_umts_nodeb_locell_201907 USING btree ...
> ... I've set
> max_parallel_workers_per_gather=0, ...

Just by the way, parallelism in CREATE INDEX is controlled by max_parallel_maintenance_workers, not max_parallel_workers_per_gather.

--
Thomas Munro
https://enterprisedb.com
On Wed, Jul 24, 2019 at 5:42 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> #2  0x000000000085ddff in errfinish (dummy=<value optimized out>) at elog.c:555
>         edata = <value optimized out>
>         elevel = 22
>         oldcontext = 0x27e15d0
>         econtext = 0x0
>         __func__ = "errfinish"
> #3  0x00000000006f7e94 in CheckPointReplicationOrigin () at origin.c:588
>         save_errno = <value optimized out>
>         tmppath = 0x9c4518 "pg_logical/replorigin_checkpoint.tmp"
>         path = 0x9c4300 "pg_logical/replorigin_checkpoint"
>         tmpfd = 64
>         i = <value optimized out>
>         magic = 307747550
>         crc = 4294967295
>         __func__ = "CheckPointReplicationOrigin"

> Supposedly it's trying to do this:
>
> |                ereport(PANIC,
> |                                (errcode_for_file_access(),
> |                                 errmsg("could not write to file \"%s\": %m",
> |                                                tmppath)));
>
> And since there's consistently nothing in logs, I'm guessing there's a
> legitimate write error (legitimate from PG perspective).  Storage here is ext4
> plus zfs tablespace on top of LVM on top of vmware thin volume.

If you have that core, it might be interesting to go to frame 2 and print *edata or edata->saved_errno.  If the errno is EIO, it's a bit strange if that's not showing up in some form in kernel logs or dmesg or something; if it's ENOSPC I guess it'd be normal that it doesn't show up anywhere and there is nothing in the PostgreSQL logs if they're on the same full filesystem, but then you would probably already have mentioned that your filesystem was out of space.  Could it have been fleetingly full due to some other thing happening on the system that rapidly expands and contracts?

I'm confused by the evidence, though.  If this PANIC is the origin of the problem, how do we get to postmaster-death based exit in a parallel leader*, rather than quickdie() (the kind of exit that happens when the postmaster sends SIGQUIT to every process and they say "terminating connection because of crash of another server process", because some backend crashed or panicked)?  Perhaps it would be clearer what's going on if you could put the PostgreSQL log onto a different filesystem, so we get a better chance of collecting evidence?

But then... the parallel leader process was apparently able to log something -- maybe it was just lucky, but you said this happened this way more than once.  I'm wondering how it could be that you got some kind of IO failure and weren't able to log the PANIC message AND your postmaster was killed, and you were able to log a message about that.  Perhaps we're looking at evidence from two unrelated failures.

*I suspect that the only thing implicating parallelism in this failure is that parallel leaders happen to print out that message if the postmaster dies while they are waiting for workers; most other places (probably every other backend in your cluster) just quietly exit.  That tells us something about what's happening, but on its own doesn't tell us that parallelism plays an important role in the failure mode.

--
Thomas Munro
https://enterprisedb.com
Thomas Munro <thomas.munro@gmail.com> writes:
> *I suspect that the only thing implicating parallelism in this failure
> is that parallel leaders happen to print out that message if the
> postmaster dies while they are waiting for workers; most other places
> (probably every other backend in your cluster) just quietly exit.
> That tells us something about what's happening, but on its own doesn't
> tell us that parallelism plays an important role in the failure mode.

I agree that there's little evidence implicating parallelism directly.  The reason I'm suspicious about a possible OOM kill is that parallel queries would appear to the OOM killer to be eating more resources than the same workload non-parallel, so that we might be at more hazard of getting OOM'd just because of that.

A different theory is that there's some hard-to-hit bug in the postmaster's processing of parallel workers that doesn't apply to regular backends.  I've looked for one in a desultory way but not really focused on it.

In any case, the evidence from the buildfarm is pretty clear that there is *some* connection.  We've seen a lot of recent failures involving "postmaster exited during a parallel transaction", while the number of postmaster failures not involving that is epsilon.

			regards, tom lane
On Wed, Jul 24, 2019 at 10:03 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> >         edata = <value optimized out>
> If you have that core, it might be interesting to go to frame 2 and
> print *edata or edata->saved_errno. ...

Rats.  We already saw that it's optimised out so unless we can find that somewhere else in a variable that's present in the core, we probably can't find out what the operating system said.  So my other idea for getting this information next time is to try putting the PostgreSQL logs somewhere that's more likely to be still working when that thing fails.

--
Thomas Munro
https://enterprisedb.com
On Wed, Jul 24, 2019 at 10:03:25AM +1200, Thomas Munro wrote:
> On Wed, Jul 24, 2019 at 5:42 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > #2  0x000000000085ddff in errfinish (dummy=<value optimized out>) at elog.c:555
> >         edata = <value optimized out>
>
> If you have that core, it might be interesting to go to frame 2 and
> print *edata or edata->saved_errno.

As you saw ... unless you know a trick, it's "optimized out".

> Could it have been fleetingly full due to some other thing happening on the
> system that rapidly expands and contracts?

It's not impossible, especially while loading data, and data_dir is only 64GB; it may have happened that way sometimes; but it's hard to believe that's been the case 5-10 times now.  If I don't forget to drop the old database previously loaded, when loading old/historic data, it should have ~40GB free on data_dir, and no clients connected other than pg_restore.

$ df -h /var/lib/pgsql
Filesystem                 Size  Used Avail Use% Mounted on
/dev/mapper/data-postgres   64G   26G   38G  41% /var/lib/pgsql

> | ereport(PANIC,
> |                 (errcode_for_file_access(),
> |                  errmsg("could not write to file \"%s\": %m",
> |                                 tmppath)));
>
> And since there's consistently nothing in logs, I'm guessing there's a
> legitimate write error (legitimate from PG perspective).  Storage here is ext4
> plus zfs tablespace on top of LVM on top of vmware thin volume.

I realized this probably is *not* an issue with zfs, since it's failing to log (for one reason or another) to /var/lib/pgsql (ext4).

> Perhaps it would be clearer what's going on if you could put the PostgreSQL
> log onto a different filesystem, so we get a better chance of collecting
> evidence?

I didn't mention it, but last weekend I'd left a loop around the restore process running overnight, and had convinced myself the issue didn't recur since their faulty blade was taken out of service...  My plan was to leave the server running in the foreground with logging_collector=no, which I hope is enough, unless logging is itself somehow implicated.  I'm trying to stress test that way now.

> But then... the parallel leader process was apparently able
> to log something -- maybe it was just lucky, but you said this
> happened this way more than once.  I'm wondering how it could be that
> you got some kind of IO failure and weren't able to log the PANIC
> message AND your postmaster was killed, and you were able to log a
> message about that.  Perhaps we're looking at evidence from two
> unrelated failures.

The messages from the parallel leader (building indices) were visible to the client, not via the server log.  I was loading their data and the errors were visible when pg_restore failed.

On Wed, Jul 24, 2019 at 09:10:41AM +1200, Thomas Munro wrote:
> Just by the way, parallelism in CREATE INDEX is controlled by
> max_parallel_maintenance_workers, not max_parallel_workers_per_gather.

Thank you.

Justin
On Wed, Jul 24, 2019 at 10:42 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> On Wed, Jul 24, 2019 at 10:03:25AM +1200, Thomas Munro wrote:
> > On Wed, Jul 24, 2019 at 5:42 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > > #2  0x000000000085ddff in errfinish (dummy=<value optimized out>) at elog.c:555
> > >         edata = <value optimized out>
> >
> > If you have that core, it might be interesting to go to frame 2 and
> > print *edata or edata->saved_errno.
>
> As you saw ... unless you know a trick, it's "optimized out".

How about something like this:

print errordata[errordata_stack_depth]

If you can't find errordata_stack_depth, maybe look at the whole array and try to find the interesting bit?

--
Thomas Munro
https://enterprisedb.com
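Why this gdb trick can work: elog.c keeps its error state in a small file-scope static array indexed by a depth counter, so the data stays reachable from a core dump even when a function's local ErrorData pointer has been optimized out.  A simplified sketch of that layout (field list abbreviated and possibly not matching the exact version in use; the real definitions live in src/backend/utils/error/elog.c):

/*
 * Layout sketch of elog.c's error stack.  Because the array is static
 * at file scope, gdb can print it from a core dump even when a local
 * ErrorData pointer such as errfinish()'s "edata" has been optimized out.
 */
#define ERRORDATA_STACK_SIZE 5

typedef struct ErrorData
{
	int			elevel;			/* error severity */
	const char *filename;		/* __FILE__ of ereport() call */
	int			lineno;			/* __LINE__ of ereport() call */
	const char *funcname;		/* __func__ of ereport() call */
	char	   *message;		/* primary error message */
	int			saved_errno;	/* errno at entry */
	/* ... many more fields in the real struct ... */
} ErrorData;

static ErrorData errordata[ERRORDATA_STACK_SIZE];	/* the stack itself */
static int	errordata_stack_depth = -1;				/* index of topmost entry */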
On Wed, Jul 24, 2019 at 10:46:42AM +1200, Thomas Munro wrote:
> On Wed, Jul 24, 2019 at 10:42 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > On Wed, Jul 24, 2019 at 10:03:25AM +1200, Thomas Munro wrote:
> > > On Wed, Jul 24, 2019 at 5:42 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > > > #2  0x000000000085ddff in errfinish (dummy=<value optimized out>) at elog.c:555
> > > >         edata = <value optimized out>
> > >
> > > If you have that core, it might be interesting to go to frame 2 and
> > > print *edata or edata->saved_errno.
> >
> > As you saw ... unless you know a trick, it's "optimized out".
>
> How about something like this:
>
> print errordata[errordata_stack_depth]

Clever.

(gdb) p errordata[errordata_stack_depth]
$2 = {elevel = 13986192, output_to_server = 254, output_to_client = 127, show_funcname = false, hide_stmt = false, hide_ctx = false,
  filename = 0x27b3790 "< %m %u >", lineno = 41745456, funcname = 0x3030313335 <Address 0x3030313335 out of bounds>,
  domain = 0x0, context_domain = 0x27cff90 "postgres", sqlerrcode = 0,
  message = 0xe8800000001 <Address 0xe8800000001 out of bounds>, detail = 0x297a <Address 0x297a out of bounds>,
  detail_log = 0x0, hint = 0xe88 <Address 0xe88 out of bounds>, context = 0x297a <Address 0x297a out of bounds>,
  message_id = 0x0, schema_name = 0x0, table_name = 0x0, column_name = 0x0, datatype_name = 0x0, constraint_name = 0x0,
  cursorpos = 0, internalpos = 0, internalquery = 0x0, saved_errno = 0, assoc_context = 0x0}

(gdb) p errordata
$3 = {{elevel = 22, output_to_server = true, output_to_client = false, show_funcname = false, hide_stmt = false, hide_ctx = false,
  filename = 0x9c4030 "origin.c", lineno = 591, funcname = 0x9c46e0 "CheckPointReplicationOrigin",
  domain = 0x9ac810 "postgres-11", context_domain = 0x9ac810 "postgres-11", sqlerrcode = 4293,
  message = 0x27b0e40 "could not write to file \"pg_logical/replorigin_checkpoint.tmp\": No space left on device",
  detail = 0x0, detail_log = 0x0, hint = 0x0, context = 0x0,
  message_id = 0x8a7aa8 "could not write to file \"%s\": %m", ...

I ought to have remembered that it *was* in fact out of space this AM when this core was dumped (due to having not touched it since scheduling transition to this VM last week).

I want to say I'm almost certain it wasn't ENOSPC in other cases, since, failing to find log output, I ran df right after the failure.

But that gives me an idea: is it possible there's an issue with files being held open by worker processes?  Including by parallel workers?  Probably WALs, even after they're rotated?  If there were worker processes holding open lots of rotated WALs, that could cause ENOSPC, but that wouldn't be obvious after they die, since the space would then be freed.

Justin
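A side note on the "space would then be freed" point: on Unix filesystems, disk space backing an unlinked-but-still-open file is only returned when the last descriptor on it is closed, which is exactly why a df taken after the offending processes have exited can look clean.  A small illustration of that effect (path and sizes are arbitrary):

/*
 * Demo: space used by an unlinked file is not reclaimed until every
 * open file descriptor on it is closed, so "df" run after the holder
 * exits shows nothing unusual.  Illustration only.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>

int
main(void)
{
	const char *path = "/tmp/holds_space.tmp";
	static char buf[1 << 20];
	int			fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0600);

	if (fd < 0)
	{
		perror("open");
		return 1;
	}
	unlink(path);				/* no name left, but the inode lives on */

	memset(buf, 'x', sizeof(buf));
	for (int i = 0; i < 256; i++)	/* ~256 MB that df still counts as used */
		if (write(fd, buf, sizeof(buf)) < 0)
		{
			perror("write");
			break;
		}

	puts("run 'df' now: the space is gone even though the file has no name");
	sleep(60);					/* space is reclaimed only when fd closes */
	close(fd);
	return 0;
}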
Justin Pryzby <pryzby@telsasoft.com> writes:
> I want to say I'm almost certain it wasn't ENOSPC in other cases, since,
> failing to find log output, I ran df right after the failure.

The fact that you're not finding log output matching what was reported to the client seems to me to be a mighty strong indication that there *was* an ENOSPC problem.  Can you reconfigure to put the postmaster log on a different volume?

> But that gives me an idea: is it possible there's an issue with files being
> held open by worker processes?  Including by parallel workers?  Probably
> WALs, even after they're rotated?  If there were worker processes holding
> open lots of rotated WALs, that could cause ENOSPC, but that wouldn't be
> obvious after they die, since the space would then be freed.

Parallel workers aren't ever allowed to write, in the current implementation, so it's not real obvious why they'd have any WAL log files open at all.

			regards, tom lane
On Wed, Jul 24, 2019 at 11:04 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> I ought to have remembered that it *was* in fact out of space this AM when this
> core was dumped (due to having not touched it since scheduling transition to
> this VM last week).
>
> I want to say I'm almost certain it wasn't ENOSPC in other cases, since,
> failing to find log output, I ran df right after the failure.

Ok, cool, so the ENOSPC thing we understand, and the postmaster death thing is probably something entirely different.  Which brings us to the question: what is killing your postmaster or causing it to exit silently and unexpectedly, but leaving no trace in any operating system log?  You mentioned that you couldn't see any signs of the OOM killer.  Are you in a situation to test an OOM failure so you can confirm what that looks like on your system?  You might try typing this into Python:

x = [42]
for i in range(1000):
    x = x + x

On my non-Linux system, it ran for a while and then was killed, and dmesg showed:

pid 15956 (python3.6), jid 0, uid 1001, was killed: out of swap space
pid 40238 (firefox), jid 0, uid 1001, was killed: out of swap space

Admittedly it is quite hard to distinguish between a web browser and a program designed to eat memory as fast as possible...  Anyway, on Linux you should see stuff about killed processes and/or OOM in one of dmesg, syslog, messages.

> But that gives me an idea: is it possible there's an issue with files being
> held open by worker processes?  Including by parallel workers?  Probably
> WALs, even after they're rotated?  If there were worker processes holding
> open lots of rotated WALs, that could cause ENOSPC, but that wouldn't be
> obvious after they die, since the space would then be freed.

Parallel workers don't do anything with WAL files, but they can create temporary files.  If you're building humongous indexes with parallel workers, you'll get some of those, but I don't think it'd be more than you'd get without parallelism.  If you were using up all of your disk space with temporary files, wouldn't this be reproducible?  I think you said you were testing this repeatedly, so if that were the problem I'd expect to see some non-panicky out-of-space errors when the temp files blow out your disk space, and only rarely a panic if a checkpoint happens to run exactly at a moment where the create index hasn't yet written the byte that breaks the camel's back, but the checkpoint pushes it over the edge in one of these places where it panics on failure.

--
Thomas Munro
https://enterprisedb.com
On Wed, Jul 24, 2019 at 10:11 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > *I suspect that the only thing implicating parallelism in this failure
> > is that parallel leaders happen to print out that message if the
> > postmaster dies while they are waiting for workers; most other places
> > (probably every other backend in your cluster) just quietly exit.
> > That tells us something about what's happening, but on its own doesn't
> > tell us that parallelism plays an important role in the failure mode.
>
> I agree that there's little evidence implicating parallelism directly.
> The reason I'm suspicious about a possible OOM kill is that parallel
> queries would appear to the OOM killer to be eating more resources
> than the same workload non-parallel, so that we might be at more
> hazard of getting OOM'd just because of that.
>
> A different theory is that there's some hard-to-hit bug in the
> postmaster's processing of parallel workers that doesn't apply to
> regular backends.  I've looked for one in a desultory way but not
> really focused on it.
>
> In any case, the evidence from the buildfarm is pretty clear that
> there is *some* connection.  We've seen a lot of recent failures
> involving "postmaster exited during a parallel transaction", while
> the number of postmaster failures not involving that is epsilon.

I don't have access to the build farm history in searchable format (I'll go and ask for that).  Do you have an example to hand?  Is this failure always happening on Linux?

--
Thomas Munro
https://enterprisedb.com
On 2019-Jul-23, Justin Pryzby wrote:
> I want to say I'm almost certain it wasn't ENOSPC in other cases, since,
> failing to find log output, I ran df right after the failure.

I'm not sure that this proves much, since I expect temporary files to be deleted on failure; by the time you run 'df' the condition might have already been cleared.  You'd need to be capturing diskspace telemetry with sufficient granularity ...

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Jul 24, 2019 at 11:32:30AM +1200, Thomas Munro wrote:
> On Wed, Jul 24, 2019 at 11:04 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > I ought to have remembered that it *was* in fact out of space this AM when this
> > core was dumped (due to having not touched it since scheduling transition to
> > this VM last week).
> >
> > I want to say I'm almost certain it wasn't ENOSPC in other cases, since,
> > failing to find log output, I ran df right after the failure.

I meant it wasn't a trivial error on my part of failing to drop the previously loaded DB instance.

It occurred to me to check inodes, which can also cause ENOSPC.  This is mkfs -T largefile, so running out of inodes is not an impossibility.  But it seems an unlikely culprit, unless something made tens of thousands of (small) files.

[pryzbyj@alextelsasrv01 ~]$ df -i /var/lib/pgsql
Filesystem                Inodes IUsed IFree IUse% Mounted on
/dev/mapper/data-postgres  65536  5605 59931    9% /var/lib/pgsql

> Ok, cool, so the ENOSPC thing we understand, and the postmaster death
> thing is probably something entirely different.  Which brings us to
> the question: what is killing your postmaster or causing it to exit
> silently and unexpectedly, but leaving no trace in any operating
> system log?  You mentioned that you couldn't see any signs of the OOM
> killer.  Are you in a situation to test an OOM failure so you can
> confirm what that looks like on your system?

$ command time -v python -c "'x'*4999999999" |wc
Traceback (most recent call last):
  File "<string>", line 1, in <module>
MemoryError
Command exited with non-zero status 1
...
	Maximum resident set size (kbytes): 4276

$ dmesg
...
Out of memory: Kill process 10665 (python) score 478 or sacrifice child
Killed process 10665, UID 503, (python) total-vm:4024260kB, anon-rss:3845756kB, file-rss:1624kB

I wouldn't burn too much more time on it until I can reproduce it.  The failures were all during pg_restore, so the checkpointer would've been very busy.  It seems possible for it to notice ENOSPC before the workers... which would be fsyncing WAL, where the checkpointer is fsyncing data.

> Admittedly it is quite hard to distinguish between a web browser
> and a program designed to eat memory as fast as possible...

Browsers making lots of progress here, but still clearly 2nd place.

Justin
Thomas Munro <thomas.munro@gmail.com> writes:
> On Wed, Jul 24, 2019 at 10:11 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> In any case, the evidence from the buildfarm is pretty clear that
>> there is *some* connection.  We've seen a lot of recent failures
>> involving "postmaster exited during a parallel transaction", while
>> the number of postmaster failures not involving that is epsilon.

> I don't have access to the build farm history in searchable format
> (I'll go and ask for that).

Yeah, it's definitely handy to be able to do SQL searches in the history.  I forget whether Dunstan or Frost is the person to ask for access, but there's no reason you shouldn't have it.

> Do you have an example to hand?  Is this
> failure always happening on Linux?

I dug around a bit further, and while my recollection of a lot of "postmaster exited during a parallel transaction" failures is accurate, there is a very strong correlation I'd not noticed: it's just a few buildfarm critters that are producing those.  To wit, I find that string in these recent failures (checked all runs in the past 3 months):

  sysname  |    branch     |      snapshot
-----------+---------------+---------------------
 lorikeet  | HEAD          | 2019-06-16 20:28:25
 lorikeet  | HEAD          | 2019-07-07 14:58:38
 lorikeet  | HEAD          | 2019-07-02 10:38:08
 lorikeet  | HEAD          | 2019-06-14 14:58:24
 lorikeet  | HEAD          | 2019-07-04 20:28:44
 lorikeet  | HEAD          | 2019-04-30 11:00:49
 lorikeet  | HEAD          | 2019-06-19 20:29:27
 lorikeet  | HEAD          | 2019-05-21 08:28:26
 lorikeet  | REL_11_STABLE | 2019-07-11 08:29:08
 lorikeet  | REL_11_STABLE | 2019-07-09 08:28:41
 lorikeet  | REL_12_STABLE | 2019-07-16 08:28:37
 lorikeet  | REL_12_STABLE | 2019-07-02 21:46:47
 lorikeet  | REL9_6_STABLE | 2019-07-02 20:28:14
 vulpes    | HEAD          | 2019-06-14 09:18:18
 vulpes    | HEAD          | 2019-06-27 09:17:19
 vulpes    | HEAD          | 2019-07-21 09:01:45
 vulpes    | HEAD          | 2019-06-12 09:11:02
 vulpes    | HEAD          | 2019-07-05 08:43:29
 vulpes    | HEAD          | 2019-07-15 08:43:28
 vulpes    | HEAD          | 2019-07-19 09:28:12
 wobbegong | HEAD          | 2019-06-09 20:43:22
 wobbegong | HEAD          | 2019-07-02 21:17:41
 wobbegong | HEAD          | 2019-06-04 21:06:07
 wobbegong | HEAD          | 2019-07-14 20:43:54
 wobbegong | HEAD          | 2019-06-19 21:05:04
 wobbegong | HEAD          | 2019-07-08 20:55:18
 wobbegong | HEAD          | 2019-06-28 21:18:46
 wobbegong | HEAD          | 2019-06-02 20:43:20
 wobbegong | HEAD          | 2019-07-04 21:01:37
 wobbegong | HEAD          | 2019-06-14 21:20:59
 wobbegong | HEAD          | 2019-06-23 21:36:51
 wobbegong | HEAD          | 2019-07-18 21:31:36
(32 rows)

We already knew that lorikeet has its own peculiar stability problems, and these other two critters run different compilers on the same Fedora 27 ppc64le platform.

So I think I've got to take back the assertion that we've got some lurking generic problem.  This pattern looks way more like a platform-specific issue.  Overaggressive OOM killer would fit the facts on vulpes/wobbegong, perhaps, though it's odd that it only happens on HEAD runs.

			regards, tom lane
On Wed, Jul 24, 2019 at 5:15 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > On Wed, Jul 24, 2019 at 10:11 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Do you have an example to hand?  Is this
> > failure always happening on Linux?
>
> I dug around a bit further, and while my recollection of a lot of
> "postmaster exited during a parallel transaction" failures is accurate,
> there is a very strong correlation I'd not noticed: it's just a few
> buildfarm critters that are producing those.  To wit, I find that
> string in these recent failures (checked all runs in the past 3 months):
>
>   sysname  |    branch     |      snapshot
> -----------+---------------+---------------------
>  lorikeet  | HEAD          | 2019-06-16 20:28:25
>  lorikeet  | HEAD          | 2019-07-07 14:58:38
>  lorikeet  | HEAD          | 2019-07-02 10:38:08
>  lorikeet  | HEAD          | 2019-06-14 14:58:24
>  lorikeet  | HEAD          | 2019-07-04 20:28:44
>  lorikeet  | HEAD          | 2019-04-30 11:00:49
>  lorikeet  | HEAD          | 2019-06-19 20:29:27
>  lorikeet  | HEAD          | 2019-05-21 08:28:26
>  lorikeet  | REL_11_STABLE | 2019-07-11 08:29:08
>  lorikeet  | REL_11_STABLE | 2019-07-09 08:28:41
>  lorikeet  | REL_12_STABLE | 2019-07-16 08:28:37
>  lorikeet  | REL_12_STABLE | 2019-07-02 21:46:47
>  lorikeet  | REL9_6_STABLE | 2019-07-02 20:28:14
>  vulpes    | HEAD          | 2019-06-14 09:18:18
>  vulpes    | HEAD          | 2019-06-27 09:17:19
>  vulpes    | HEAD          | 2019-07-21 09:01:45
>  vulpes    | HEAD          | 2019-06-12 09:11:02
>  vulpes    | HEAD          | 2019-07-05 08:43:29
>  vulpes    | HEAD          | 2019-07-15 08:43:28
>  vulpes    | HEAD          | 2019-07-19 09:28:12
>  wobbegong | HEAD          | 2019-06-09 20:43:22
>  wobbegong | HEAD          | 2019-07-02 21:17:41
>  wobbegong | HEAD          | 2019-06-04 21:06:07
>  wobbegong | HEAD          | 2019-07-14 20:43:54
>  wobbegong | HEAD          | 2019-06-19 21:05:04
>  wobbegong | HEAD          | 2019-07-08 20:55:18
>  wobbegong | HEAD          | 2019-06-28 21:18:46
>  wobbegong | HEAD          | 2019-06-02 20:43:20
>  wobbegong | HEAD          | 2019-07-04 21:01:37
>  wobbegong | HEAD          | 2019-06-14 21:20:59
>  wobbegong | HEAD          | 2019-06-23 21:36:51
>  wobbegong | HEAD          | 2019-07-18 21:31:36
> (32 rows)
>
> We already knew that lorikeet has its own peculiar stability
> problems, and these other two critters run different compilers
> on the same Fedora 27 ppc64le platform.
>
> So I think I've got to take back the assertion that we've got
> some lurking generic problem.  This pattern looks way more
> like a platform-specific issue.  Overaggressive OOM killer
> would fit the facts on vulpes/wobbegong, perhaps, though
> it's odd that it only happens on HEAD runs.

chipmunk also:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2019-08-06%2014:16:16

I wondered if the build farm should try to report OOM kill -9 or other signal activity affecting the postmaster.  On some systems (depending on sysctl kernel.dmesg_restrict on Linux, security.bsd.unprivileged_read_msgbuf on FreeBSD etc) you can run dmesg as a non-root user, and there the OOM killer's footprints or signaled exit statuses for processes under init might normally be found, but that seems a bit invasive for the host system (I guess you'd filter it carefully).  Unfortunately it isn't enabled on many common systems anyway.  Maybe there is a systemd-specific way to get the info we need without being root?

Another idea: start the postmaster under a subreaper (Linux 3.4+ prctl(PR_SET_CHILD_SUBREAPER), FreeBSD 10.2+ procctl(PROC_REAP_ACQUIRE)) that exists just to report on its children's exit status, so the build farm could see a "pid XXX was killed by signal 9" message if it is nuked by the OOM killer.  Perhaps there is a common subreaper wrapper out there that would wait, print messages like that, rinse and repeat until it has no children and then exit, or perhaps pg_ctl or even a perl script could do something like that if requested.

Another thought, not explored, is the brand new Linux pidfd stuff that can be used to wait and get exit status for a non-child process (or the older BSD equivalent), but the paint isn't even dry on that stuff anyway.

--
Thomas Munro
https://enterprisedb.com
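For concreteness, here is a minimal sketch of the sort of subreaper wrapper described above, assuming Linux 3.4+ for prctl(PR_SET_CHILD_SUBREAPER).  It runs a command, adopts any orphaned descendants, and reports how each reaped process exited, which is the information the buildfarm currently loses:

/*
 * Minimal subreaper sketch (Linux 3.4+): run a command, adopt any
 * orphaned descendants, and report how each reaped process exited.
 * Error handling is intentionally sparse.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <errno.h>
#include <sys/prctl.h>
#include <sys/wait.h>

int
main(int argc, char **argv)
{
	if (argc < 2)
	{
		fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
		return 1;
	}

	/* Become a subreaper: orphaned grandchildren get reparented to us. */
	if (prctl(PR_SET_CHILD_SUBREAPER, 1L) != 0)
		perror("prctl");

	pid_t		child = fork();

	if (child == 0)
	{
		execvp(argv[1], argv + 1);
		perror("execvp");
		_exit(127);
	}

	/* Reap children until there are none left, reporting each exit. */
	for (;;)
	{
		int			status;
		pid_t		pid = wait(&status);

		if (pid < 0)
		{
			if (errno == ECHILD)
				break;			/* no more descendants */
			continue;
		}
		if (WIFSIGNALED(status))
			fprintf(stderr, "pid %d was killed by signal %d%s\n",
					(int) pid, WTERMSIG(status),
					WCOREDUMP(status) ? " (core dumped)" : "");
		else if (WIFEXITED(status))
			fprintf(stderr, "pid %d exited with status %d\n",
					(int) pid, WEXITSTATUS(status));
	}
	return 0;
}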
Thomas Munro <thomas.munro@gmail.com> writes:
> I wondered if the build farm should try to report OOM kill -9 or other
> signal activity affecting the postmaster.

Yeah, I've been wondering whether pg_ctl could fork off a subprocess that would fork the postmaster, wait for the postmaster to exit, and then report the exit status.  Where to report it *to* seems like the hard part, but maybe an answer that worked for the buildfarm would be enough for now.

			regards, tom lane
On Wed, Aug 7, 2019 at 4:29 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > I wondered if the build farm should try to report OOM kill -9 or other
> > signal activity affecting the postmaster.
>
> Yeah, I've been wondering whether pg_ctl could fork off a subprocess
> that would fork the postmaster, wait for the postmaster to exit, and then
> report the exit status.  Where to report it *to* seems like the hard part,
> but maybe an answer that worked for the buildfarm would be enough for now.

Oh, right, you don't even need subreaper tricks (I was imagining we had a double fork somewhere we don't).

Another question is whether the build farm should be setting the Linux oom score adjust thing.

--
Thomas Munro
https://enterprisedb.com
Thomas Munro <thomas.munro@gmail.com> writes:
> Another question is whether the build farm should be setting the Linux
> oom score adjust thing.

AFAIK you can't do that without being root.

			regards, tom lane
On Wed, Aug 7, 2019 at 5:07 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > Another question is whether the build farm should be setting the Linux
> > oom score adjust thing.
>
> AFAIK you can't do that without being root.

Rats, yeah you need CAP_SYS_RESOURCE or root to lower it.

--
Thomas Munro
https://enterprisedb.com
On 07/08/2019 02:57, Thomas Munro wrote:
> On Wed, Jul 24, 2019 at 5:15 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> So I think I've got to take back the assertion that we've got
>> some lurking generic problem.  This pattern looks way more
>> like a platform-specific issue.  Overaggressive OOM killer
>> would fit the facts on vulpes/wobbegong, perhaps, though
>> it's odd that it only happens on HEAD runs.
>
> chipmunk also:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2019-08-06%2014:16:16
>
> I wondered if the build farm should try to report OOM kill -9 or other
> signal activity affecting the postmaster.

FWIW, I looked at the logs in /var/log/* on chipmunk, and found no evidence of OOM killings.  I can see nothing unusual in the OS logs around the time of that failure.

- Heikki
Heikki Linnakangas <hlinnaka@iki.fi> writes:
> On 07/08/2019 02:57, Thomas Munro wrote:
>> On Wed, Jul 24, 2019 at 5:15 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> So I think I've got to take back the assertion that we've got
>>> some lurking generic problem.  This pattern looks way more
>>> like a platform-specific issue.  Overaggressive OOM killer
>>> would fit the facts on vulpes/wobbegong, perhaps, though
>>> it's odd that it only happens on HEAD runs.

>> chipmunk also:
>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2019-08-06%2014:16:16

> FWIW, I looked at the logs in /var/log/* on chipmunk, and found no
> evidence of OOM killings.  I can see nothing unusual in the OS logs
> around the time of that failure.

Oh, that is very useful info, thanks.  That seems to mean that we should be suspecting a segfault, assertion failure, etc inside the postmaster.  I don't see any TRAP message in chipmunk's log, so assertion failure seems to be ruled out, but other sorts of process-crashing errors would fit the facts.

A stack trace from the crash would be mighty useful info along about here.  I wonder whether chipmunk has the infrastructure needed to create such a thing.  From memory, the buildfarm requires gdb for that, but not sure if there are additional requirements.

Also, if you're using systemd or something else that thinks it ought to interfere with where cores get dropped, that could be a problem.

			regards, tom lane
On 07/08/2019 16:57, Tom Lane wrote:
> Heikki Linnakangas <hlinnaka@iki.fi> writes:
>> On 07/08/2019 02:57, Thomas Munro wrote:
>>> On Wed, Jul 24, 2019 at 5:15 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>>> So I think I've got to take back the assertion that we've got
>>>> some lurking generic problem.  This pattern looks way more
>>>> like a platform-specific issue.  Overaggressive OOM killer
>>>> would fit the facts on vulpes/wobbegong, perhaps, though
>>>> it's odd that it only happens on HEAD runs.
>
>>> chipmunk also:
>>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2019-08-06%2014:16:16
>
>> FWIW, I looked at the logs in /var/log/* on chipmunk, and found no
>> evidence of OOM killings.  I can see nothing unusual in the OS logs
>> around the time of that failure.
>
> Oh, that is very useful info, thanks.  That seems to mean that we
> should be suspecting a segfault, assertion failure, etc inside
> the postmaster.  I don't see any TRAP message in chipmunk's log,
> so assertion failure seems to be ruled out, but other sorts of
> process-crashing errors would fit the facts.
>
> A stack trace from the crash would be mighty useful info along
> about here.  I wonder whether chipmunk has the infrastructure
> needed to create such a thing.  From memory, the buildfarm requires
> gdb for that, but not sure if there are additional requirements.

It does have gdb installed.

> Also, if you're using systemd or something else that thinks it
> ought to interfere with where cores get dropped, that could be
> a problem.

I think they should just go to a file called "core"; I don't think I've changed any settings related to it, at least.  I tried "find / -name core*", but didn't find any core files, though.

- Heikki
Heikki Linnakangas <hlinnaka@iki.fi> writes:
> On 07/08/2019 16:57, Tom Lane wrote:
>> Also, if you're using systemd or something else that thinks it
>> ought to interfere with where cores get dropped, that could be
>> a problem.

> I think they should just go to a file called "core", I don't think I've
> changed any settings related to it, at least.  I tried "find / -name
> core*", but didn't find any core files, though.

On Linux machines the first thing to check is

cat /proc/sys/kernel/core_pattern

On a Debian machine I have handy, that just says "core", but Red Hat tends to mess with it ...

			regards, tom lane
On 07/08/2019 17:45, Tom Lane wrote:
> Heikki Linnakangas <hlinnaka@iki.fi> writes:
>> On 07/08/2019 16:57, Tom Lane wrote:
>>> Also, if you're using systemd or something else that thinks it
>>> ought to interfere with where cores get dropped, that could be
>>> a problem.
>
>> I think they should just go to a file called "core", I don't think I've
>> changed any settings related to it, at least.  I tried "find / -name
>> core*", but didn't find any core files, though.
>
> On Linux machines the first thing to check is
>
> cat /proc/sys/kernel/core_pattern
>
> On a Debian machine I have handy, that just says "core", but Red Hat
> tends to mess with it ...

It's just "core" on chipmunk, too.

- Heikki
Thomas Munro <thomas.munro@gmail.com> writes:
> On Wed, Aug 7, 2019 at 4:29 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Yeah, I've been wondering whether pg_ctl could fork off a subprocess
>> that would fork the postmaster, wait for the postmaster to exit, and then
>> report the exit status.  Where to report it *to* seems like the hard part,
>> but maybe an answer that worked for the buildfarm would be enough for now.

> Oh, right, you don't even need subreaper tricks (I was imagining we
> had a double fork somewhere we don't).

I got around to looking at how to do this.  Seeing that chipmunk hasn't failed again, I'm inclined to write that off as perhaps unrelated.  That leaves us to diagnose the pg_upgrade failures on wobbegong and vulpes.  The pg_upgrade test uses pg_ctl to start the postmaster, and the only simple way to wedge this requirement into pg_ctl is as attached.

Now, the attached is completely *not* suitable as a permanent patch, because it degrades or breaks a number of pg_ctl behaviors that rely on knowing the postmaster's actual PID rather than that of the parent shell.  But it gets through check-world, so I think we can stick it in transiently to see what it can teach us about the buildfarm failures.  Given wobbegong's recent failure rate, I don't think we'll have to wait long.

Some notes about the patch:

* The core idea is to change start_postmaster's shell invocation so that the shell doesn't just exec the postmaster, but runs a mini shell script that runs the postmaster and then reports its exit status.  I found that this still needed a dummy exec to force the shell to perform the I/O redirections on itself, else pg_ctl's TAP tests fail.  (I think what was happening was that if the shell continued to hold open its original stdin, IPC::Run didn't believe the command was done.)

* This means that what start_postmaster returns is not the postmaster's own PID, but that of the parent shell.  So we have to lobotomize wait_for_postmaster to handle the PID the same way as on Windows (where that was already true); it can't test for exact equality between the child process PID and what's in postmaster.pid.  (trap_sigint_during_startup is also broken, but we don't need that to work to get through the regression tests.)

* That makes recovery/t/017_shm.pl fail, because there's a race condition: after killing the old postmaster, the existing postmaster.pid is enough to fool "pg_ctl start" into thinking the new postmaster is already running.  I fixed that by making pg_ctl reject any PID seen in a pre-existing postmaster.pid file.  That has a nonzero probability of false match, so I would not want to stick it in as a permanent thing on Unix ... but I wonder if it wouldn't be an improvement over the current situation on Windows.

			regards, tom lane

diff --git a/src/bin/pg_ctl/pg_ctl.c b/src/bin/pg_ctl/pg_ctl.c
index dd76be6..316651c 100644
--- a/src/bin/pg_ctl/pg_ctl.c
+++ b/src/bin/pg_ctl/pg_ctl.c
@@ -106,6 +106,7 @@ static char promote_file[MAXPGPATH];
 static char logrotate_file[MAXPGPATH];
 
 static volatile pgpid_t postmasterPID = -1;
+static pgpid_t old_postmaster_pid = 0;
 
 #ifdef WIN32
 static DWORD pgctl_start_type = SERVICE_AUTO_START;
@@ -490,16 +491,17 @@ start_postmaster(void)
 	/*
 	 * Since there might be quotes to handle here, it is easier simply to pass
-	 * everything to a shell to process them.  Use exec so that the postmaster
-	 * has the same PID as the current child process.
+	 * everything to a shell to process them.
+	 *
+	 * Since we aren't telling the shell to directly exec the postmaster,
+	 * the returned PID is a parent process, the same as on Windows.
 	 */
 	if (log_file != NULL)
-		snprintf(cmd, MAXPGPATH, "exec \"%s\" %s%s < \"%s\" >> \"%s\" 2>&1",
-				 exec_path, pgdata_opt, post_opts,
-				 DEVNULL, log_file);
+		snprintf(cmd, MAXPGPATH, "exec < \"%s\" >> \"%s\" 2>&1; \"%s\" %s%s; echo postmaster exit status is $?",
+				 DEVNULL, log_file, exec_path, pgdata_opt, post_opts);
 	else
-		snprintf(cmd, MAXPGPATH, "exec \"%s\" %s%s < \"%s\" 2>&1",
-				 exec_path, pgdata_opt, post_opts, DEVNULL);
+		snprintf(cmd, MAXPGPATH, "exec < \"%s\" 2>&1; \"%s\" %s%s; echo postmaster exit status is $?",
+				 DEVNULL, exec_path, pgdata_opt, post_opts);
 
 	(void) execl("/bin/sh", "/bin/sh", "-c", cmd, (char *) NULL);
@@ -586,12 +588,8 @@ wait_for_postmaster(pgpid_t pm_pid, bool do_checkpoint)
 			pmpid = atol(optlines[LOCK_FILE_LINE_PID - 1]);
 			pmstart = atol(optlines[LOCK_FILE_LINE_START_TIME - 1]);
 			if (pmstart >= start_time - 2 &&
-#ifndef WIN32
-				pmpid == pm_pid
-#else
-				/* Windows can only reject standalone-backend PIDs */
-				pmpid > 0
-#endif
+				/* If pid is the value we saw before starting, assume it's stale */
+				pmpid > 0 && pmpid != old_postmaster_pid
 				)
 			{
 				/*
@@ -621,7 +619,7 @@ wait_for_postmaster(pgpid_t pm_pid, bool do_checkpoint)
 		 * Check whether the child postmaster process is still alive.  This
 		 * lets us exit early if the postmaster fails during startup.
 		 *
-		 * On Windows, we may be checking the postmaster's parent shell, but
+		 * We may be checking the postmaster's parent shell, but
 		 * that's fine for this purpose.
 		 */
 #ifndef WIN32
@@ -823,13 +821,12 @@ do_init(void)
 static void
 do_start(void)
 {
-	pgpid_t		old_pid = 0;
 	pgpid_t		pm_pid;
 
 	if (ctl_command != RESTART_COMMAND)
 	{
-		old_pid = get_pgpid(false);
-		if (old_pid != 0)
+		old_postmaster_pid = get_pgpid(false);
+		if (old_postmaster_pid != 0)
 			write_stderr(_("%s: another server might be running; "
						   "trying to start server anyway\n"),
						 progname);
On Tue, Jul 23, 2019 at 7:29 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Parallel workers aren't ever allowed to write, in the current
> implementation, so it's not real obvious why they'd have any
> WAL log files open at all.

Parallel workers are not forbidden to write WAL, nor are they forbidden to modify blocks.  They could legally HOT-prune, for example, though I'm not positive whether they actually do.  The prohibition is at a higher level: they can't create new tuples or delete existing ones.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
I wrote:
>>> Yeah, I've been wondering whether pg_ctl could fork off a subprocess
>>> that would fork the postmaster, wait for the postmaster to exit, and then
>>> report the exit status.

> [ pushed at 6a5084eed ]
> Given wobbegong's recent failure rate, I don't think we'll have to wait
> long.

Indeed, we didn't:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wobbegong&dt=2019-10-10%2020%3A54%3A46

The tail end of the system log looks like

2019-10-10 21:00:33.717 UTC [15127:306] pg_regress/date FATAL:  postmaster exited during a parallel transaction
2019-10-10 21:00:33.717 UTC [15127:307] pg_regress/date LOG:  disconnection: session time: 0:00:02.896 user=fedora database=regression host=[local]
/bin/sh: line 1: 14168 Segmentation fault      (core dumped) "/home/fedora/build-farm-10-clang/buildroot/HEAD/pgsql.build/tmp_install/home/fedora/build-farm-clang/buildroot/HEAD/inst/bin/postgres" -F -c listen_addresses="" -k "/tmp/pg_upgrade_check-ZrhQ4h"
postmaster exit status is 139

So that's definitive proof that the postmaster is suffering a SIGSEGV.  Unfortunately, we weren't blessed with a stack trace, even though wobbegong is running a buildfarm client version that is new enough to try to collect one.  However, seeing that wobbegong is running a pretty-recent Fedora release, the odds are that systemd-coredump has commandeered the core dump and squirreled it someplace where we can't find it.

Much as one could wish otherwise, systemd doesn't seem likely to either go away or scale back its invasiveness; so I'm afraid we are probably going to need to teach the buildfarm client how to negotiate with systemd-coredump at some point.  I don't much want to do that right this minute, though.

A nearer-term solution would be to reproduce this manually and dig into the core.  Mark, are you in a position to give somebody ssh access to wobbegong's host, or another similarly-configured VM?

(While at it, it'd be nice to investigate the infinite_recurse failures we've been seeing on all those ppc64 critters ...)

			regards, tom lane
On Thu, Oct 10, 2019 at 05:34:51PM -0400, Tom Lane wrote:
> A nearer-term solution would be to reproduce this manually and
> dig into the core.  Mark, are you in a position to give somebody
> ssh access to wobbegong's host, or another similarly-configured VM?
>
> (While at it, it'd be nice to investigate the infinite_recurse
> failures we've been seeing on all those ppc64 critters ...)

Yeah, whoever would like access, just send me your ssh key and the login you'd like to use, and I'll get you set up.

Regards,
Mark

--
Mark Wong
2ndQuadrant - PostgreSQL Solutions for the Enterprise
https://www.2ndQuadrant.com/
On 10/10/19 5:34 PM, Tom Lane wrote:
> I wrote:
>>>> Yeah, I've been wondering whether pg_ctl could fork off a subprocess
>>>> that would fork the postmaster, wait for the postmaster to exit, and then
>>>> report the exit status.
>> [ pushed at 6a5084eed ]
>> Given wobbegong's recent failure rate, I don't think we'll have to wait
>> long.
> Indeed, we didn't:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wobbegong&dt=2019-10-10%2020%3A54%3A46
>
> The tail end of the system log looks like
>
> 2019-10-10 21:00:33.717 UTC [15127:306] pg_regress/date FATAL:  postmaster exited during a parallel transaction
> 2019-10-10 21:00:33.717 UTC [15127:307] pg_regress/date LOG:  disconnection: session time: 0:00:02.896 user=fedora database=regression host=[local]
> /bin/sh: line 1: 14168 Segmentation fault      (core dumped) "/home/fedora/build-farm-10-clang/buildroot/HEAD/pgsql.build/tmp_install/home/fedora/build-farm-clang/buildroot/HEAD/inst/bin/postgres" -F -c listen_addresses="" -k "/tmp/pg_upgrade_check-ZrhQ4h"
> postmaster exit status is 139
>
> So that's definitive proof that the postmaster is suffering a SIGSEGV.
> Unfortunately, we weren't blessed with a stack trace, even though
> wobbegong is running a buildfarm client version that is new enough
> to try to collect one.  However, seeing that wobbegong is running
> a pretty-recent Fedora release, the odds are that systemd-coredump
> has commandeered the core dump and squirreled it someplace where
> we can't find it.

At least on F29 I have set /proc/sys/kernel/core_pattern and it works.

> Much as one could wish otherwise, systemd doesn't seem likely to
> either go away or scale back its invasiveness; so I'm afraid we
> are probably going to need to teach the buildfarm client how to
> negotiate with systemd-coredump at some point.  I don't much want
> to do that right this minute, though.
>
> A nearer-term solution would be to reproduce this manually and
> dig into the core.  Mark, are you in a position to give somebody
> ssh access to wobbegong's host, or another similarly-configured VM?

I have given Mark my SSH key.  That doesn't mean others interested shouldn't.

> (While at it, it'd be nice to investigate the infinite_recurse
> failures we've been seeing on all those ppc64 critters ...)

Yeah.

cheers

andrew

--
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 10/10/19 6:01 PM, Andrew Dunstan wrote:
> On 10/10/19 5:34 PM, Tom Lane wrote:
>> I wrote:
>>>>> Yeah, I've been wondering whether pg_ctl could fork off a subprocess
>>>>> that would fork the postmaster, wait for the postmaster to exit, and then
>>>>> report the exit status.
>>> [ pushed at 6a5084eed ]
>>> Given wobbegong's recent failure rate, I don't think we'll have to wait
>>> long.
>> Indeed, we didn't:
>>
>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wobbegong&dt=2019-10-10%2020%3A54%3A46
>>
>> The tail end of the system log looks like
>>
>> 2019-10-10 21:00:33.717 UTC [15127:306] pg_regress/date FATAL:  postmaster exited during a parallel transaction
>> 2019-10-10 21:00:33.717 UTC [15127:307] pg_regress/date LOG:  disconnection: session time: 0:00:02.896 user=fedora database=regression host=[local]
>> /bin/sh: line 1: 14168 Segmentation fault      (core dumped) "/home/fedora/build-farm-10-clang/buildroot/HEAD/pgsql.build/tmp_install/home/fedora/build-farm-clang/buildroot/HEAD/inst/bin/postgres" -F -c listen_addresses="" -k "/tmp/pg_upgrade_check-ZrhQ4h"
>> postmaster exit status is 139
>>
>> So that's definitive proof that the postmaster is suffering a SIGSEGV.
>> Unfortunately, we weren't blessed with a stack trace, even though
>> wobbegong is running a buildfarm client version that is new enough
>> to try to collect one.  However, seeing that wobbegong is running
>> a pretty-recent Fedora release, the odds are that systemd-coredump
>> has commandeered the core dump and squirreled it someplace where
>> we can't find it.
>
> At least on F29 I have set /proc/sys/kernel/core_pattern and it works.

I have done the same on this machine.  wobbegong runs every hour, so let's see what happens next.  With any luck the buildfarm will give us a stack trace without needing further action.

cheers

andrew

--
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes: >> At least on F29 I have set /proc/sys/kernel/core_pattern and it works. FWIW, I'm not excited about that as a permanent solution. It requires root privilege, and it affects the whole machine not only the buildfarm, and making it persist across reboots is even more invasive. > I have done the same on this machine. wobbegong runs every hour, so > let's see what happens next. With any luck the buildfarm will give us a > stack trace without needing further action. I already collected one manually. It looks like this: Program terminated with signal SIGSEGV, Segmentation fault. #0 sigusr1_handler (postgres_signal_arg=10) at postmaster.c:5114 5114 { Missing separate debuginfos, use: dnf debuginfo-install glibc-2.26-30.fc27.ppc64le (gdb) bt #0 sigusr1_handler (postgres_signal_arg=10) at postmaster.c:5114 #1 <signal handler called> #2 0x00007fff93923ca4 in sigprocmask () from /lib64/libc.so.6 #3 0x00000000103fad08 in reaper (postgres_signal_arg=<optimized out>) at postmaster.c:3215 #4 <signal handler called> #5 0x00007fff93923ca4 in sigprocmask () from /lib64/libc.so.6 #6 0x00000000103f9f98 in sigusr1_handler (postgres_signal_arg=<optimized out>) at postmaster.c:5275 #7 <signal handler called> #8 0x00007fff93923ca4 in sigprocmask () from /lib64/libc.so.6 #9 0x00000000103fad08 in reaper (postgres_signal_arg=<optimized out>) at postmaster.c:3215 #10 <signal handler called> #11 sigusr1_handler (postgres_signal_arg=10) at postmaster.c:5114 #12 <signal handler called> #13 0x00007fff93923ca4 in sigprocmask () from /lib64/libc.so.6 #14 0x00000000103f9f98 in sigusr1_handler (postgres_signal_arg=<optimized out>) at postmaster.c:5275 #15 <signal handler called> #16 0x00007fff93923ca4 in sigprocmask () from /lib64/libc.so.6 #17 0x00000000103fad08 in reaper (postgres_signal_arg=<optimized out>) at postmaster.c:3215 ... #572 <signal handler called> #573 0x00007fff93923ca4 in sigprocmask () from /lib64/libc.so.6 #574 0x00000000103f9f98 in sigusr1_handler ( postgres_signal_arg=<optimized out>) at postmaster.c:5275 #575 <signal handler called> #576 0x00007fff93923ca4 in sigprocmask () from /lib64/libc.so.6 #577 0x00000000103fad08 in reaper (postgres_signal_arg=<optimized out>) at postmaster.c:3215 #578 <signal handler called> #579 sigusr1_handler (postgres_signal_arg=10) at postmaster.c:5114 #580 <signal handler called> #581 0x00007fff93a01514 in select () from /lib64/libc.so.6 #582 0x00000000103f7ad8 in ServerLoop () at postmaster.c:1682 #583 PostmasterMain (argc=<optimized out>, argv=<optimized out>) at postmaster.c:1391 #584 0x0000000000000000 in ?? () What we've apparently got here is that signals were received so fast that the postmaster ran out of stack space. I remember Andres complaining about this as a theoretical threat, but I hadn't seen it in the wild before. I haven't finished investigating though, as there are some things that remain to be explained. The dependency on having force_parallel_mode = regress makes sense now, because the extra traffic to launch and reap all those parallel workers would increase the stress on the postmaster (and it seems likely that this stack trace corresponds exactly to alternating launch and reap signals). But why does it only happen during the pg_upgrade test --- plain "make check" ought to be about the same? I also want to investigate why clang builds seem more prone to this than gcc builds on the same machine; that might just be down to more or less stack consumption, but it bears looking into. 
regards, tom lane
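To see the mechanism Tom describes in isolation, here is a minimal standalone sketch (not PostgreSQL code, and not part of the original thread; all names in it are made up for illustration). The handler re-enables its own signal before returning, the way the postmaster's handlers called PG_SETMASK(&UnBlockSig), so a pending signal re-enters the handler on top of the still-live frame and the stack grows with each nesting level:

#include <signal.h>
#include <stdio.h>
#include <string.h>

static volatile sig_atomic_t depth = 0;
static volatile sig_atomic_t max_depth = 0;

static void
handler(int signo)
{
	sigset_t	unblock;

	depth++;
	if (depth > max_depth)
		max_depth = depth;

	/* Queue another SIGUSR1; it stays pending for now, because the
	 * kernel blocks the signal being handled while its handler runs. */
	if (max_depth < 5)
		raise(SIGUSR1);

	/* Unblocking here is the analogue of PG_SETMASK(&UnBlockSig): the
	 * pending SIGUSR1 is delivered immediately, nesting a new handler
	 * frame on top of this one before we return. */
	sigemptyset(&unblock);
	sigaddset(&unblock, SIGUSR1);
	sigprocmask(SIG_UNBLOCK, &unblock, NULL);

	depth--;
}

int
main(void)
{
	struct sigaction act;

	memset(&act, 0, sizeof(act));
	act.sa_handler = handler;
	sigemptyset(&act.sa_mask);
	act.sa_flags = 0;
	sigaction(SIGUSR1, &act, NULL);

	raise(SIGUSR1);
	printf("deepest handler nesting observed: %d\n", (int) max_depth);
	return 0;
}

Each nesting level costs a full signal frame of stack, which is the multiplication effect visible in the roughly 580-frame backtrace above; if the handler never unblocked the signal itself, or if sigaction() were told to keep the other signals blocked via sa_mask, the pending signal would only be delivered after the current frame had been popped.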
On 10/11/19 11:45 AM, Tom Lane wrote: > Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes: >>> At least on F29 I have set /proc/sys/kernel/core_pattern and it works. > FWIW, I'm not excited about that as a permanent solution. It requires > root privilege, and it affects the whole machine not only the buildfarm, > and making it persist across reboots is even more invasive. OK, but I'm not keen to have to tussle with coredumpctl. Right now our logic says: for every core file in the data directory try to get a backtrace. Use of systemd-coredump means that gets blown out of the water, and we no longer even have a simple test to see if our program caused a core dump. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
I wrote: > What we've apparently got here is that signals were received > so fast that the postmaster ran out of stack space. I remember > Andres complaining about this as a theoretical threat, but I > hadn't seen it in the wild before. > I haven't finished investigating though, as there are some things > that remain to be explained. I still don't have a good explanation for why this only seems to happen in the pg_upgrade test sequence. However, I did notice something very interesting: the postmaster crashes after consuming only about 1MB of stack space. This is despite the prevailing setting of "ulimit -s" being 8192 (8MB). I also confirmed that the value of max_stack_depth within the crashed process is 2048, which implies that get_stack_depth_rlimit got some value larger than 2MB from getrlimit(RLIMIT_STACK). And yet, here we have a crash, and the process memory map confirms that only 1MB was allocated in the stack region. So it's really hard to explain that as anything except a kernel bug: sometimes, the kernel doesn't give us as much stack as it promised it would. And the machine is not loaded enough for there to be any rational resource-exhaustion excuse for that. This matches up with the intermittent infinite_recurse failures we've been seeing in the buildfarm. Those are happening across a range of systems, but they're (almost) all Linux-based ppc64, suggesting that there's a longstanding arch-specific kernel bug involved. For reference, I scraped the attached list of such failures in the last three months. I wonder whether we can get the attention of any kernel hackers about that. Anyway, as to what to do about it --- it occurred to me to wonder why we are relying on having the signal handlers block and unblock signals manually, when we could tell sigaction() that we'd like signals blocked. It is reasonable to expect that the signal support is designed to not recursively consume stack space in the face of a series of signals, while the way we are doing it clearly opens us up to recursive space consumption. The stack trace I showed before proves that the recursion happens at the points where the signal handlers unblock signals. As a quick hack I made the attached patch, and it seems to fix the problem on wobbegong's host. I don't see crashes any more, and watching the postmaster's stack space consumption, it stays comfortably at a tad under 200KB (probably the default initial allocation), while without the patch it tends to blow up to 700K or more even in runs that don't crash. This patch isn't committable as-is because it will (I suppose) break things on Windows; we still need the old way there for lack of sigaction(). But that could be fixed with a few #ifdefs. I'm also kind of tempted to move pqsignal_no_restart into backend/libpq/pqsignal.c (where BlockSig is defined) and maybe rename it, but I'm not sure to what. This issue might go away if we switched to a postmaster implementation that doesn't do work in the signal handlers, but I'm not entirely convinced of that. The existing handlers don't seem to consume a lot of stack space in themselves (there's not many local variables in them). The bulk of the stack consumption is seemingly in the platform's signal infrastructure, so that we might still have a stack consumption issue even with fairly trivial handlers, if we don't tell sigaction to block signals. In any case, this fix seems potentially back-patchable, while we surely wouldn't risk back-patching a postmaster rewrite. Comments? 
regards, tom lane sysname | architecture | operating_system | sys_owner | branch | snapshot | stage | err --------------+------------------+--------------------+-----------------+---------------+---------------------+-----------------+--------------------------------------------------------------------------------------------------------------- cavefish | ppc64le (POWER9) | Ubuntu | Mark Wong | HEAD | 2019-07-13 03:49:38 | pg_upgradeCheck| 2019-07-13 04:01:23.437 UTC [9365:71] DETAIL: Failed process was running: select infinite_recurse(); pintail | ppc64le (POWER9) | Debian GNU/Linux | Mark Wong | REL_12_STABLE | 2019-07-13 19:36:51 | Check | 2019-07-13 19:39:29.013 UTC [31086:5] DETAIL: Failed process was running: select infinite_recurse(); bonito | ppc64le (POWER9) | Fedora | Mark Wong | HEAD | 2019-07-19 23:13:01 | Check | 2019-07-19 23:16:33.330 UTC [24191:70] DETAIL: Failed process was running: select infinite_recurse(); takin | ppc64le | opensuse | Mark Wong | HEAD | 2019-07-24 08:24:56 | Check | 2019-07-24 08:28:01.735 UTC [16366:75] DETAIL: Failed process was running: select infinite_recurse(); quokka | ppc64 | RHEL | Sandeep Thakkar | HEAD | 2019-07-31 02:00:07 | pg_upgradeCheck| 2019-07-31 03:04:04.043 BST [5d40f709.776a:5] DETAIL: Failed process was running: select infinite_recurse(); elasmobranch | ppc64le (POWER9) | openSUSE Leap | Mark Wong | HEAD | 2019-08-01 03:13:38 | Check | 2019-08-01 03:19:05.394 UTC [22888:62] DETAIL: Failed process was running: select infinite_recurse(); buri | ppc64le (POWER9) | CentOS Linux | Mark Wong | HEAD | 2019-08-02 00:10:23 | Check | 2019-08-02 00:17:11.075 UTC [28222:73] DETAIL: Failed process was running: select infinite_recurse(); urocryon | ppc64le | debian | Mark Wong | HEAD | 2019-08-02 05:43:46 | Check | 2019-08-02 05:51:51.944 UTC [2724:64] DETAIL: Failed process was running: select infinite_recurse(); batfish | ppc64le | Ubuntu | Mark Wong | HEAD | 2019-08-04 19:02:36 | pg_upgradeCheck| 2019-08-04 19:08:11.728 UTC [23899:79] DETAIL: Failed process was running: select infinite_recurse(); buri | ppc64le (POWER9) | CentOS Linux | Mark Wong | REL_12_STABLE | 2019-08-07 00:03:29 | pg_upgradeCheck| 2019-08-07 00:11:24.500 UTC [1405:5] DETAIL: Failed process was running: select infinite_recurse(); quokka | ppc64 | RHEL | Sandeep Thakkar | REL_12_STABLE | 2019-08-08 02:43:45 | pg_upgradeCheck| 2019-08-08 03:47:38.115 BST [5d4b8d3f.cdd7:5] DETAIL: Failed process was running: select infinite_recurse(); quokka | ppc64 | RHEL | Sandeep Thakkar | HEAD | 2019-08-08 14:00:08 | Check | 2019-08-08 15:02:59.770 BST [5d4c2b88.cad9:5] DETAIL: Failed process was running: select infinite_recurse(); mereswine | ARMv7 | Linux debian-armhf | Clarence Ho | REL_11_STABLE | 2019-08-11 02:10:12 | InstallCheck-C | 2019-08-11 02:36:10.159 PDT [5004:4] DETAIL: Failed process was running: select infinite_recurse(); takin | ppc64le | opensuse | Mark Wong | HEAD | 2019-08-11 08:02:48 | Check | 2019-08-11 08:05:57.789 UTC [11500:67] DETAIL: Failed process was running: select infinite_recurse(); mereswine | ARMv7 | Linux debian-armhf | Clarence Ho | REL_12_STABLE | 2019-08-11 09:52:46 | pg_upgradeCheck| 2019-08-11 04:21:16.756 PDT [6804:5] DETAIL: Failed process was running: select infinite_recurse(); mereswine | ARMv7 | Linux debian-armhf | Clarence Ho | HEAD | 2019-08-11 11:29:27 | pg_upgradeCheck| 2019-08-11 07:15:28.454 PDT [9954:76] DETAIL: Failed process was running: select infinite_recurse(); demoiselle | ppc64le (POWER9) | openSUSE Leap | Mark Wong | HEAD | 
2019-08-11 14:51:38 | pg_upgradeCheck| 2019-08-11 14:57:29.422 UTC [9436:70] DETAIL: Failed process was running: select infinite_recurse(); buri | ppc64le (POWER9) | CentOS Linux | Mark Wong | HEAD | 2019-08-15 00:09:57 | Check | 2019-08-15 00:17:43.282 UTC [2831:68] DETAIL: Failed process was running: select infinite_recurse(); urocryon | ppc64le | debian | Mark Wong | HEAD | 2019-08-19 06:28:34 | Check | 2019-08-19 06:39:25.749 UTC [26357:66] DETAIL: Failed process was running: select infinite_recurse(); urocryon | ppc64le | debian | Mark Wong | HEAD | 2019-08-21 06:34:47 | Check | 2019-08-21 06:37:39.089 UTC [14505:73] DETAIL: Failed process was running: select infinite_recurse(); demoiselle | ppc64le (POWER9) | openSUSE Leap | Mark Wong | REL_12_STABLE | 2019-09-04 14:42:08 | pg_upgradeCheck| 2019-09-04 14:56:15.219 UTC [11008:5] DETAIL: Failed process was running: select infinite_recurse(); pintail | ppc64le (POWER9) | Debian GNU/Linux | Mark Wong | REL_12_STABLE | 2019-09-07 19:22:48 | pg_upgradeCheck| 2019-09-07 19:27:20.789 UTC [25645:5] DETAIL: Failed process was running: select infinite_recurse(); quokka | ppc64 | RHEL | Sandeep Thakkar | REL_12_STABLE | 2019-09-10 02:00:15 | Check | 2019-09-10 03:03:17.711 BST [5d77045a.5776:5] DETAIL: Failed process was running: select infinite_recurse(); buri | ppc64le (POWER9) | CentOS Linux | Mark Wong | HEAD | 2019-09-17 23:12:33 | Check | 2019-09-17 23:19:45.769 UTC [20920:77] DETAIL: Failed process was running: select infinite_recurse(); shoveler | ppc64le (POWER8) | Debian GNU/Linux | Mark Wong | HEAD | 2019-09-18 13:52:55 | Check | 2019-09-18 13:56:11.273 UTC [563:71] DETAIL: Failed process was running: select infinite_recurse(); buri | ppc64le (POWER9) | CentOS Linux | Mark Wong | HEAD | 2019-09-19 00:01:54 | Check | 2019-09-19 00:09:30.734 UTC [11775:67] DETAIL: Failed process was running: select infinite_recurse(); gadwall | ppc64le (POWER9) | Debian GNU/Linux | Mark Wong | HEAD | 2019-09-21 12:26:50 | Check | 2019-09-21 12:31:16.199 UTC [7119:70] DETAIL: Failed process was running: select infinite_recurse(); quokka | ppc64 | RHEL | Sandeep Thakkar | HEAD | 2019-09-24 14:00:11 | pg_upgradeCheck| 2019-09-24 15:04:49.272 BST [5d8a2276.cba9:5] DETAIL: Failed process was running: select infinite_recurse(); urocryon | ppc64le | debian | Mark Wong | HEAD | 2019-09-25 06:24:24 | Check | 2019-09-25 06:31:54.876 UTC [26608:76] DETAIL: Failed process was running: select infinite_recurse(); pintail | ppc64le (POWER9) | Debian GNU/Linux | Mark Wong | HEAD | 2019-09-26 19:33:59 | Check | 2019-09-26 19:39:25.850 UTC [6259:69] DETAIL: Failed process was running: select infinite_recurse(); shoveler | ppc64le (POWER8) | Debian GNU/Linux | Mark Wong | HEAD | 2019-09-28 13:54:16 | Check | 2019-09-28 13:59:02.354 UTC [7586:71] DETAIL: Failed process was running: select infinite_recurse(); buri | ppc64le (POWER9) | CentOS Linux | Mark Wong | REL_12_STABLE | 2019-09-28 23:14:23 | pg_upgradeCheck| 2019-09-28 23:22:13.987 UTC [20133:5] DETAIL: Failed process was running: select infinite_recurse(); gadwall | ppc64le (POWER9) | Debian GNU/Linux | Mark Wong | HEAD | 2019-10-02 12:44:49 | Check | 2019-10-02 12:50:17.823 UTC [10840:76] DETAIL: Failed process was running: select infinite_recurse(); cavefish | ppc64le (POWER9) | Ubuntu | Mark Wong | HEAD | 2019-10-04 04:37:58 | Check | 2019-10-04 04:46:03.804 UTC [27768:69] DETAIL: Failed process was running: select infinite_recurse(); cavefish | ppc64le (POWER9) | Ubuntu | Mark Wong | HEAD | 2019-10-07 03:22:37 | 
pg_upgradeCheck| 2019-10-07 03:28:05.031 UTC [2991:68] DETAIL: Failed process was running: select infinite_recurse(); bufflehead | ppc64le (POWER8) | openSUSE Leap | Mark Wong | HEAD | 2019-10-09 20:46:56 | pg_upgradeCheck| 2019-10-09 20:51:47.408 UTC [18136:86] DETAIL: Failed process was running: select infinite_recurse(); vulpes | ppc64le | fedora | Mark Wong | HEAD | 2019-10-11 08:53:50 | Check | 2019-10-11 08:57:59.370 UTC [14908:77] DETAIL: Failed process was running: select infinite_recurse(); shoveler | ppc64le (POWER8) | Debian GNU/Linux | Mark Wong | HEAD | 2019-10-11 13:54:38 | pg_upgradeCheck| 2019-10-11 14:01:53.903 UTC [5911:76] DETAIL: Failed process was running: select infinite_recurse(); (38 rows) diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c index 85f15a5..fff83b7 100644 --- a/src/backend/postmaster/postmaster.c +++ b/src/backend/postmaster/postmaster.c @@ -2640,8 +2640,6 @@ SIGHUP_handler(SIGNAL_ARGS) { int save_errno = errno; - PG_SETMASK(&BlockSig); - if (Shutdown <= SmartShutdown) { ereport(LOG, @@ -2700,8 +2698,6 @@ SIGHUP_handler(SIGNAL_ARGS) #endif } - PG_SETMASK(&UnBlockSig); - errno = save_errno; } @@ -2714,8 +2710,6 @@ pmdie(SIGNAL_ARGS) { int save_errno = errno; - PG_SETMASK(&BlockSig); - ereport(DEBUG2, (errmsg_internal("postmaster received signal %d", postgres_signal_arg))); @@ -2880,8 +2874,6 @@ pmdie(SIGNAL_ARGS) break; } - PG_SETMASK(&UnBlockSig); - errno = save_errno; } @@ -2895,8 +2887,6 @@ reaper(SIGNAL_ARGS) int pid; /* process id of dead child process */ int exitstatus; /* its exit status */ - PG_SETMASK(&BlockSig); - ereport(DEBUG4, (errmsg_internal("reaping dead processes"))); @@ -3212,8 +3202,6 @@ reaper(SIGNAL_ARGS) PostmasterStateMachine(); /* Done with signal handler */ - PG_SETMASK(&UnBlockSig); - errno = save_errno; } @@ -5114,8 +5102,6 @@ sigusr1_handler(SIGNAL_ARGS) { int save_errno = errno; - PG_SETMASK(&BlockSig); - /* Process background worker state change. */ if (CheckPostmasterSignal(PMSIGNAL_BACKGROUND_WORKER_CHANGE)) { @@ -5272,8 +5258,6 @@ sigusr1_handler(SIGNAL_ARGS) signal_child(StartupPID, SIGUSR2); } - PG_SETMASK(&UnBlockSig); - errno = save_errno; } diff --git a/src/port/pqsignal.c b/src/port/pqsignal.c index ecb9ca2..93a039b 100644 --- a/src/port/pqsignal.c +++ b/src/port/pqsignal.c @@ -65,7 +65,11 @@ pqsignal(int signo, pqsigfunc func) * * On Windows, this would be identical to pqsignal(), so don't bother. */ -#ifndef WIN32 +#ifndef FRONTEND + +extern sigset_t UnBlockSig, + BlockSig, + StartupBlockSig; pqsigfunc pqsignal_no_restart(int signo, pqsigfunc func) @@ -74,7 +78,7 @@ pqsignal_no_restart(int signo, pqsigfunc func) oact; act.sa_handler = func; - sigemptyset(&act.sa_mask); + act.sa_mask = BlockSig; act.sa_flags = 0; #ifdef SA_NOCLDSTOP if (signo == SIGCHLD)
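The attached patch is hard to read in the flattened form above, so here is a rough standalone sketch of the idea (a paraphrase for illustration, not the committed change; set_handler_blocking is an invented name): the PG_SETMASK(&BlockSig)/PG_SETMASK(&UnBlockSig) pairs come out of the postmaster's handlers, and sigaction() is asked to hold the relevant signals blocked for the duration of the handler instead.

#include <signal.h>

typedef void (*pqsigfunc) (int);

/* Install func for signo such that every signal in *blockset is blocked
 * while func executes, letting the kernel do the masking that the
 * handler bodies used to do by hand. */
static pqsigfunc
set_handler_blocking(int signo, pqsigfunc func, const sigset_t *blockset)
{
	struct sigaction act,
				oact;

	act.sa_handler = func;
	act.sa_mask = *blockset;	/* held blocked while func runs */
	act.sa_flags = 0;			/* no SA_RESTART, as in pqsignal_no_restart */
#ifdef SA_NOCLDSTOP
	if (signo == SIGCHLD)
		act.sa_flags |= SA_NOCLDSTOP;
#endif
	if (sigaction(signo, &act, &oact) < 0)
		return SIG_ERR;
	return oact.sa_handler;
}

Because the signal mask is saved and restored by the kernel as part of handler entry and exit, there is no window at the end of the handler where a pending signal can push another frame onto the stack.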
Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes: > On 10/11/19 11:45 AM, Tom Lane wrote: >> FWIW, I'm not excited about that as a permanent solution. It requires >> root privilege, and it affects the whole machine not only the buildfarm, >> and making it persist across reboots is even more invasive. > OK, but I'm not keen to have to tussle with coredumpctl. Right now our > logic says: for every core file in the data directory try to get a > backtrace. Use of systemd-coredump means that gets blown out of the > water, and we no longer even have a simple test to see if our program > caused a core dump. I haven't played that much with this software, but it seems you can do "coredumpctl list <path-to-executable>" to find out what it has for a particular executable. You would likely need a time-based filter too (to avoid regurgitating previous runs' failures), but that seems do-able. regards, tom lane
On Sat, Oct 12, 2019 at 7:56 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > This matches up with the intermittent infinite_recurse failures > we've been seeing in the buildfarm. Those are happening across > a range of systems, but they're (almost) all Linux-based ppc64, > suggesting that there's a longstanding arch-specific kernel bug > involved. For reference, I scraped the attached list of such > failures in the last three months. I wonder whether we can get > the attention of any kernel hackers about that. Yeah, I don't know anything about this stuff, but I was also beginning to wonder if something is busted in the arch-specific fault.c code that checks if stack expansion is valid[1], in a way that fails with a rapidly growing stack, well timed incoming signals, and perhaps Docker/LXC (that's on Mark's systems IIUC, not sure about the ARM boxes that failed or if it could be relevant here). Perhaps the arbitrary tolerances mentioned in that comment are relevant. [1] https://github.com/torvalds/linux/blob/master/arch/powerpc/mm/fault.c#L244
Thomas Munro <thomas.munro@gmail.com> writes: > Yeah, I don't know anything about this stuff, but I was also beginning > to wonder if something is busted in the arch-specific fault.c code > that checks if stack expansion is valid[1], in a way that fails with a > rapidly growing stack, well timed incoming signals, and perhaps > Docker/LXC (that's on Mark's systems IIUC, not sure about the ARM > boxes that failed or if it could be relevant here). Perhaps the > arbitrary tolerances mentioned in that comment are relevant. > [1] https://github.com/torvalds/linux/blob/master/arch/powerpc/mm/fault.c#L244 Hm, the bit about "we'll allow up to 1MB unconditionally" sure seems to match up with the observations here. I also wonder about the arbitrary definition of "a long way" as 2KB. Could it be that that misbehaves in the face of a userland function with more than 2KB of local variables? It's not very clear how those things would lead to an intermittent failure though. In the case of the postmaster crashes, we now see that timing of signal receipts is relevant. For infinite_recurse, maybe it only fails if an sinval interrupt happens at the wrong time? (This theory would predict that commit 798070ec0 made the problem way more prevalent than it had been ... need to go see if the buildfarm history supports that.) regards, tom lane
On Sat, Oct 12, 2019 at 08:41:12AM +1300, Thomas Munro wrote:
> On Sat, Oct 12, 2019 at 7:56 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > This matches up with the intermittent infinite_recurse failures
> > we've been seeing in the buildfarm. Those are happening across
> > a range of systems, but they're (almost) all Linux-based ppc64,
> > suggesting that there's a longstanding arch-specific kernel bug
> > involved. For reference, I scraped the attached list of such
> > failures in the last three months. I wonder whether we can get
> > the attention of any kernel hackers about that.
>
> Yeah, I don't know anything about this stuff, but I was also beginning
> to wonder if something is busted in the arch-specific fault.c code
> that checks if stack expansion is valid[1], in a way that fails with a
> rapidly growing stack, well timed incoming signals, and perhaps
> Docker/LXC (that's on Mark's systems IIUC, not sure about the ARM
> boxes that failed or if it could be relevant here). Perhaps the
> arbitrary tolerances mentioned in that comment are relevant.

This specific one (wobbegong) is OpenStack/KVM[2], for what it's worth...

"... cluster is an OpenStack based cluster offering POWER8 & POWER9 LE instances running on KVM ..."

But to keep you on your toes, some of my ppc animals are Docker within
another OpenStack/KVM instance...

Regards,
Mark

[1] https://github.com/torvalds/linux/blob/master/arch/powerpc/mm/fault.c#L244
[2] https://osuosl.org/services/powerdev/

--
Mark Wong
2ndQuadrant - PostgreSQL Solutions for the Enterprise
https://www.2ndQuadrant.com/
Hi,

On 2019-10-11 14:56:41 -0400, Tom Lane wrote:
> I still don't have a good explanation for why this only seems to
> happen in the pg_upgrade test sequence. However, I did notice
> something very interesting: the postmaster crashes after consuming
> only about 1MB of stack space. This is despite the prevailing
> setting of "ulimit -s" being 8192 (8MB). I also confirmed that
> the value of max_stack_depth within the crashed process is 2048,
> which implies that get_stack_depth_rlimit got some value larger
> than 2MB from getrlimit(RLIMIT_STACK). And yet, here we have
> a crash, and the process memory map confirms that only 1MB was
> allocated in the stack region. So it's really hard to explain
> that as anything except a kernel bug: sometimes, the kernel
> doesn't give us as much stack as it promised it would. And the
> machine is not loaded enough for there to be any rational
> resource-exhaustion excuse for that.

Linux expands stack space only on demand, thus it's possible to run out
of stack space while there ought to be stack space. Unfortunately that
happens during a stack expansion, which means there's no easy place to
report it. I've seen this be hit in production on busy machines.

I wonder if the machine is configured with overcommit_memory=2,
i.e. don't overcommit. cat /proc/sys/vm/overcommit_memory would tell.

What does
grep -E '^(Mem|Commit)' /proc/meminfo
show while it's happening?

What does the signal information say? You can see it with
p $_siginfo
after receiving the signal. A SIGSEGV here, I assume. IIRC si_code and
si_errno should indicate whether ENOMEM is the reason.

> This matches up with the intermittent infinite_recurse failures
> we've been seeing in the buildfarm. Those are happening across
> a range of systems, but they're (almost) all Linux-based ppc64,
> suggesting that there's a longstanding arch-specific kernel bug
> involved. For reference, I scraped the attached list of such
> failures in the last three months. I wonder whether we can get
> the attention of any kernel hackers about that.

Most of them are operated by Mark, right? So it could also just be high
memory pressure on those.

> Anyway, as to what to do about it --- it occurred to me to wonder
> why we are relying on having the signal handlers block and unblock
> signals manually, when we could tell sigaction() that we'd like
> signals blocked. It is reasonable to expect that the signal support
> is designed to not recursively consume stack space in the face of
> a series of signals, while the way we are doing it clearly opens
> us up to recursive space consumption. The stack trace I showed
> before proves that the recursion happens at the points where the
> signal handlers unblock signals.

Yea, that seems like it might be good. But we have to be careful too, as
there are some things we do want to be interruptible from within a signal
handler. We start some processes from within one after all...

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes: > On 2019-10-11 14:56:41 -0400, Tom Lane wrote: >> ... So it's really hard to explain >> that as anything except a kernel bug: sometimes, the kernel >> doesn't give us as much stack as it promised it would. And the >> machine is not loaded enough for there to be any rational >> resource-exhaustion excuse for that. > Linux expands stack space only on demand, thus it's possible to run out > of stack space while there ought to be stack space. Unfortunately that > during a stack expansion, which means there's no easy place to report > that. I've seen this be hit in production on busy machines. As I said, this machine doesn't seem busy enough for that to be a tenable excuse; there's nobody but me logged in, and the buildfarm critter isn't running. > I wonder if the machine is configured with overcommit_memory=2, > i.e. don't overcommit. cat /proc/sys/vm/overcommit_memory would tell. $ cat /proc/sys/vm/overcommit_memory 0 > What does grep -E '^(Mem|Commit)' /proc/meminfo show while it's > happening? idle: $ grep -E '^(Mem|Commit)' /proc/meminfo MemTotal: 2074816 kB MemFree: 36864 kB MemAvailable: 1779584 kB CommitLimit: 1037376 kB Committed_AS: 412480 kB a few captures while regression tests are running: $ grep -E '^(Mem|Commit)' /proc/meminfo MemTotal: 2074816 kB MemFree: 8512 kB MemAvailable: 1819264 kB CommitLimit: 1037376 kB Committed_AS: 371904 kB $ grep -E '^(Mem|Commit)' /proc/meminfo MemTotal: 2074816 kB MemFree: 32640 kB MemAvailable: 1753792 kB CommitLimit: 1037376 kB Committed_AS: 585984 kB $ grep -E '^(Mem|Commit)' /proc/meminfo MemTotal: 2074816 kB MemFree: 56640 kB MemAvailable: 1695744 kB CommitLimit: 1037376 kB Committed_AS: 568768 kB > What does the signal information say? You can see it with > p $_siginfo > after receiving the signal. A SIGSEGV here, I assume. (gdb) p $_siginfo $1 = {si_signo = 11, si_errno = 0, si_code = 128, _sifields = {_pad = {0 <repeats 28 times>}, _kill = {si_pid = 0, si_uid= 0}, _timer = {si_tid = 0, si_overrun = 0, si_sigval = {sival_int = 0, sival_ptr = 0x0}}, _rt = {si_pid = 0, si_uid = 0, si_sigval= { sival_int = 0, sival_ptr = 0x0}}, _sigchld = {si_pid = 0, si_uid = 0, si_status = 0, si_utime = 0, si_stime = 0},_sigfault = { si_addr = 0x0}, _sigpoll = {si_band = 0, si_fd = 0}}} > Yea, that seems like it might be good. But we have to be careful too, as > there's some thing were do want to be interruptable from within a signal > handler. We start some processes from within one after all... The proposed patch has zero effect on what the signal mask will be inside a signal handler, only on the transient state during handler entry/exit. regards, tom lane
On Sat, Oct 12, 2019 at 9:40 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2019-10-11 14:56:41 -0400, Tom Lane wrote:
> >> ... So it's really hard to explain
> >> that as anything except a kernel bug: sometimes, the kernel
> >> doesn't give us as much stack as it promised it would. And the
> >> machine is not loaded enough for there to be any rational
> >> resource-exhaustion excuse for that.
>
> > Linux expands stack space only on demand, thus it's possible to run out
> > of stack space while there ought to be stack space. Unfortunately that
> > happens during a stack expansion, which means there's no easy place to
> > report it. I've seen this be hit in production on busy machines.
>
> As I said, this machine doesn't seem busy enough for that to be a
> tenable excuse; there's nobody but me logged in, and the buildfarm
> critter isn't running.

Yeah. As I speculated in the other thread[1], the straightforward
can't-allocate-any-more-space-but-no-other-way-to-tell-you-that case,
i.e., the explanation that doesn't involve a bug in Linux or PostgreSQL,
seems unlikely unless we also see other more obvious signs of occasional
overcommit problems (i.e. not during stack expansion) on those hosts,
doesn't it? How likely is it that this 1-2MB of stack space is the straw
that breaks the camel's back, every time?

[1] https://www.postgresql.org/message-id/CA%2BhUKGJ_MkqdEH-LmmebhNLSFeyWwvYVXfPaz3A2_p27EQfZwA%40mail.gmail.com
I wrote: > It's not very clear how those things would lead to an intermittent > failure though. In the case of the postmaster crashes, we now see > that timing of signal receipts is relevant. For infinite_recurse, > maybe it only fails if an sinval interrupt happens at the wrong time? > (This theory would predict that commit 798070ec0 made the problem > way more prevalent than it had been ... need to go see if the > buildfarm history supports that.) That seems to fit, roughly: commit 798070ec0 moved errors.sql to execute as part of a parallel group on 2019-04-11, and the first failure of the infinite_recurse test happened on 2019-04-27. Since then we've averaged about one such failure every four days, which makes a sixteen-day gap a little more than you'd expect, but not a huge amount more. Anyway, I do not see any other commits in between that would plausibly have affected this. In other news, I reproduced the problem with gcc on wobbegong's host, and confirmed that the gcc build uses less stack space: one recursive cycle of reaper() and sigusr1_handler() consumes 14768 bytes with clang, but just 9296 bytes with gcc. So the evident difference in failure rate between wobbegong and vulpes is nicely explained by that. Still no theory about pg_upgrade versus vanilla "make check" though. I did manage to make it happen during "make check" by dint of reducing the "ulimit -s" setting, so it's *possible* for it to happen there, it just doesn't. Weird. regards, tom lane
I've now also been able to reproduce the "infinite_recurse" segfault on wobbegong's host (or, since I was using a gcc build, I guess I should say vulpes' host). The first-order result is that it's the same problem with the kernel not giving us as much stack space as we expect: there's only 1179648 bytes in the stack segment in the core dump, though we should certainly have been allowed at least 8MB. The next interesting thing is that looking closely at the identified spot of the SIGSEGV, there's nothing there that should be touching the stack at all: (gdb) x/4i $pc => 0x10201df0 <core_yylex+1072>: ld r9,0(r30) 0x10201df4 <core_yylex+1076>: ld r8,128(r30) 0x10201df8 <core_yylex+1080>: ld r10,152(r30) 0x10201dfc <core_yylex+1084>: ld r9,0(r9) (r30 is not pointing at the stack, but at a valid heap location.) This code is the start of the switch case at scan.l:1064, so the most recent successfully-executed instructions were the switch jump, and they don't involve the stack either. The reported sp, (gdb) i reg sp sp 0x7fffe6940890 140737061849232 is a good 2192 bytes above the bottom of the allocated stack space, which is 0x7fffe6940000 according to gdb. So we really ought to have plenty of margin here. What's going on? What I suspect, given the difficulty of reproducing this, is that what really happened is that the kernel tried to deliver a SIGUSR1 signal to us just at this point. The kernel source code that Thomas pointed to comments that * The kernel signal delivery code writes up to about 1.5kB * below the stack pointer (r1) before decrementing it. There's more than 1.5kB available below sp, but what if that comment is a lie? In particular, I'm wondering if that number dates to PPC32 and needs to be doubled, or nearly so, to describe PPC64 reality. If that were the case, then the signal code would not have been able to fit its requirement, and would probably have come here to ask for more stack space, and the hard-wired 2048 test a little further down would have decided that that was a wild stack access. In short, my current belief is that Linux PPC64 fails when trying to deliver a signal if there's right around 2KB of stack remaining, even though it should be able to expand the stack and press on. It may well be that the reason is just that this heuristic in bad_stack_expansion() is out of date. Or there might be a similarly bogus value somewhere in the signal-delivery code. regards, tom lane
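For a rough, illustrative cross-check of how much user stack a single signal delivery needs on a given platform, one can print the sizes of the structures the kernel has to deposit there (this little program is not from the thread and only gives a lower bound; the kernel-built frame also includes padding, the red zone, and arch-specific save areas that these userspace typedefs may not fully cover):

#include <signal.h>
#include <stdio.h>
#include <ucontext.h>

int
main(void)
{
	/* Rough lower bound on what rt signal delivery deposits on the
	 * user stack before the handler gets control. */
	printf("sizeof(siginfo_t)  = %zu\n", sizeof(siginfo_t));
	printf("sizeof(ucontext_t) = %zu\n", sizeof(ucontext_t));
	return 0;
}

On ppc64 the saved register state, including the vector registers, makes the frame considerably larger than on x86-64, which fits the suspicion that the kernel comment's "about 1.5kB" understates reality there.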
I wrote:
> In short, my current belief is that Linux PPC64 fails when trying
> to deliver a signal if there's right around 2KB of stack remaining,
> even though it should be able to expand the stack and press on.

I figured I should try to remove some variables from the equation by
demonstrating this claim without involving Postgres. The attached test
program eats some stack space and then waits to be sent SIGUSR1.
For some values of "some stack space", it dumps core:

[tgl@postgresql-fedora ~]$ gcc -g -Wall -O1 stacktest.c
[tgl@postgresql-fedora ~]$ ./a.out 1240000 &
[1] 11796
[tgl@postgresql-fedora ~]$ kill -USR1 11796
[tgl@postgresql-fedora ~]$ signal delivered, stack base 0x7fffdc160000 top 0x7fffdc031420 (1240032 used)
[1]+ Done ./a.out 1240000
[tgl@postgresql-fedora ~]$ ./a.out 1242000 &
[1] 11797
[tgl@postgresql-fedora ~]$ kill -USR1 11797
[tgl@postgresql-fedora ~]$
[1]+ Segmentation fault (core dumped) ./a.out 1242000
[tgl@postgresql-fedora ~]$ uname -a
Linux postgresql-fedora.novalocal 4.18.19-100.fc27.ppc64le #1 SMP Wed Nov 14 21:53:32 UTC 2018 ppc64le ppc64le ppc64le GNU/Linux

I don't think any further proof is required that this is
a kernel bug. Where would be a good place to file it?

regards, tom lane

#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <unistd.h>

static char *stack_base_ptr;
static char *stack_top_ptr;

static volatile sig_atomic_t sig_occurred = 0;

static void
sigusr1_handler(int signal_arg)
{
	sig_occurred = 1;
}

static void
consume_stack(long stackdepth)
{
	char		stack_cur;

	if ((stack_base_ptr - &stack_cur) < stackdepth)
		consume_stack(stackdepth);
	else
	{
		stack_top_ptr = &stack_cur;
		while (!sig_occurred)
			;
	}
}

int
main(int argc, char **argv)
{
	long		stackdepth = atol(argv[1]);
	char		stack_base;
	struct sigaction act;

	act.sa_handler = sigusr1_handler;
	sigemptyset(&act.sa_mask);
	act.sa_flags = 0;
	if (sigaction(SIGUSR1, &act, NULL) < 0)
		perror("sigaction");

	stack_base_ptr = (char *) (((size_t) &stack_base + 65535) & ~65535UL);

	consume_stack(stackdepth);

	if (sig_occurred)
		printf("signal delivered, stack base %p top %p (%zu used)\n",
			   stack_base_ptr, stack_top_ptr,
			   stack_base_ptr - stack_top_ptr);
	else
		printf("no signal delivered\n");

	return 0;
}
On Sun, Oct 13, 2019 at 1:06 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > I don't think any further proof is required that this is > a kernel bug. Where would be a good place to file it? linuxppc-dev@lists.ozlabs.org might be the right place. https://lists.ozlabs.org/listinfo/linuxppc-dev
Hi, On 2019-10-13 13:44:59 +1300, Thomas Munro wrote: > On Sun, Oct 13, 2019 at 1:06 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > I don't think any further proof is required that this is > > a kernel bug. Where would be a good place to file it? > > linuxppc-dev@lists.ozlabs.org might be the right place. > > https://lists.ozlabs.org/listinfo/linuxppc-dev Probably requires reproducing on a pretty recent kernel first, to have a decent chance of being investigated... Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > Probably requires reproducing on a pretty recent kernel first, to have a > decent chance of being investigated... How recent do you think it needs to be? The machine I was testing on yesterday is under a year old: uname -m = ppc64le uname -r = 4.18.19-100.fc27.ppc64le uname -s = Linux uname -v = #1 SMP Wed Nov 14 21:53:32 UTC 2018 The latest-by-version-number ppc64 kernel I can find in the buildfarm is bonito, uname -m = ppc64le uname -r = 4.19.15-300.fc29.ppc64le uname -s = Linux uname -v = #1 SMP Mon Jan 14 16:21:04 UTC 2019 and that's certainly shown it too. regards, tom lane
Hi, On 2019-10-13 10:29:45 -0400, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > Probably requires reproducing on a pretty recent kernel first, to have a > > decent chance of being investigated... > > How recent do you think it needs to be? The machine I was testing on > yesterday is under a year old: > > uname -r = 4.18.19-100.fc27.ppc64le > ... > uname -r = 4.19.15-300.fc29.ppc64le My experience reporting kernel bugs is that the latest released version, or even just the tip of the git tree, is your best bet :/. And that reports using distro kernels - with all their out of tree changes - are also less likely to receive a response. IIRC there's a fedora repo with upstream kernels. Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > On 2019-10-13 10:29:45 -0400, Tom Lane wrote: >> How recent do you think it needs to be? > My experience reporting kernel bugs is that the latest released version, > or even just the tip of the git tree, is your best bet :/. Considering that we're going to point them at chapter and verse in Torvald's own repo, I do not think they can claim that we're worried about obsolete code. regards, tom lane
Filed at https://bugzilla.kernel.org/show_bug.cgi?id=205183 We'll see what happens ... regards, tom lane
I wrote: > Filed at > https://bugzilla.kernel.org/show_bug.cgi?id=205183 > We'll see what happens ... Further to this --- I went back and looked at the outlier events where we saw an infinite_recurse failure on a non-Linux-PPC64 platform. There were only three: mereswine | ARMv7 | Linux debian-armhf | Clarence Ho | REL_11_STABLE | 2019-08-11 02:10:12 | InstallCheck-C | 2019-08-11 02:36:10.159 PDT [5004:4] DETAIL: Failed process was running: select infinite_recurse(); mereswine | ARMv7 | Linux debian-armhf | Clarence Ho | REL_12_STABLE | 2019-08-11 09:52:46 | pg_upgradeCheck| 2019-08-11 04:21:16.756 PDT [6804:5] DETAIL: Failed process was running: select infinite_recurse(); mereswine | ARMv7 | Linux debian-armhf | Clarence Ho | HEAD | 2019-08-11 11:29:27 | pg_upgradeCheck| 2019-08-11 07:15:28.454 PDT [9954:76] DETAIL: Failed process was running: select infinite_recurse(); Looking closer at these, though, they were *not* SIGSEGV failures, but SIGKILLs. Seeing that they were all on the same machine on the same day, I'm thinking we can write them off as a transiently misconfigured OOM killer. So, pending some other theory emerging from the kernel hackers, we're down to it's-a-PPC64-kernel-bug. That leaves me wondering what if anything we want to do about it. Even if it's fixed reasonably promptly in Linux upstream, and then we successfully nag assorted vendors to incorporate the fix quickly, that's still going to leave us with frequent buildfarm failures on Mark's flotilla of not-the-very-shiniest Linux versions. Should we move the infinite_recurse test to happen alone in a parallel group just to stop these failures? That's annoying from a parallelism standpoint, but I don't see any other way to avoid these failures. regards, tom lane
I want to give some conclusion to our occurrence of this, which now I think
was neither an instance of nor indicative of any bug.

Summary: postgres was being killed with kill -9 by a deployment script, after
it "timed out". Thus no log messages.

I initially experienced this while testing migration of a customer's DB using
an ansible script, which did "async: 2222, poll: 0". Which I guess ansible
takes to mean "let the process run for 2222 seconds, after which send it
SIGKILL". I probably made multiple attempts to migrate (for example to fix
tablespaces or dependencies on postgis), and the process was started during
one test and never stopped nor restarted during following tests, until it
finally hit 2222 seconds and was stopped ungracefully.

This just happened again, during a different migration, so I reproduced it
like:

ansible --sudo --sudo-user pryzbyj 'database.customer.*' --background=1 --poll=1 -m command -a '/usr/pgsql-12/bin/postmaster -D /home/pryzbyj/pg12.dat -c port=5678 -c unix-socket-directories=/tmp -c client_min_messages=debug -c log_temp_files=0 -c log_lock_waits=1'

Connect to the customer and verify it was killed uncleanly:

[pryzbyj@database ~]$ /usr/pgsql-12/bin/postmaster -D ./pg12.dat -c port=5678 -c unix-socket-directories=/tmp -c client_min_messages=debug -c log_temp_files=0 -c log_lock_waits=1 -c logging_collector=off
2019-10-22 17:57:58.251 EDT [5895] FATAL: lock file "postmaster.pid" already exists
2019-10-22 17:57:58.251 EDT [5895] HINT: Is another postmaster (PID 5608) running in data directory "/home/pryzbyj/./pg12.dat"?

[pryzbyj@database ~]$ /usr/pgsql-12/bin/postmaster -D ./pg12.dat -c port=5678 -c unix-socket-directories=/tmp -c client_min_messages=debug -c log_temp_files=0 -c log_lock_waits=1 -c logging_collector=off
2019-10-22 17:58:01.312 EDT [5962] LOG: starting PostgreSQL 12.0 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39), 64-bit
2019-10-22 17:58:01.312 EDT [5962] LOG: listening on IPv6 address "::1", port 5678
2019-10-22 17:58:01.312 EDT [5962] LOG: listening on IPv4 address "127.0.0.1", port 5678
2019-10-22 17:58:01.328 EDT [5962] LOG: listening on Unix socket "/tmp/.s.PGSQL.5678"
2019-10-22 17:58:01.353 EDT [5963] LOG: database system was interrupted; last known up at 2019-10-22 17:57:48 EDT
2019-10-22 17:58:01.460 EDT [5963] LOG: database system was not properly shut down; automatic recovery in progress
2019-10-22 17:58:01.478 EDT [5963] LOG: invalid record length at 6/E829D128: wanted 24, got 0
2019-10-22 17:58:01.478 EDT [5963] LOG: redo is not required
2019-10-22 17:58:01.526 EDT [5962] LOG: database system is ready to accept connections

On Tue, Jul 23, 2019 at 11:27:03AM -0500, Justin Pryzby wrote:
> Does anyone have a stress test for parallel workers ?
>
> On a customer's new VM, I got this several times while (trying to) migrate their DB:
>
> < 2019-07-23 10:33:51.552 CDT postgres >FATAL: postmaster exited during a parallel transaction
> < 2019-07-23 10:33:51.552 CDT postgres >STATEMENT: CREATE UNIQUE INDEX unused0_huawei_umts_nodeb_locell_201907_unique_idx ON child.unused0_huawei_umts_nodeb_locell_201907 USING btree ...
>
> There's nothing in dmesg nor in postgres logs.
>
> At first I thought it's maybe because of a network disconnection, then I
> thought it's because we ran out of space (wal), then they had a faulty blade.
> After that, I'd tried and failed to reproduce it a number of times, but it's
> just recurred during what was intended to be their final restore. I've set
> max_parallel_workers_per_gather=0, but I'm planning to try to diagnose an issue
> in another instance. Ideally a minimal test, since I'm apparently going to
> have to run under gdb to see how it's dying, or even what process is failing.
>
> DMI: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/21/2015
> CentOS release 6.9 (Final)
> Linux alextelsasrv01 2.6.32-754.17.1.el6.x86_64 #1 SMP Tue Jul 2 12:42:48 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
> version | PostgreSQL 11.4 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-23), 64-bit
On Tue, Oct 15, 2019 at 4:50 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Filed at > > https://bugzilla.kernel.org/show_bug.cgi?id=205183 For the curious-and-not-subscribed, there's now a kernel patch proposed for this. We guessed pretty close, but the problem wasn't those dodgy looking magic numbers, it was that the bad stack expansion check only allows for user space to expand the stack (FAULT_FLAG_USER), and here the kernel itself wants to build a stack frame.
On Wed, Dec 11, 2019 at 3:22 PM Thomas Munro <thomas.munro@gmail.com> wrote: > On Tue, Oct 15, 2019 at 4:50 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > Filed at > > > https://bugzilla.kernel.org/show_bug.cgi?id=205183 > > For the curious-and-not-subscribed, there's now a kernel patch > proposed for this. We guessed pretty close, but the problem wasn't > those dodgy looking magic numbers, it was that the bad stack expansion > check only allows for user space to expand the stack > (FAULT_FLAG_USER), and here the kernel itself wants to build a stack > frame. Hehe, the dodgy looking magic numbers *were* wrong: - * The kernel signal delivery code writes up to about 1.5kB + * The kernel signal delivery code writes a bit over 4KB https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20200724092528.1578671-2-mpe@ellerman.id.au/
Thomas Munro <thomas.munro@gmail.com> writes: > Hehe, the dodgy looking magic numbers *were* wrong: > - * The kernel signal delivery code writes up to about 1.5kB > + * The kernel signal delivery code writes a bit over 4KB > https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20200724092528.1578671-2-mpe@ellerman.id.au/ ... and, having seemingly not learned a thing, they just replaced them with new magic numbers. Mumble sizeof() mumble. Anyway, I guess the interesting question for us is how long it will take for this fix to propagate into real-world systems. I don't have much of a clue about the Linux kernel workflow, anybody want to venture a guess? regards, tom lane
On Tue, Jul 28, 2020 at 3:27 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Anyway, I guess the interesting question for us is how long it > will take for this fix to propagate into real-world systems. > I don't have much of a clue about the Linux kernel workflow, > anybody want to venture a guess? Me neither. It just hit Torvalds' tree[1] marked "Cc: stable@vger.kernel.org # v2.6.27+". I looked at the time for a couple of other PowerPC-related commits of similar complexity involving some of the same names to get from there to a Debian stable kernel package and it seemed to be under a couple of months. [1] https://github.com/torvalds/linux/commit/63dee5df43a31f3844efabc58972f0a206ca4534
Thomas Munro <thomas.munro@gmail.com> writes: > On Tue, Jul 28, 2020 at 3:27 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Anyway, I guess the interesting question for us is how long it >> will take for this fix to propagate into real-world systems. >> I don't have much of a clue about the Linux kernel workflow, >> anybody want to venture a guess? > Me neither. It just hit Torvalds' tree[1] marked "Cc: > stable@vger.kernel.org # v2.6.27+". I looked at the time for a couple > of other PowerPC-related commits of similar complexity involving some > of the same names to get from there to a Debian stable kernel package > and it seemed to be under a couple of months. > [1] https://github.com/torvalds/linux/commit/63dee5df43a31f3844efabc58972f0a206ca4534 For our archives' sake: today I got seemingly-automated mail informing me that this patch has been merged into the 4.19-stable, 5.4-stable, 5.7-stable, and 5.8-stable kernel branches; but not 4.4-stable, 4.9-stable, or 4.14-stable, because it failed to apply. regards, tom lane
I wrote: > For our archives' sake: today I got seemingly-automated mail informing me > that this patch has been merged into the 4.19-stable, 5.4-stable, > 5.7-stable, and 5.8-stable kernel branches; but not 4.4-stable, > 4.9-stable, or 4.14-stable, because it failed to apply. And this morning's mail brought news that the latter three branches are now patched as well. So I guess at this point it's down to platform vendors as to whether or how fast they absorb such changes. It might help for us to file platform-specific bug reports asking for the change to be merged. regards, tom lane
On Tue, Aug 25, 2020 at 1:43 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > I wrote: > > For our archives' sake: today I got seemingly-automated mail informing me > > that this patch has been merged into the 4.19-stable, 5.4-stable, > > 5.7-stable, and 5.8-stable kernel branches; but not 4.4-stable, > > 4.9-stable, or 4.14-stable, because it failed to apply. > > And this morning's mail brought news that the latter three branches > are now patched as well. So I guess at this point it's down to > platform vendors as to whether or how fast they absorb such changes. Today I upgraded a Debian buster box and saw a new kernel image roll in. Lo and behold: $ zgrep 'stack expansion' /usr/share/doc/linux-image-4.19.0-11-amd64/changelog.gz - [powerpc*] Allow 4224 bytes of stack expansion for the signal frame