Thread: stress test for parallel workers
Does anyone have a stress test for parallel workers ?

On a customer's new VM, I got this several times while (trying to) migrate their DB:

< 2019-07-23 10:33:51.552 CDT postgres >FATAL:  postmaster exited during a parallel transaction
< 2019-07-23 10:33:51.552 CDT postgres >STATEMENT:  CREATE UNIQUE INDEX unused0_huawei_umts_nodeb_locell_201907_unique_idx ON child.unused0_huawei_umts_nodeb_locell_201907 USING btree ...

There's nothing in dmesg nor in postgres logs.

At first I thought it's maybe because of a network disconnection, then I thought it's because we ran out of space (WAL), then they had a faulty blade.  After that, I'd tried and failed to reproduce it a number of times, but it's just recurred during what was intended to be their final restore.

I've set max_parallel_workers_per_gather=0, but I'm planning to try to diagnose the issue in another instance.  Ideally with a minimal test, since I'm apparently going to have to run under gdb to see how it's dying, or even what process is failing.

DMI: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/21/2015
CentOS release 6.9 (Final)
Linux alextelsasrv01 2.6.32-754.17.1.el6.x86_64 #1 SMP Tue Jul 2 12:42:48 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
version | PostgreSQL 11.4 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-23), 64-bit

Justin
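As a rough illustration of the kind of stress test being asked for here, a small libpq driver that repeatedly forces parallel plans keeps the postmaster busy launching and reaping workers.  This is only a sketch: the table name stress_tbl, the row count, and the cost settings are illustrative choices, not anything taken from the report above.

/*
 * Crude parallel-worker stress loop (sketch only).  Repeatedly runs a
 * query that is likely to use a parallel plan, so the postmaster is kept
 * busy launching and reaping background workers.
 *
 * Build roughly as: cc stress.c -o stress -I$(pg_config --includedir) -lpq
 */
#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
	PGconn	   *conn = PQconnectdb("");		/* use PG* environment variables */

	if (PQstatus(conn) != CONNECTION_OK)
	{
		fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
		return 1;
	}

	/* Encourage parallelism as much as possible for this session. */
	PQclear(PQexec(conn, "SET max_parallel_workers_per_gather = 4"));
	PQclear(PQexec(conn, "SET parallel_setup_cost = 0"));
	PQclear(PQexec(conn, "SET parallel_tuple_cost = 0"));
	PQclear(PQexec(conn, "SET min_parallel_table_scan_size = 0"));

	PQclear(PQexec(conn, "DROP TABLE IF EXISTS stress_tbl"));
	PQclear(PQexec(conn, "CREATE TABLE stress_tbl AS "
						 "SELECT g AS i, md5(g::text) AS t "
						 "FROM generate_series(1, 1000000) g"));

	for (int i = 0; i < 10000; i++)
	{
		PGresult   *res = PQexec(conn, "SELECT count(*) FROM stress_tbl");

		if (PQresultStatus(res) != PGRES_TUPLES_OK)
		{
			fprintf(stderr, "iteration %d: %s", i, PQerrorMessage(conn));
			PQclear(res);
			break;
		}
		PQclear(res);
	}

	PQfinish(conn);
	return 0;
}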
Justin Pryzby <pryzby@telsasoft.com> writes:
> Does anyone have a stress test for parallel workers ?
> On a customer's new VM, I got this several times while (trying to) migrate their DB:
> < 2019-07-23 10:33:51.552 CDT postgres >FATAL:  postmaster exited during a parallel transaction

We've been seeing this irregularly in the buildfarm, too.  I've been suspicious that it's from an OOM kill on the postmaster in the buildfarm cases, but ...

> There's nothing in dmesg nor in postgres logs.

... you'd think an OOM kill would show up in the kernel log.  (Not necessarily in dmesg, though.  Did you try syslog?)

> Ideally a minimal test, since I'm apparently going to
> have to run under gdb to see how it's dying, or even what process is failing.

Like it told you, it's the postmaster that's going away.  That's Not Supposed To Happen, of course, but unfortunately Linux' OOM kill heuristic preferentially targets the postmaster when its children are consuming too many resources.  If that is the problem, there's some info on working around it at

https://www.postgresql.org/docs/current/kernel-resources.html#LINUX-MEMORY-OVERCOMMIT

			regards, tom lane
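As an aside on the OOM-kill workaround Tom points to: the usual Linux technique is to start the parent process with a low OOM score and have each forked child make itself a more attractive victim by writing to /proc/self/oom_score_adj.  A minimal sketch of that idea, assuming the process may write its own score file (the value 500 and the child command are arbitrary; the linked documentation describes how PostgreSQL itself arranges this):

/*
 * Sketch only: raise a child process's OOM score after fork() so the
 * Linux OOM killer prefers it over the parent.  Positive values make
 * the process a more likely victim; error handling is deliberately thin.
 */
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

static void
make_me_more_killable(const char *score)	/* e.g. "500" */
{
	FILE	   *f = fopen("/proc/self/oom_score_adj", "w");

	if (f != NULL)
	{
		fputs(score, f);
		fclose(f);
	}
}

int
main(void)
{
	pid_t		pid = fork();

	if (pid == 0)
	{
		/* child: become the preferred OOM victim, then do the real work */
		make_me_more_killable("500");
		execlp("sleep", "sleep", "60", (char *) NULL);
		_exit(127);
	}
	/* parent (the "postmaster" here) keeps its original, lower score */
	waitpid(pid, NULL, 0);
	return 0;
}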
On Tue, Jul 23, 2019 at 01:28:47PM -0400, Tom Lane wrote:
> ... you'd think an OOM kill would show up in the kernel log.
> (Not necessarily in dmesg, though.  Did you try syslog?)

Nothing in /var/log/messages (nor the dmesg ring).  I enabled abrtd while trying to reproduce it last week.  Since you asked I looked again in messages, and found it'd logged 10 hours ago about this:

(gdb) bt
#0  0x000000395be32495 in raise () from /lib64/libc.so.6
#1  0x000000395be33c75 in abort () from /lib64/libc.so.6
#2  0x000000000085ddff in errfinish (dummy=<value optimized out>) at elog.c:555
#3  0x00000000006f7e94 in CheckPointReplicationOrigin () at origin.c:588
#4  0x00000000004f6ef1 in CheckPointGuts (checkPointRedo=5507491783792, flags=128) at xlog.c:9150
#5  0x00000000004feff6 in CreateCheckPoint (flags=128) at xlog.c:8937
#6  0x00000000006d49e2 in CheckpointerMain () at checkpointer.c:491
#7  0x000000000050fe75 in AuxiliaryProcessMain (argc=2, argv=0x7ffe00d56b00) at bootstrap.c:451
#8  0x00000000006dcf54 in StartChildProcess (type=CheckpointerProcess) at postmaster.c:5337
#9  0x00000000006de78a in reaper (postgres_signal_arg=<value optimized out>) at postmaster.c:2867
#10 <signal handler called>
#11 0x000000395bee1603 in __select_nocancel () from /lib64/libc.so.6
#12 0x00000000006e1488 in ServerLoop (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1671
#13 PostmasterMain (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1380
#14 0x0000000000656420 in main (argc=3, argv=0x27ae410) at main.c:228

#2  0x000000000085ddff in errfinish (dummy=<value optimized out>) at elog.c:555
        edata = <value optimized out>
        elevel = 22
        oldcontext = 0x27e15d0
        econtext = 0x0
        __func__ = "errfinish"
#3  0x00000000006f7e94 in CheckPointReplicationOrigin () at origin.c:588
        save_errno = <value optimized out>
        tmppath = 0x9c4518 "pg_logical/replorigin_checkpoint.tmp"
        path = 0x9c4300 "pg_logical/replorigin_checkpoint"
        tmpfd = 64
        i = <value optimized out>
        magic = 307747550
        crc = 4294967295
        __func__ = "CheckPointReplicationOrigin"
#4  0x00000000004f6ef1 in CheckPointGuts (checkPointRedo=5507491783792, flags=128) at xlog.c:9150
No locals.
#5  0x00000000004feff6 in CreateCheckPoint (flags=128) at xlog.c:8937
        shutdown = false
        checkPoint = {redo = 5507491783792, ThisTimeLineID = 1, PrevTimeLineID = 1, fullPageWrites = true, nextXidEpoch = 0, nextXid = 2141308, nextOid = 496731439, nextMulti = 1, nextMultiOffset = 0, oldestXid = 561, oldestXidDB = 1, oldestMulti = 1, oldestMultiDB = 1, time = 1563781930, oldestCommitTsXid = 0, newestCommitTsXid = 0, oldestActiveXid = 2141308}
        recptr = <value optimized out>
        _logSegNo = <value optimized out>
        Insert = <value optimized out>
        freespace = <value optimized out>
        PriorRedoPtr = <value optimized out>
        curInsert = <value optimized out>
        last_important_lsn = <value optimized out>
        vxids = 0x280afb8
        nvxids = 0
        __func__ = "CreateCheckPoint"
#6  0x00000000006d49e2 in CheckpointerMain () at checkpointer.c:491
        ckpt_performed = false
        do_restartpoint = <value optimized out>
        flags = 128
        do_checkpoint = <value optimized out>
        now = 1563781930
        elapsed_secs = <value optimized out>
        cur_timeout = <value optimized out>
        rc = <value optimized out>
        local_sigjmp_buf = {{__jmpbuf = {2, -1669940128760174522, 9083146, 0, 140728912407216, 140728912407224, -1669940128812603322, 1670605924426606662}, __mask_was_saved = 1, __saved_mask = {__val = {18446744066192964103, 0, 246358747096, 140728912407296, 140446084917816, 140446078556040, 9083146, 0, 246346239061, 140446078556040, 140447207471460, 0, 140447207471424, 140446084917816, 0, 7864320}}}}
        checkpointer_context = 0x27e15d0
        __func__ = "CheckpointerMain"

Supposedly it's trying to do this:

|                ereport(PANIC,
|                                (errcode_for_file_access(),
|                                 errmsg("could not write to file \"%s\": %m",
|                                                tmppath)));

And since there's consistently nothing in logs, I'm guessing there's a legitimate write error (legitimate from PG perspective).  Storage here is ext4 plus zfs tablespace on top of LVM on top of vmware thin volume.

Justin
On Wed, Jul 24, 2019 at 4:27 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> < 2019-07-23 10:33:51.552 CDT postgres >FATAL:  postmaster exited during a parallel transaction
> < 2019-07-23 10:33:51.552 CDT postgres >STATEMENT:  CREATE UNIQUE INDEX unused0_huawei_umts_nodeb_locell_201907_unique_idx ON child.unused0_huawei_umts_nodeb_locell_201907 USING btree ...
> ... I've set
> max_parallel_workers_per_gather=0, ...

Just by the way, parallelism in CREATE INDEX is controlled by max_parallel_maintenance_workers, not max_parallel_workers_per_gather.

--
Thomas Munro
https://enterprisedb.com
On Wed, Jul 24, 2019 at 5:42 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> #2  0x000000000085ddff in errfinish (dummy=<value optimized out>) at elog.c:555
>         edata = <value optimized out>
>         elevel = 22
>         oldcontext = 0x27e15d0
>         econtext = 0x0
>         __func__ = "errfinish"
> #3  0x00000000006f7e94 in CheckPointReplicationOrigin () at origin.c:588
>         save_errno = <value optimized out>
>         tmppath = 0x9c4518 "pg_logical/replorigin_checkpoint.tmp"
>         path = 0x9c4300 "pg_logical/replorigin_checkpoint"
>         tmpfd = 64
>         i = <value optimized out>
>         magic = 307747550
>         crc = 4294967295
>         __func__ = "CheckPointReplicationOrigin"

> Supposedly it's trying to do this:
>
> |                ereport(PANIC,
> |                                (errcode_for_file_access(),
> |                                 errmsg("could not write to file \"%s\": %m",
> |                                                tmppath)));
>
> And since there's consistently nothing in logs, I'm guessing there's a
> legitimate write error (legitimate from PG perspective).  Storage here is ext4
> plus zfs tablespace on top of LVM on top of vmware thin volume.

If you have that core, it might be interesting to go to frame 2 and print *edata or edata->saved_errno.  If the errno is EIO, it's a bit strange if that's not showing up in some form in kernel logs or dmesg or something; if it's ENOSPC I guess it'd be normal that it doesn't show up anywhere and there is nothing in the PostgreSQL logs if they're on the same full filesystem, but then you would probably already have mentioned that your filesystem was out of space.  Could it have been fleetingly full due to some other thing happening on the system that rapidly expands and contracts?

I'm confused by the evidence, though.  If this PANIC is the origin of the problem, how do we get to postmaster-death based exit in a parallel leader*, rather than quickdie() (the kind of exit that happens when the postmaster sends SIGQUIT to every process and they say "terminating connection because of crash of another server process", because some backend crashed or panicked)?  Perhaps it would be clearer what's going on if you could put the PostgreSQL log onto a different filesystem, so we get a better chance of collecting evidence?

But then... the parallel leader process was apparently able to log something -- maybe it was just lucky, but you said this happened this way more than once.  I'm wondering how it could be that you got some kind of IO failure and weren't able to log the PANIC message AND your postmaster was killed, and you were able to log a message about that.  Perhaps we're looking at evidence from two unrelated failures.

*I suspect that the only thing implicating parallelism in this failure is that parallel leaders happen to print out that message if the postmaster dies while they are waiting for workers; most other places (probably every other backend in your cluster) just quietly exit.  That tells us something about what's happening, but on its own doesn't tell us that parallelism plays an important role in the failure mode.

--
Thomas Munro
https://enterprisedb.com
Thomas Munro <thomas.munro@gmail.com> writes:
> *I suspect that the only thing implicating parallelism in this failure
> is that parallel leaders happen to print out that message if the
> postmaster dies while they are waiting for workers; most other places
> (probably every other backend in your cluster) just quietly exit.
> That tells us something about what's happening, but on its own doesn't
> tell us that parallelism plays an important role in the failure mode.

I agree that there's little evidence implicating parallelism directly.  The reason I'm suspicious about a possible OOM kill is that parallel queries would appear to the OOM killer to be eating more resources than the same workload non-parallel, so that we might be at more hazard of getting OOM'd just because of that.

A different theory is that there's some hard-to-hit bug in the postmaster's processing of parallel workers that doesn't apply to regular backends.  I've looked for one in a desultory way but not really focused on it.

In any case, the evidence from the buildfarm is pretty clear that there is *some* connection.  We've seen a lot of recent failures involving "postmaster exited during a parallel transaction", while the number of postmaster failures not involving that is epsilon.

			regards, tom lane
On Wed, Jul 24, 2019 at 10:03 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> >         edata = <value optimized out>
> If you have that core, it might be interesting to go to frame 2 and
> print *edata or edata->saved_errno. ...

Rats.  We already saw that it's optimised out so unless we can find that somewhere else in a variable that's present in the core, we probably can't find out what the operating system said.  So my other idea for getting this information next time is to try putting the PostgreSQL logs somewhere that's more likely to be still working when that thing fails.

--
Thomas Munro
https://enterprisedb.com
On Wed, Jul 24, 2019 at 10:03:25AM +1200, Thomas Munro wrote:
> On Wed, Jul 24, 2019 at 5:42 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > #2  0x000000000085ddff in errfinish (dummy=<value optimized out>) at elog.c:555
> >         edata = <value optimized out>
>
> If you have that core, it might be interesting to go to frame 2 and
> print *edata or edata->saved_errno.

As you saw ... unless you know a trick, it's "optimized out".

> Could it have been fleetingly full due to some other thing happening on the
> system that rapidly expands and contracts?

It's not impossible, especially while loading data, and data_dir is only 64GB; it may have happened that way sometimes; but it's hard to believe that's been the case 5-10 times now.  If I don't forget to drop the old database previously loaded, when loading old/historic data, it should have ~40GB free on data_dir, and no clients connected other than pg_restore.

$ df -h /var/lib/pgsql
Filesystem                 Size  Used Avail Use% Mounted on
/dev/mapper/data-postgres   64G   26G   38G  41% /var/lib/pgsql

> | ereport(PANIC,
> |                 (errcode_for_file_access(),
> |                  errmsg("could not write to file \"%s\": %m",
> |                                 tmppath)));
>
> And since there's consistently nothing in logs, I'm guessing there's a
> legitimate write error (legitimate from PG perspective).  Storage here is ext4
> plus zfs tablespace on top of LVM on top of vmware thin volume.

I realized this probably is *not* an issue with zfs, since it's failing to log (for one reason or another) to /var/lib/pgsql (ext4).

> Perhaps it would be clearer what's going on if you could put the PostgreSQL
> log onto a different filesystem, so we get a better chance of collecting
> evidence?

I didn't mention it, but last weekend I'd left a loop around the restore process running overnight, and had convinced myself the issue didn't recur since their faulty blade was taken out of service...  My plan was to leave the server running in the foreground with logging_collector=no, which I hope is enough, unless logging is itself somehow implicated.  I'm trying to stress test that way now.

> But then... the parallel leader process was apparently able
> to log something -- maybe it was just lucky, but you said this
> happened this way more than once.  I'm wondering how it could be that
> you got some kind of IO failure and weren't able to log the PANIC
> message AND your postmaster was killed, and you were able to log a
> message about that.  Perhaps we're looking at evidence from two
> unrelated failures.

The messages from the parallel leader (building indices) were visible to the client, not via the server log.  I was loading their data and the errors were visible when pg_restore failed.

On Wed, Jul 24, 2019 at 09:10:41AM +1200, Thomas Munro wrote:
> Just by the way, parallelism in CREATE INDEX is controlled by
> max_parallel_maintenance_workers, not max_parallel_workers_per_gather.

Thank you.

Justin
On Wed, Jul 24, 2019 at 10:42 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> On Wed, Jul 24, 2019 at 10:03:25AM +1200, Thomas Munro wrote:
> > On Wed, Jul 24, 2019 at 5:42 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > > #2  0x000000000085ddff in errfinish (dummy=<value optimized out>) at elog.c:555
> > >         edata = <value optimized out>
> >
> > If you have that core, it might be interesting to go to frame 2 and
> > print *edata or edata->saved_errno.
>
> As you saw ... unless you know a trick, it's "optimized out".

How about something like this:

print errordata[errordata_stack_depth]

If you can't find errordata_stack_depth, maybe look at the whole array and try to find the interesting bit?

--
Thomas Munro
https://enterprisedb.com
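Why this gdb trick can work: elog.c keeps its error state in a small file-scope static array indexed by a depth counter, so the data stays reachable from a core dump even when a function's local ErrorData pointer has been optimized out.  A simplified sketch of that layout (field list abbreviated and possibly not matching the exact version in use; the real definitions live in src/backend/utils/error/elog.c):

/*
 * Layout sketch of elog.c's error stack.  Because the array is static
 * at file scope, gdb can print it from a core dump even when a local
 * ErrorData pointer such as errfinish()'s "edata" has been optimized out.
 */
#define ERRORDATA_STACK_SIZE 5

typedef struct ErrorData
{
	int			elevel;			/* error severity */
	const char *filename;		/* __FILE__ of ereport() call */
	int			lineno;			/* __LINE__ of ereport() call */
	const char *funcname;		/* __func__ of ereport() call */
	char	   *message;		/* primary error message */
	int			saved_errno;	/* errno at entry */
	/* ... many more fields in the real struct ... */
} ErrorData;

static ErrorData errordata[ERRORDATA_STACK_SIZE];	/* the stack itself */
static int	errordata_stack_depth = -1;				/* index of topmost entry */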
On Wed, Jul 24, 2019 at 10:46:42AM +1200, Thomas Munro wrote:
> On Wed, Jul 24, 2019 at 10:42 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > On Wed, Jul 24, 2019 at 10:03:25AM +1200, Thomas Munro wrote:
> > > On Wed, Jul 24, 2019 at 5:42 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > > > #2  0x000000000085ddff in errfinish (dummy=<value optimized out>) at elog.c:555
> > > >         edata = <value optimized out>
> > >
> > > If you have that core, it might be interesting to go to frame 2 and
> > > print *edata or edata->saved_errno.
> >
> > As you saw ... unless you know a trick, it's "optimized out".
>
> How about something like this:
>
> print errordata[errordata_stack_depth]

Clever.

(gdb) p errordata[errordata_stack_depth]
$2 = {elevel = 13986192, output_to_server = 254, output_to_client = 127, show_funcname = false, hide_stmt = false, hide_ctx = false,
  filename = 0x27b3790 "< %m %u >", lineno = 41745456, funcname = 0x3030313335 <Address 0x3030313335 out of bounds>,
  domain = 0x0, context_domain = 0x27cff90 "postgres", sqlerrcode = 0,
  message = 0xe8800000001 <Address 0xe8800000001 out of bounds>, detail = 0x297a <Address 0x297a out of bounds>,
  detail_log = 0x0, hint = 0xe88 <Address 0xe88 out of bounds>, context = 0x297a <Address 0x297a out of bounds>,
  message_id = 0x0, schema_name = 0x0, table_name = 0x0, column_name = 0x0, datatype_name = 0x0, constraint_name = 0x0,
  cursorpos = 0, internalpos = 0, internalquery = 0x0, saved_errno = 0, assoc_context = 0x0}

(gdb) p errordata
$3 = {{elevel = 22, output_to_server = true, output_to_client = false, show_funcname = false, hide_stmt = false, hide_ctx = false,
  filename = 0x9c4030 "origin.c", lineno = 591, funcname = 0x9c46e0 "CheckPointReplicationOrigin",
  domain = 0x9ac810 "postgres-11", context_domain = 0x9ac810 "postgres-11", sqlerrcode = 4293,
  message = 0x27b0e40 "could not write to file \"pg_logical/replorigin_checkpoint.tmp\": No space left on device",
  detail = 0x0, detail_log = 0x0, hint = 0x0, context = 0x0,
  message_id = 0x8a7aa8 "could not write to file \"%s\": %m", ...

I ought to have remembered that it *was* in fact out of space this AM when this core was dumped (due to having not touched it since scheduling transition to this VM last week).

I want to say I'm almost certain it wasn't ENOSPC in other cases, since, failing to find log output, I ran df right after the failure.

But that gives me an idea: is it possible there's an issue with files being held open by worker processes?  Including by parallel workers?  Probably WALs, even after they're rotated?  If there were worker processes holding open lots of rotated WALs, that could cause ENOSPC, but that wouldn't be obvious after they die, since the space would then be freed.

Justin
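A side note on the "space would then be freed" point: on Unix filesystems, disk space backing an unlinked-but-still-open file is only returned when the last descriptor on it is closed, which is exactly why a df taken after the offending processes have exited can look clean.  A small illustration of that effect (path and sizes are arbitrary):

/*
 * Demo: space used by an unlinked file is not reclaimed until every
 * open file descriptor on it is closed, so "df" run after the holder
 * exits shows nothing unusual.  Illustration only.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>

int
main(void)
{
	const char *path = "/tmp/holds_space.tmp";
	static char buf[1 << 20];
	int			fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0600);

	if (fd < 0)
	{
		perror("open");
		return 1;
	}
	unlink(path);				/* no name left, but the inode lives on */

	memset(buf, 'x', sizeof(buf));
	for (int i = 0; i < 256; i++)	/* ~256 MB that df still counts as used */
		if (write(fd, buf, sizeof(buf)) < 0)
		{
			perror("write");
			break;
		}

	puts("run 'df' now: the space is gone even though the file has no name");
	sleep(60);					/* space is reclaimed only when fd closes */
	close(fd);
	return 0;
}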
Justin Pryzby <pryzby@telsasoft.com> writes:
> I want to say I'm almost certain it wasn't ENOSPC in other cases, since,
> failing to find log output, I ran df right after the failure.

The fact that you're not finding log output matching what was reported to the client seems to me to be a mighty strong indication that there *was* an ENOSPC problem.  Can you reconfigure to put the postmaster log on a different volume?

> But that gives me an idea: is it possible there's an issue with files being
> held open by worker processes?  Including by parallel workers?  Probably
> WALs, even after they're rotated?  If there were worker processes holding
> open lots of rotated WALs, that could cause ENOSPC, but that wouldn't be
> obvious after they die, since the space would then be freed.

Parallel workers aren't ever allowed to write, in the current implementation, so it's not real obvious why they'd have any WAL log files open at all.

			regards, tom lane
On Wed, Jul 24, 2019 at 11:04 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> I ought to have remembered that it *was* in fact out of space this AM when this
> core was dumped (due to having not touched it since scheduling transition to
> this VM last week).
>
> I want to say I'm almost certain it wasn't ENOSPC in other cases, since,
> failing to find log output, I ran df right after the failure.

Ok, cool, so the ENOSPC thing we understand, and the postmaster death thing is probably something entirely different.  Which brings us to the question: what is killing your postmaster or causing it to exit silently and unexpectedly, but leaving no trace in any operating system log?  You mentioned that you couldn't see any signs of the OOM killer.  Are you in a situation to test an OOM failure so you can confirm what that looks like on your system?  You might try typing this into Python:

x = [42]
for i in range(1000):
    x = x + x

On my non-Linux system, it ran for a while and then was killed, and dmesg showed:

pid 15956 (python3.6), jid 0, uid 1001, was killed: out of swap space
pid 40238 (firefox), jid 0, uid 1001, was killed: out of swap space

Admittedly it is quite hard to distinguish between a web browser and a program designed to eat memory as fast as possible...  Anyway, on Linux you should see stuff about killed processes and/or OOM in one of dmesg, syslog, messages.

> But that gives me an idea: is it possible there's an issue with files being
> held open by worker processes?  Including by parallel workers?  Probably
> WALs, even after they're rotated?  If there were worker processes holding
> open lots of rotated WALs, that could cause ENOSPC, but that wouldn't be
> obvious after they die, since the space would then be freed.

Parallel workers don't do anything with WAL files, but they can create temporary files.  If you're building humongous indexes with parallel workers, you'll get some of those, but I don't think it'd be more than you'd get without parallelism.  If you were using up all of your disk space with temporary files, wouldn't this be reproducible?  I think you said you were testing this repeatedly, so if that were the problem I'd expect to see some non-panicky out-of-space errors when the temp files blow out your disk space, and only rarely a panic if a checkpoint happens to run exactly at a moment where the create index hasn't yet written the byte that breaks the camel's back, but the checkpoint pushes it over the edge in one of these places where it panics on failure.

--
Thomas Munro
https://enterprisedb.com
On Wed, Jul 24, 2019 at 10:11 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > *I suspect that the only thing implicating parallelism in this failure
> > is that parallel leaders happen to print out that message if the
> > postmaster dies while they are waiting for workers; most other places
> > (probably every other backend in your cluster) just quietly exit.
> > That tells us something about what's happening, but on its own doesn't
> > tell us that parallelism plays an important role in the failure mode.
>
> I agree that there's little evidence implicating parallelism directly.
> The reason I'm suspicious about a possible OOM kill is that parallel
> queries would appear to the OOM killer to be eating more resources
> than the same workload non-parallel, so that we might be at more
> hazard of getting OOM'd just because of that.
>
> A different theory is that there's some hard-to-hit bug in the
> postmaster's processing of parallel workers that doesn't apply to
> regular backends.  I've looked for one in a desultory way but not
> really focused on it.
>
> In any case, the evidence from the buildfarm is pretty clear that
> there is *some* connection.  We've seen a lot of recent failures
> involving "postmaster exited during a parallel transaction", while
> the number of postmaster failures not involving that is epsilon.

I don't have access to the build farm history in searchable format (I'll go and ask for that).  Do you have an example to hand?  Is this failure always happening on Linux?

--
Thomas Munro
https://enterprisedb.com
On 2019-Jul-23, Justin Pryzby wrote:
> I want to say I'm almost certain it wasn't ENOSPC in other cases, since,
> failing to find log output, I ran df right after the failure.

I'm not sure that this proves much, since I expect temporary files to be deleted on failure; by the time you run 'df' the condition might have already been cleared.  You'd need to be capturing diskspace telemetry with sufficient granularity ...

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Jul 24, 2019 at 11:32:30AM +1200, Thomas Munro wrote:
> On Wed, Jul 24, 2019 at 11:04 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > I ought to have remembered that it *was* in fact out of space this AM when this
> > core was dumped (due to having not touched it since scheduling transition to
> > this VM last week).
> >
> > I want to say I'm almost certain it wasn't ENOSPC in other cases, since,
> > failing to find log output, I ran df right after the failure.

I meant it wasn't a trivial error on my part of failing to drop the previously loaded DB instance.

It occurred to me to check inodes, which can also cause ENOSPC.  This is mkfs -T largefile, so running out of inodes is not an impossibility.  But it seems an unlikely culprit, unless something made tens of thousands of (small) files.

[pryzbyj@alextelsasrv01 ~]$ df -i /var/lib/pgsql
Filesystem                Inodes IUsed IFree IUse% Mounted on
/dev/mapper/data-postgres  65536  5605 59931    9% /var/lib/pgsql

> Ok, cool, so the ENOSPC thing we understand, and the postmaster death
> thing is probably something entirely different.  Which brings us to
> the question: what is killing your postmaster or causing it to exit
> silently and unexpectedly, but leaving no trace in any operating
> system log?  You mentioned that you couldn't see any signs of the OOM
> killer.  Are you in a situation to test an OOM failure so you can
> confirm what that looks like on your system?

$ command time -v python -c "'x'*4999999999" |wc
Traceback (most recent call last):
  File "<string>", line 1, in <module>
MemoryError
Command exited with non-zero status 1
...
	Maximum resident set size (kbytes): 4276

$ dmesg
...
Out of memory: Kill process 10665 (python) score 478 or sacrifice child
Killed process 10665, UID 503, (python) total-vm:4024260kB, anon-rss:3845756kB, file-rss:1624kB

I wouldn't burn too much more time on it until I can reproduce it.  The failures were all during pg_restore, so the checkpointer would've been very busy.  It seems possible for it to notice ENOSPC before the workers... which would be fsyncing WAL, where the checkpointer is fsyncing data.

> Admittedly it is quite hard to distinguish between a web browser
> and a program designed to eat memory as fast as possible...

Browsers making lots of progress here, but still clearly 2nd place.

Justin
Thomas Munro <thomas.munro@gmail.com> writes:
> On Wed, Jul 24, 2019 at 10:11 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> In any case, the evidence from the buildfarm is pretty clear that
>> there is *some* connection.  We've seen a lot of recent failures
>> involving "postmaster exited during a parallel transaction", while
>> the number of postmaster failures not involving that is epsilon.

> I don't have access to the build farm history in searchable format
> (I'll go and ask for that).

Yeah, it's definitely handy to be able to do SQL searches in the history.  I forget whether Dunstan or Frost is the person to ask for access, but there's no reason you shouldn't have it.

> Do you have an example to hand?  Is this
> failure always happening on Linux?

I dug around a bit further, and while my recollection of a lot of "postmaster exited during a parallel transaction" failures is accurate, there is a very strong correlation I'd not noticed: it's just a few buildfarm critters that are producing those.  To wit, I find that string in these recent failures (checked all runs in the past 3 months):

  sysname  |    branch     |      snapshot
-----------+---------------+---------------------
 lorikeet  | HEAD          | 2019-06-16 20:28:25
 lorikeet  | HEAD          | 2019-07-07 14:58:38
 lorikeet  | HEAD          | 2019-07-02 10:38:08
 lorikeet  | HEAD          | 2019-06-14 14:58:24
 lorikeet  | HEAD          | 2019-07-04 20:28:44
 lorikeet  | HEAD          | 2019-04-30 11:00:49
 lorikeet  | HEAD          | 2019-06-19 20:29:27
 lorikeet  | HEAD          | 2019-05-21 08:28:26
 lorikeet  | REL_11_STABLE | 2019-07-11 08:29:08
 lorikeet  | REL_11_STABLE | 2019-07-09 08:28:41
 lorikeet  | REL_12_STABLE | 2019-07-16 08:28:37
 lorikeet  | REL_12_STABLE | 2019-07-02 21:46:47
 lorikeet  | REL9_6_STABLE | 2019-07-02 20:28:14
 vulpes    | HEAD          | 2019-06-14 09:18:18
 vulpes    | HEAD          | 2019-06-27 09:17:19
 vulpes    | HEAD          | 2019-07-21 09:01:45
 vulpes    | HEAD          | 2019-06-12 09:11:02
 vulpes    | HEAD          | 2019-07-05 08:43:29
 vulpes    | HEAD          | 2019-07-15 08:43:28
 vulpes    | HEAD          | 2019-07-19 09:28:12
 wobbegong | HEAD          | 2019-06-09 20:43:22
 wobbegong | HEAD          | 2019-07-02 21:17:41
 wobbegong | HEAD          | 2019-06-04 21:06:07
 wobbegong | HEAD          | 2019-07-14 20:43:54
 wobbegong | HEAD          | 2019-06-19 21:05:04
 wobbegong | HEAD          | 2019-07-08 20:55:18
 wobbegong | HEAD          | 2019-06-28 21:18:46
 wobbegong | HEAD          | 2019-06-02 20:43:20
 wobbegong | HEAD          | 2019-07-04 21:01:37
 wobbegong | HEAD          | 2019-06-14 21:20:59
 wobbegong | HEAD          | 2019-06-23 21:36:51
 wobbegong | HEAD          | 2019-07-18 21:31:36
(32 rows)

We already knew that lorikeet has its own peculiar stability problems, and these other two critters run different compilers on the same Fedora 27 ppc64le platform.

So I think I've got to take back the assertion that we've got some lurking generic problem.  This pattern looks way more like a platform-specific issue.  Overaggressive OOM killer would fit the facts on vulpes/wobbegong, perhaps, though it's odd that it only happens on HEAD runs.

			regards, tom lane
On Wed, Jul 24, 2019 at 5:15 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > On Wed, Jul 24, 2019 at 10:11 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Do you have an example to hand?  Is this
> > failure always happening on Linux?
>
> I dug around a bit further, and while my recollection of a lot of
> "postmaster exited during a parallel transaction" failures is accurate,
> there is a very strong correlation I'd not noticed: it's just a few
> buildfarm critters that are producing those.  To wit, I find that
> string in these recent failures (checked all runs in the past 3 months):
>
>   sysname  |    branch     |      snapshot
> -----------+---------------+---------------------
>  lorikeet  | HEAD          | 2019-06-16 20:28:25
>  lorikeet  | HEAD          | 2019-07-07 14:58:38
>  lorikeet  | HEAD          | 2019-07-02 10:38:08
>  lorikeet  | HEAD          | 2019-06-14 14:58:24
>  lorikeet  | HEAD          | 2019-07-04 20:28:44
>  lorikeet  | HEAD          | 2019-04-30 11:00:49
>  lorikeet  | HEAD          | 2019-06-19 20:29:27
>  lorikeet  | HEAD          | 2019-05-21 08:28:26
>  lorikeet  | REL_11_STABLE | 2019-07-11 08:29:08
>  lorikeet  | REL_11_STABLE | 2019-07-09 08:28:41
>  lorikeet  | REL_12_STABLE | 2019-07-16 08:28:37
>  lorikeet  | REL_12_STABLE | 2019-07-02 21:46:47
>  lorikeet  | REL9_6_STABLE | 2019-07-02 20:28:14
>  vulpes    | HEAD          | 2019-06-14 09:18:18
>  vulpes    | HEAD          | 2019-06-27 09:17:19
>  vulpes    | HEAD          | 2019-07-21 09:01:45
>  vulpes    | HEAD          | 2019-06-12 09:11:02
>  vulpes    | HEAD          | 2019-07-05 08:43:29
>  vulpes    | HEAD          | 2019-07-15 08:43:28
>  vulpes    | HEAD          | 2019-07-19 09:28:12
>  wobbegong | HEAD          | 2019-06-09 20:43:22
>  wobbegong | HEAD          | 2019-07-02 21:17:41
>  wobbegong | HEAD          | 2019-06-04 21:06:07
>  wobbegong | HEAD          | 2019-07-14 20:43:54
>  wobbegong | HEAD          | 2019-06-19 21:05:04
>  wobbegong | HEAD          | 2019-07-08 20:55:18
>  wobbegong | HEAD          | 2019-06-28 21:18:46
>  wobbegong | HEAD          | 2019-06-02 20:43:20
>  wobbegong | HEAD          | 2019-07-04 21:01:37
>  wobbegong | HEAD          | 2019-06-14 21:20:59
>  wobbegong | HEAD          | 2019-06-23 21:36:51
>  wobbegong | HEAD          | 2019-07-18 21:31:36
> (32 rows)
>
> We already knew that lorikeet has its own peculiar stability
> problems, and these other two critters run different compilers
> on the same Fedora 27 ppc64le platform.
>
> So I think I've got to take back the assertion that we've got
> some lurking generic problem.  This pattern looks way more
> like a platform-specific issue.  Overaggressive OOM killer
> would fit the facts on vulpes/wobbegong, perhaps, though
> it's odd that it only happens on HEAD runs.

chipmunk also:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2019-08-06%2014:16:16

I wondered if the build farm should try to report OOM kill -9 or other signal activity affecting the postmaster.  On some systems (depending on sysctl kernel.dmesg_restrict on Linux, security.bsd.unprivileged_read_msgbuf on FreeBSD etc) you can run dmesg as a non-root user, and there the OOM killer's footprints or signaled exit statuses for processes under init might normally be found, but that seems a bit invasive for the host system (I guess you'd filter it carefully).  Unfortunately it isn't enabled on many common systems anyway.  Maybe there is a systemd-specific way to get the info we need without being root?

Another idea: start the postmaster under a subreaper (Linux 3.4+ prctl(PR_SET_CHILD_SUBREAPER), FreeBSD 10.2+ procctl(PROC_REAP_ACQUIRE)) that exists just to report on its children's exit status, so the build farm could see a "pid XXX was killed by signal 9" message if it is nuked by the OOM killer.  Perhaps there is a common subreaper wrapper out there that would wait, print messages like that, rinse and repeat until it has no children and then exit, or perhaps pg_ctl or even a perl script could do something like that if requested.

Another thought, not explored, is the brand new Linux pidfd stuff that can be used to wait and get exit status for a non-child process (or the older BSD equivalent), but the paint isn't even dry on that stuff anyway.

--
Thomas Munro
https://enterprisedb.com
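For concreteness, here is a minimal sketch of the sort of subreaper wrapper described above, assuming Linux 3.4+ for prctl(PR_SET_CHILD_SUBREAPER).  It runs a command, adopts any orphaned descendants, and reports how each reaped process exited, which is the information the buildfarm currently loses:

/*
 * Minimal subreaper sketch (Linux 3.4+): run a command, adopt any
 * orphaned descendants, and report how each reaped process exited.
 * Error handling is intentionally sparse.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <errno.h>
#include <sys/prctl.h>
#include <sys/wait.h>

int
main(int argc, char **argv)
{
	if (argc < 2)
	{
		fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
		return 1;
	}

	/* Become a subreaper: orphaned grandchildren get reparented to us. */
	if (prctl(PR_SET_CHILD_SUBREAPER, 1L) != 0)
		perror("prctl");

	pid_t		child = fork();

	if (child == 0)
	{
		execvp(argv[1], argv + 1);
		perror("execvp");
		_exit(127);
	}

	/* Reap children until there are none left, reporting each exit. */
	for (;;)
	{
		int			status;
		pid_t		pid = wait(&status);

		if (pid < 0)
		{
			if (errno == ECHILD)
				break;			/* no more descendants */
			continue;
		}
		if (WIFSIGNALED(status))
			fprintf(stderr, "pid %d was killed by signal %d%s\n",
					(int) pid, WTERMSIG(status),
					WCOREDUMP(status) ? " (core dumped)" : "");
		else if (WIFEXITED(status))
			fprintf(stderr, "pid %d exited with status %d\n",
					(int) pid, WEXITSTATUS(status));
	}
	return 0;
}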
Thomas Munro <thomas.munro@gmail.com> writes:
> I wondered if the build farm should try to report OOM kill -9 or other
> signal activity affecting the postmaster.

Yeah, I've been wondering whether pg_ctl could fork off a subprocess that would fork the postmaster, wait for the postmaster to exit, and then report the exit status.  Where to report it *to* seems like the hard part, but maybe an answer that worked for the buildfarm would be enough for now.

			regards, tom lane
On Wed, Aug 7, 2019 at 4:29 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > I wondered if the build farm should try to report OOM kill -9 or other
> > signal activity affecting the postmaster.
>
> Yeah, I've been wondering whether pg_ctl could fork off a subprocess
> that would fork the postmaster, wait for the postmaster to exit, and then
> report the exit status.  Where to report it *to* seems like the hard part,
> but maybe an answer that worked for the buildfarm would be enough for now.

Oh, right, you don't even need subreaper tricks (I was imagining we had a double fork somewhere we don't).

Another question is whether the build farm should be setting the Linux oom score adjust thing.

--
Thomas Munro
https://enterprisedb.com
Thomas Munro <thomas.munro@gmail.com> writes:
> Another question is whether the build farm should be setting the Linux
> oom score adjust thing.

AFAIK you can't do that without being root.

			regards, tom lane
On Wed, Aug 7, 2019 at 5:07 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > Another question is whether the build farm should be setting the Linux
> > oom score adjust thing.
>
> AFAIK you can't do that without being root.

Rats, yeah you need CAP_SYS_RESOURCE or root to lower it.

--
Thomas Munro
https://enterprisedb.com
On 07/08/2019 02:57, Thomas Munro wrote:
> On Wed, Jul 24, 2019 at 5:15 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> So I think I've got to take back the assertion that we've got
>> some lurking generic problem.  This pattern looks way more
>> like a platform-specific issue.  Overaggressive OOM killer
>> would fit the facts on vulpes/wobbegong, perhaps, though
>> it's odd that it only happens on HEAD runs.
>
> chipmunk also:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2019-08-06%2014:16:16
>
> I wondered if the build farm should try to report OOM kill -9 or other
> signal activity affecting the postmaster.

FWIW, I looked at the logs in /var/log/* on chipmunk, and found no evidence of OOM killings.  I can see nothing unusual in the OS logs around the time of that failure.

- Heikki
Heikki Linnakangas <hlinnaka@iki.fi> writes:
> On 07/08/2019 02:57, Thomas Munro wrote:
>> On Wed, Jul 24, 2019 at 5:15 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> So I think I've got to take back the assertion that we've got
>>> some lurking generic problem.  This pattern looks way more
>>> like a platform-specific issue.  Overaggressive OOM killer
>>> would fit the facts on vulpes/wobbegong, perhaps, though
>>> it's odd that it only happens on HEAD runs.

>> chipmunk also:
>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2019-08-06%2014:16:16

> FWIW, I looked at the logs in /var/log/* on chipmunk, and found no
> evidence of OOM killings.  I can see nothing unusual in the OS logs
> around the time of that failure.

Oh, that is very useful info, thanks.  That seems to mean that we should be suspecting a segfault, assertion failure, etc inside the postmaster.  I don't see any TRAP message in chipmunk's log, so assertion failure seems to be ruled out, but other sorts of process-crashing errors would fit the facts.

A stack trace from the crash would be mighty useful info along about here.  I wonder whether chipmunk has the infrastructure needed to create such a thing.  From memory, the buildfarm requires gdb for that, but not sure if there are additional requirements.

Also, if you're using systemd or something else that thinks it ought to interfere with where cores get dropped, that could be a problem.

			regards, tom lane
On 07/08/2019 16:57, Tom Lane wrote:
> Heikki Linnakangas <hlinnaka@iki.fi> writes:
>> On 07/08/2019 02:57, Thomas Munro wrote:
>>> On Wed, Jul 24, 2019 at 5:15 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>>> So I think I've got to take back the assertion that we've got
>>>> some lurking generic problem.  This pattern looks way more
>>>> like a platform-specific issue.  Overaggressive OOM killer
>>>> would fit the facts on vulpes/wobbegong, perhaps, though
>>>> it's odd that it only happens on HEAD runs.
>
>>> chipmunk also:
>>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2019-08-06%2014:16:16
>
>> FWIW, I looked at the logs in /var/log/* on chipmunk, and found no
>> evidence of OOM killings.  I can see nothing unusual in the OS logs
>> around the time of that failure.
>
> Oh, that is very useful info, thanks.  That seems to mean that we
> should be suspecting a segfault, assertion failure, etc inside
> the postmaster.  I don't see any TRAP message in chipmunk's log,
> so assertion failure seems to be ruled out, but other sorts of
> process-crashing errors would fit the facts.
>
> A stack trace from the crash would be mighty useful info along
> about here.  I wonder whether chipmunk has the infrastructure
> needed to create such a thing.  From memory, the buildfarm requires
> gdb for that, but not sure if there are additional requirements.

It does have gdb installed.

> Also, if you're using systemd or something else that thinks it
> ought to interfere with where cores get dropped, that could be
> a problem.

I think they should just go to a file called "core"; I don't think I've changed any settings related to it, at least.  I tried "find / -name core*", but didn't find any core files, though.

- Heikki
Heikki Linnakangas <hlinnaka@iki.fi> writes:
> On 07/08/2019 16:57, Tom Lane wrote:
>> Also, if you're using systemd or something else that thinks it
>> ought to interfere with where cores get dropped, that could be
>> a problem.

> I think they should just go to a file called "core", I don't think I've
> changed any settings related to it, at least.  I tried "find / -name
> core*", but didn't find any core files, though.

On Linux machines the first thing to check is

cat /proc/sys/kernel/core_pattern

On a Debian machine I have handy, that just says "core", but Red Hat tends to mess with it ...

			regards, tom lane
On 07/08/2019 17:45, Tom Lane wrote:
> Heikki Linnakangas <hlinnaka@iki.fi> writes:
>> On 07/08/2019 16:57, Tom Lane wrote:
>>> Also, if you're using systemd or something else that thinks it
>>> ought to interfere with where cores get dropped, that could be
>>> a problem.
>
>> I think they should just go to a file called "core", I don't think I've
>> changed any settings related to it, at least.  I tried "find / -name
>> core*", but didn't find any core files, though.
>
> On Linux machines the first thing to check is
>
> cat /proc/sys/kernel/core_pattern
>
> On a Debian machine I have handy, that just says "core", but Red Hat
> tends to mess with it ...

It's just "core" on chipmunk, too.

- Heikki
Thomas Munro <thomas.munro@gmail.com> writes:
> On Wed, Aug 7, 2019 at 4:29 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Yeah, I've been wondering whether pg_ctl could fork off a subprocess
>> that would fork the postmaster, wait for the postmaster to exit, and then
>> report the exit status.  Where to report it *to* seems like the hard part,
>> but maybe an answer that worked for the buildfarm would be enough for now.

> Oh, right, you don't even need subreaper tricks (I was imagining we
> had a double fork somewhere we don't).

I got around to looking at how to do this.  Seeing that chipmunk hasn't failed again, I'm inclined to write that off as perhaps unrelated.  That leaves us to diagnose the pg_upgrade failures on wobbegong and vulpes.  The pg_upgrade test uses pg_ctl to start the postmaster, and the only simple way to wedge this requirement into pg_ctl is as attached.

Now, the attached is completely *not* suitable as a permanent patch, because it degrades or breaks a number of pg_ctl behaviors that rely on knowing the postmaster's actual PID rather than that of the parent shell.  But it gets through check-world, so I think we can stick it in transiently to see what it can teach us about the buildfarm failures.  Given wobbegong's recent failure rate, I don't think we'll have to wait long.

Some notes about the patch:

* The core idea is to change start_postmaster's shell invocation so that the shell doesn't just exec the postmaster, but runs a mini shell script that runs the postmaster and then reports its exit status.  I found that this still needed a dummy exec to force the shell to perform the I/O redirections on itself, else pg_ctl's TAP tests fail.  (I think what was happening was that if the shell continued to hold open its original stdin, IPC::Run didn't believe the command was done.)

* This means that what start_postmaster returns is not the postmaster's own PID, but that of the parent shell.  So we have to lobotomize wait_for_postmaster to handle the PID the same way as on Windows (where that was already true); it can't test for exact equality between the child process PID and what's in postmaster.pid.  (trap_sigint_during_startup is also broken, but we don't need that to work to get through the regression tests.)

* That makes recovery/t/017_shm.pl fail, because there's a race condition: after killing the old postmaster, the existing postmaster.pid is enough to fool "pg_ctl start" into thinking the new postmaster is already running.  I fixed that by making pg_ctl reject any PID seen in a pre-existing postmaster.pid file.  That has a nonzero probability of false match, so I would not want to stick it in as a permanent thing on Unix ... but I wonder if it wouldn't be an improvement over the current situation on Windows.

			regards, tom lane

diff --git a/src/bin/pg_ctl/pg_ctl.c b/src/bin/pg_ctl/pg_ctl.c
index dd76be6..316651c 100644
--- a/src/bin/pg_ctl/pg_ctl.c
+++ b/src/bin/pg_ctl/pg_ctl.c
@@ -106,6 +106,7 @@ static char promote_file[MAXPGPATH];
 static char logrotate_file[MAXPGPATH];
 
 static volatile pgpid_t postmasterPID = -1;
+static pgpid_t old_postmaster_pid = 0;
 
 #ifdef WIN32
 static DWORD pgctl_start_type = SERVICE_AUTO_START;
@@ -490,16 +491,17 @@ start_postmaster(void)
 	/*
 	 * Since there might be quotes to handle here, it is easier simply to pass
-	 * everything to a shell to process them.  Use exec so that the postmaster
-	 * has the same PID as the current child process.
+	 * everything to a shell to process them.
+	 *
+	 * Since we aren't telling the shell to directly exec the postmaster,
+	 * the returned PID is a parent process, the same as on Windows.
 	 */
 	if (log_file != NULL)
-		snprintf(cmd, MAXPGPATH, "exec \"%s\" %s%s < \"%s\" >> \"%s\" 2>&1",
-				 exec_path, pgdata_opt, post_opts,
-				 DEVNULL, log_file);
+		snprintf(cmd, MAXPGPATH, "exec < \"%s\" >> \"%s\" 2>&1; \"%s\" %s%s; echo postmaster exit status is $?",
+				 DEVNULL, log_file, exec_path, pgdata_opt, post_opts);
 	else
-		snprintf(cmd, MAXPGPATH, "exec \"%s\" %s%s < \"%s\" 2>&1",
-				 exec_path, pgdata_opt, post_opts, DEVNULL);
+		snprintf(cmd, MAXPGPATH, "exec < \"%s\" 2>&1; \"%s\" %s%s; echo postmaster exit status is $?",
+				 DEVNULL, exec_path, pgdata_opt, post_opts);
 
 	(void) execl("/bin/sh", "/bin/sh", "-c", cmd, (char *) NULL);
@@ -586,12 +588,8 @@ wait_for_postmaster(pgpid_t pm_pid, bool do_checkpoint)
 			pmpid = atol(optlines[LOCK_FILE_LINE_PID - 1]);
 			pmstart = atol(optlines[LOCK_FILE_LINE_START_TIME - 1]);
 			if (pmstart >= start_time - 2 &&
-#ifndef WIN32
-				pmpid == pm_pid
-#else
-				/* Windows can only reject standalone-backend PIDs */
-				pmpid > 0
-#endif
+				/* If pid is the value we saw before starting, assume it's stale */
+				pmpid > 0 && pmpid != old_postmaster_pid
 				)
 			{
 				/*
@@ -621,7 +619,7 @@ wait_for_postmaster(pgpid_t pm_pid, bool do_checkpoint)
 		 * Check whether the child postmaster process is still alive.  This
 		 * lets us exit early if the postmaster fails during startup.
 		 *
-		 * On Windows, we may be checking the postmaster's parent shell, but
+		 * We may be checking the postmaster's parent shell, but
 		 * that's fine for this purpose.
 		 */
 #ifndef WIN32
@@ -823,13 +821,12 @@ do_init(void)
 static void
 do_start(void)
 {
-	pgpid_t		old_pid = 0;
 	pgpid_t		pm_pid;
 
 	if (ctl_command != RESTART_COMMAND)
 	{
-		old_pid = get_pgpid(false);
-		if (old_pid != 0)
+		old_postmaster_pid = get_pgpid(false);
+		if (old_postmaster_pid != 0)
 			write_stderr(_("%s: another server might be running; "
						   "trying to start server anyway\n"),
						 progname);
On Tue, Jul 23, 2019 at 7:29 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Parallel workers aren't ever allowed to write, in the current
> implementation, so it's not real obvious why they'd have any
> WAL log files open at all.

Parallel workers are not forbidden to write WAL, nor are they forbidden to modify blocks.  They could legally HOT-prune, for example, though I'm not positive whether they actually do.  The prohibition is at a higher level: they can't create new tuples or delete existing ones.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
I wrote:
>>> Yeah, I've been wondering whether pg_ctl could fork off a subprocess
>>> that would fork the postmaster, wait for the postmaster to exit, and then
>>> report the exit status.

> [ pushed at 6a5084eed ]
> Given wobbegong's recent failure rate, I don't think we'll have to wait
> long.

Indeed, we didn't:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wobbegong&dt=2019-10-10%2020%3A54%3A46

The tail end of the system log looks like

2019-10-10 21:00:33.717 UTC [15127:306] pg_regress/date FATAL:  postmaster exited during a parallel transaction
2019-10-10 21:00:33.717 UTC [15127:307] pg_regress/date LOG:  disconnection: session time: 0:00:02.896 user=fedora database=regression host=[local]
/bin/sh: line 1: 14168 Segmentation fault      (core dumped) "/home/fedora/build-farm-10-clang/buildroot/HEAD/pgsql.build/tmp_install/home/fedora/build-farm-clang/buildroot/HEAD/inst/bin/postgres" -F -c listen_addresses="" -k "/tmp/pg_upgrade_check-ZrhQ4h"
postmaster exit status is 139

So that's definitive proof that the postmaster is suffering a SIGSEGV.  Unfortunately, we weren't blessed with a stack trace, even though wobbegong is running a buildfarm client version that is new enough to try to collect one.  However, seeing that wobbegong is running a pretty-recent Fedora release, the odds are that systemd-coredump has commandeered the core dump and squirreled it someplace where we can't find it.

Much as one could wish otherwise, systemd doesn't seem likely to either go away or scale back its invasiveness; so I'm afraid we are probably going to need to teach the buildfarm client how to negotiate with systemd-coredump at some point.  I don't much want to do that right this minute, though.

A nearer-term solution would be to reproduce this manually and dig into the core.  Mark, are you in a position to give somebody ssh access to wobbegong's host, or another similarly-configured VM?

(While at it, it'd be nice to investigate the infinite_recurse failures we've been seeing on all those ppc64 critters ...)

			regards, tom lane
On Thu, Oct 10, 2019 at 05:34:51PM -0400, Tom Lane wrote:
> A nearer-term solution would be to reproduce this manually and
> dig into the core.  Mark, are you in a position to give somebody
> ssh access to wobbegong's host, or another similarly-configured VM?
>
> (While at it, it'd be nice to investigate the infinite_recurse
> failures we've been seeing on all those ppc64 critters ...)

Yeah, whoever would like access, just send me your ssh key and the login you'd like to use, and I'll get you set up.

Regards,
Mark

--
Mark Wong
2ndQuadrant - PostgreSQL Solutions for the Enterprise
https://www.2ndQuadrant.com/
On 10/10/19 5:34 PM, Tom Lane wrote:
> I wrote:
>>>> Yeah, I've been wondering whether pg_ctl could fork off a subprocess
>>>> that would fork the postmaster, wait for the postmaster to exit, and then
>>>> report the exit status.
>> [ pushed at 6a5084eed ]
>> Given wobbegong's recent failure rate, I don't think we'll have to wait
>> long.
> Indeed, we didn't:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wobbegong&dt=2019-10-10%2020%3A54%3A46
>
> The tail end of the system log looks like
>
> 2019-10-10 21:00:33.717 UTC [15127:306] pg_regress/date FATAL:  postmaster exited during a parallel transaction
> 2019-10-10 21:00:33.717 UTC [15127:307] pg_regress/date LOG:  disconnection: session time: 0:00:02.896 user=fedora database=regression host=[local]
> /bin/sh: line 1: 14168 Segmentation fault      (core dumped) "/home/fedora/build-farm-10-clang/buildroot/HEAD/pgsql.build/tmp_install/home/fedora/build-farm-clang/buildroot/HEAD/inst/bin/postgres" -F -c listen_addresses="" -k "/tmp/pg_upgrade_check-ZrhQ4h"
> postmaster exit status is 139
>
> So that's definitive proof that the postmaster is suffering a SIGSEGV.
> Unfortunately, we weren't blessed with a stack trace, even though
> wobbegong is running a buildfarm client version that is new enough
> to try to collect one.  However, seeing that wobbegong is running
> a pretty-recent Fedora release, the odds are that systemd-coredump
> has commandeered the core dump and squirreled it someplace where
> we can't find it.

At least on F29 I have set /proc/sys/kernel/core_pattern and it works.

> Much as one could wish otherwise, systemd doesn't seem likely to
> either go away or scale back its invasiveness; so I'm afraid we
> are probably going to need to teach the buildfarm client how to
> negotiate with systemd-coredump at some point.  I don't much want
> to do that right this minute, though.
>
> A nearer-term solution would be to reproduce this manually and
> dig into the core.  Mark, are you in a position to give somebody
> ssh access to wobbegong's host, or another similarly-configured VM?

I have given Mark my SSH key.  That doesn't mean others interested shouldn't.

> (While at it, it'd be nice to investigate the infinite_recurse
> failures we've been seeing on all those ppc64 critters ...)

Yeah.

cheers

andrew

--
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 10/10/19 6:01 PM, Andrew Dunstan wrote:
> On 10/10/19 5:34 PM, Tom Lane wrote:
>> I wrote:
>>>>> Yeah, I've been wondering whether pg_ctl could fork off a subprocess
>>>>> that would fork the postmaster, wait for the postmaster to exit, and then
>>>>> report the exit status.
>>> [ pushed at 6a5084eed ]
>>> Given wobbegong's recent failure rate, I don't think we'll have to wait
>>> long.
>> Indeed, we didn't:
>>
>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wobbegong&dt=2019-10-10%2020%3A54%3A46
>>
>> The tail end of the system log looks like
>>
>> 2019-10-10 21:00:33.717 UTC [15127:306] pg_regress/date FATAL:  postmaster exited during a parallel transaction
>> 2019-10-10 21:00:33.717 UTC [15127:307] pg_regress/date LOG:  disconnection: session time: 0:00:02.896 user=fedora database=regression host=[local]
>> /bin/sh: line 1: 14168 Segmentation fault      (core dumped) "/home/fedora/build-farm-10-clang/buildroot/HEAD/pgsql.build/tmp_install/home/fedora/build-farm-clang/buildroot/HEAD/inst/bin/postgres" -F -c listen_addresses="" -k "/tmp/pg_upgrade_check-ZrhQ4h"
>> postmaster exit status is 139
>>
>> So that's definitive proof that the postmaster is suffering a SIGSEGV.
>> Unfortunately, we weren't blessed with a stack trace, even though
>> wobbegong is running a buildfarm client version that is new enough
>> to try to collect one.  However, seeing that wobbegong is running
>> a pretty-recent Fedora release, the odds are that systemd-coredump
>> has commandeered the core dump and squirreled it someplace where
>> we can't find it.
>
> At least on F29 I have set /proc/sys/kernel/core_pattern and it works.

I have done the same on this machine.  wobbegong runs every hour, so let's see what happens next.  With any luck the buildfarm will give us a stack trace without needing further action.

cheers

andrew

--
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes: >> At least on F29 I have set /proc/sys/kernel/core_pattern and it works. FWIW, I'm not excited about that as a permanent solution. It requires root privilege, and it affects the whole machine not only the buildfarm, and making it persist across reboots is even more invasive. > I have done the same on this machine. wobbegong runs every hour, so > let's see what happens next. With any luck the buildfarm will give us a > stack trace without needing further action. I already collected one manually. It looks like this: Program terminated with signal SIGSEGV, Segmentation fault. #0 sigusr1_handler (postgres_signal_arg=10) at postmaster.c:5114 5114 { Missing separate debuginfos, use: dnf debuginfo-install glibc-2.26-30.fc27.ppc64le (gdb) bt #0 sigusr1_handler (postgres_signal_arg=10) at postmaster.c:5114 #1 <signal handler called> #2 0x00007fff93923ca4 in sigprocmask () from /lib64/libc.so.6 #3 0x00000000103fad08 in reaper (postgres_signal_arg=<optimized out>) at postmaster.c:3215 #4 <signal handler called> #5 0x00007fff93923ca4 in sigprocmask () from /lib64/libc.so.6 #6 0x00000000103f9f98 in sigusr1_handler (postgres_signal_arg=<optimized out>) at postmaster.c:5275 #7 <signal handler called> #8 0x00007fff93923ca4 in sigprocmask () from /lib64/libc.so.6 #9 0x00000000103fad08 in reaper (postgres_signal_arg=<optimized out>) at postmaster.c:3215 #10 <signal handler called> #11 sigusr1_handler (postgres_signal_arg=10) at postmaster.c:5114 #12 <signal handler called> #13 0x00007fff93923ca4 in sigprocmask () from /lib64/libc.so.6 #14 0x00000000103f9f98 in sigusr1_handler (postgres_signal_arg=<optimized out>) at postmaster.c:5275 #15 <signal handler called> #16 0x00007fff93923ca4 in sigprocmask () from /lib64/libc.so.6 #17 0x00000000103fad08 in reaper (postgres_signal_arg=<optimized out>) at postmaster.c:3215 ... #572 <signal handler called> #573 0x00007fff93923ca4 in sigprocmask () from /lib64/libc.so.6 #574 0x00000000103f9f98 in sigusr1_handler ( postgres_signal_arg=<optimized out>) at postmaster.c:5275 #575 <signal handler called> #576 0x00007fff93923ca4 in sigprocmask () from /lib64/libc.so.6 #577 0x00000000103fad08 in reaper (postgres_signal_arg=<optimized out>) at postmaster.c:3215 #578 <signal handler called> #579 sigusr1_handler (postgres_signal_arg=10) at postmaster.c:5114 #580 <signal handler called> #581 0x00007fff93a01514 in select () from /lib64/libc.so.6 #582 0x00000000103f7ad8 in ServerLoop () at postmaster.c:1682 #583 PostmasterMain (argc=<optimized out>, argv=<optimized out>) at postmaster.c:1391 #584 0x0000000000000000 in ?? () What we've apparently got here is that signals were received so fast that the postmaster ran out of stack space. I remember Andres complaining about this as a theoretical threat, but I hadn't seen it in the wild before. I haven't finished investigating though, as there are some things that remain to be explained. The dependency on having force_parallel_mode = regress makes sense now, because the extra traffic to launch and reap all those parallel workers would increase the stress on the postmaster (and it seems likely that this stack trace corresponds exactly to alternating launch and reap signals). But why does it only happen during the pg_upgrade test --- plain "make check" ought to be about the same? I also want to investigate why clang builds seem more prone to this than gcc builds on the same machine; that might just be down to more or less stack consumption, but it bears looking into. 
regards, tom lane
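To see the mechanism Tom describes in isolation, here is a minimal standalone sketch (not PostgreSQL code, and not part of the original thread; all names in it are made up for illustration). The handler re-enables its own signal before returning, the way the postmaster's handlers called PG_SETMASK(&UnBlockSig), so a pending signal re-enters the handler on top of the still-live frame and the stack grows with each nesting level:

#include <signal.h>
#include <stdio.h>
#include <string.h>

static volatile sig_atomic_t depth = 0;
static volatile sig_atomic_t max_depth = 0;

static void
handler(int signo)
{
	sigset_t	unblock;

	depth++;
	if (depth > max_depth)
		max_depth = depth;

	/* Queue another SIGUSR1; it stays pending for now, because the
	 * kernel blocks the signal being handled while its handler runs. */
	if (max_depth < 5)
		raise(SIGUSR1);

	/* Unblocking here is the analogue of PG_SETMASK(&UnBlockSig): the
	 * pending SIGUSR1 is delivered immediately, nesting a new handler
	 * frame on top of this one before we return. */
	sigemptyset(&unblock);
	sigaddset(&unblock, SIGUSR1);
	sigprocmask(SIG_UNBLOCK, &unblock, NULL);

	depth--;
}

int
main(void)
{
	struct sigaction act;

	memset(&act, 0, sizeof(act));
	act.sa_handler = handler;
	sigemptyset(&act.sa_mask);
	act.sa_flags = 0;
	sigaction(SIGUSR1, &act, NULL);

	raise(SIGUSR1);
	printf("deepest handler nesting observed: %d\n", (int) max_depth);
	return 0;
}

Each nesting level costs a full signal frame of stack, which is the multiplication effect visible in the roughly 580-frame backtrace above; if the handler never unblocked the signal itself, or if sigaction() were told to keep the other signals blocked via sa_mask, the pending signal would only be delivered after the current frame had been popped.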
On 10/11/19 11:45 AM, Tom Lane wrote: > Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes: >>> At least on F29 I have set /proc/sys/kernel/core_pattern and it works. > FWIW, I'm not excited about that as a permanent solution. It requires > root privilege, and it affects the whole machine not only the buildfarm, > and making it persist across reboots is even more invasive. OK, but I'm not keen to have to tussle with coredumpctl. Right now our logic says: for every core file in the data directory try to get a backtrace. Use of systemd-coredump means that gets blown out of the water, and we no longer even have a simple test to see if our program caused a core dump. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
I wrote: > What we've apparently got here is that signals were received > so fast that the postmaster ran out of stack space. I remember > Andres complaining about this as a theoretical threat, but I > hadn't seen it in the wild before. > I haven't finished investigating though, as there are some things > that remain to be explained. I still don't have a good explanation for why this only seems to happen in the pg_upgrade test sequence. However, I did notice something very interesting: the postmaster crashes after consuming only about 1MB of stack space. This is despite the prevailing setting of "ulimit -s" being 8192 (8MB). I also confirmed that the value of max_stack_depth within the crashed process is 2048, which implies that get_stack_depth_rlimit got some value larger than 2MB from getrlimit(RLIMIT_STACK). And yet, here we have a crash, and the process memory map confirms that only 1MB was allocated in the stack region. So it's really hard to explain that as anything except a kernel bug: sometimes, the kernel doesn't give us as much stack as it promised it would. And the machine is not loaded enough for there to be any rational resource-exhaustion excuse for that. This matches up with the intermittent infinite_recurse failures we've been seeing in the buildfarm. Those are happening across a range of systems, but they're (almost) all Linux-based ppc64, suggesting that there's a longstanding arch-specific kernel bug involved. For reference, I scraped the attached list of such failures in the last three months. I wonder whether we can get the attention of any kernel hackers about that. Anyway, as to what to do about it --- it occurred to me to wonder why we are relying on having the signal handlers block and unblock signals manually, when we could tell sigaction() that we'd like signals blocked. It is reasonable to expect that the signal support is designed to not recursively consume stack space in the face of a series of signals, while the way we are doing it clearly opens us up to recursive space consumption. The stack trace I showed before proves that the recursion happens at the points where the signal handlers unblock signals. As a quick hack I made the attached patch, and it seems to fix the problem on wobbegong's host. I don't see crashes any more, and watching the postmaster's stack space consumption, it stays comfortably at a tad under 200KB (probably the default initial allocation), while without the patch it tends to blow up to 700K or more even in runs that don't crash. This patch isn't committable as-is because it will (I suppose) break things on Windows; we still need the old way there for lack of sigaction(). But that could be fixed with a few #ifdefs. I'm also kind of tempted to move pqsignal_no_restart into backend/libpq/pqsignal.c (where BlockSig is defined) and maybe rename it, but I'm not sure to what. This issue might go away if we switched to a postmaster implementation that doesn't do work in the signal handlers, but I'm not entirely convinced of that. The existing handlers don't seem to consume a lot of stack space in themselves (there's not many local variables in them). The bulk of the stack consumption is seemingly in the platform's signal infrastructure, so that we might still have a stack consumption issue even with fairly trivial handlers, if we don't tell sigaction to block signals. In any case, this fix seems potentially back-patchable, while we surely wouldn't risk back-patching a postmaster rewrite. Comments? 
regards, tom lane sysname | architecture | operating_system | sys_owner | branch | snapshot | stage | err --------------+------------------+--------------------+-----------------+---------------+---------------------+-----------------+--------------------------------------------------------------------------------------------------------------- cavefish | ppc64le (POWER9) | Ubuntu | Mark Wong | HEAD | 2019-07-13 03:49:38 | pg_upgradeCheck| 2019-07-13 04:01:23.437 UTC [9365:71] DETAIL: Failed process was running: select infinite_recurse(); pintail | ppc64le (POWER9) | Debian GNU/Linux | Mark Wong | REL_12_STABLE | 2019-07-13 19:36:51 | Check | 2019-07-13 19:39:29.013 UTC [31086:5] DETAIL: Failed process was running: select infinite_recurse(); bonito | ppc64le (POWER9) | Fedora | Mark Wong | HEAD | 2019-07-19 23:13:01 | Check | 2019-07-19 23:16:33.330 UTC [24191:70] DETAIL: Failed process was running: select infinite_recurse(); takin | ppc64le | opensuse | Mark Wong | HEAD | 2019-07-24 08:24:56 | Check | 2019-07-24 08:28:01.735 UTC [16366:75] DETAIL: Failed process was running: select infinite_recurse(); quokka | ppc64 | RHEL | Sandeep Thakkar | HEAD | 2019-07-31 02:00:07 | pg_upgradeCheck| 2019-07-31 03:04:04.043 BST [5d40f709.776a:5] DETAIL: Failed process was running: select infinite_recurse(); elasmobranch | ppc64le (POWER9) | openSUSE Leap | Mark Wong | HEAD | 2019-08-01 03:13:38 | Check | 2019-08-01 03:19:05.394 UTC [22888:62] DETAIL: Failed process was running: select infinite_recurse(); buri | ppc64le (POWER9) | CentOS Linux | Mark Wong | HEAD | 2019-08-02 00:10:23 | Check | 2019-08-02 00:17:11.075 UTC [28222:73] DETAIL: Failed process was running: select infinite_recurse(); urocryon | ppc64le | debian | Mark Wong | HEAD | 2019-08-02 05:43:46 | Check | 2019-08-02 05:51:51.944 UTC [2724:64] DETAIL: Failed process was running: select infinite_recurse(); batfish | ppc64le | Ubuntu | Mark Wong | HEAD | 2019-08-04 19:02:36 | pg_upgradeCheck| 2019-08-04 19:08:11.728 UTC [23899:79] DETAIL: Failed process was running: select infinite_recurse(); buri | ppc64le (POWER9) | CentOS Linux | Mark Wong | REL_12_STABLE | 2019-08-07 00:03:29 | pg_upgradeCheck| 2019-08-07 00:11:24.500 UTC [1405:5] DETAIL: Failed process was running: select infinite_recurse(); quokka | ppc64 | RHEL | Sandeep Thakkar | REL_12_STABLE | 2019-08-08 02:43:45 | pg_upgradeCheck| 2019-08-08 03:47:38.115 BST [5d4b8d3f.cdd7:5] DETAIL: Failed process was running: select infinite_recurse(); quokka | ppc64 | RHEL | Sandeep Thakkar | HEAD | 2019-08-08 14:00:08 | Check | 2019-08-08 15:02:59.770 BST [5d4c2b88.cad9:5] DETAIL: Failed process was running: select infinite_recurse(); mereswine | ARMv7 | Linux debian-armhf | Clarence Ho | REL_11_STABLE | 2019-08-11 02:10:12 | InstallCheck-C | 2019-08-11 02:36:10.159 PDT [5004:4] DETAIL: Failed process was running: select infinite_recurse(); takin | ppc64le | opensuse | Mark Wong | HEAD | 2019-08-11 08:02:48 | Check | 2019-08-11 08:05:57.789 UTC [11500:67] DETAIL: Failed process was running: select infinite_recurse(); mereswine | ARMv7 | Linux debian-armhf | Clarence Ho | REL_12_STABLE | 2019-08-11 09:52:46 | pg_upgradeCheck| 2019-08-11 04:21:16.756 PDT [6804:5] DETAIL: Failed process was running: select infinite_recurse(); mereswine | ARMv7 | Linux debian-armhf | Clarence Ho | HEAD | 2019-08-11 11:29:27 | pg_upgradeCheck| 2019-08-11 07:15:28.454 PDT [9954:76] DETAIL: Failed process was running: select infinite_recurse(); demoiselle | ppc64le (POWER9) | openSUSE Leap | Mark Wong | HEAD | 
2019-08-11 14:51:38 | pg_upgradeCheck| 2019-08-11 14:57:29.422 UTC [9436:70] DETAIL: Failed process was running: select infinite_recurse(); buri | ppc64le (POWER9) | CentOS Linux | Mark Wong | HEAD | 2019-08-15 00:09:57 | Check | 2019-08-15 00:17:43.282 UTC [2831:68] DETAIL: Failed process was running: select infinite_recurse(); urocryon | ppc64le | debian | Mark Wong | HEAD | 2019-08-19 06:28:34 | Check | 2019-08-19 06:39:25.749 UTC [26357:66] DETAIL: Failed process was running: select infinite_recurse(); urocryon | ppc64le | debian | Mark Wong | HEAD | 2019-08-21 06:34:47 | Check | 2019-08-21 06:37:39.089 UTC [14505:73] DETAIL: Failed process was running: select infinite_recurse(); demoiselle | ppc64le (POWER9) | openSUSE Leap | Mark Wong | REL_12_STABLE | 2019-09-04 14:42:08 | pg_upgradeCheck| 2019-09-04 14:56:15.219 UTC [11008:5] DETAIL: Failed process was running: select infinite_recurse(); pintail | ppc64le (POWER9) | Debian GNU/Linux | Mark Wong | REL_12_STABLE | 2019-09-07 19:22:48 | pg_upgradeCheck| 2019-09-07 19:27:20.789 UTC [25645:5] DETAIL: Failed process was running: select infinite_recurse(); quokka | ppc64 | RHEL | Sandeep Thakkar | REL_12_STABLE | 2019-09-10 02:00:15 | Check | 2019-09-10 03:03:17.711 BST [5d77045a.5776:5] DETAIL: Failed process was running: select infinite_recurse(); buri | ppc64le (POWER9) | CentOS Linux | Mark Wong | HEAD | 2019-09-17 23:12:33 | Check | 2019-09-17 23:19:45.769 UTC [20920:77] DETAIL: Failed process was running: select infinite_recurse(); shoveler | ppc64le (POWER8) | Debian GNU/Linux | Mark Wong | HEAD | 2019-09-18 13:52:55 | Check | 2019-09-18 13:56:11.273 UTC [563:71] DETAIL: Failed process was running: select infinite_recurse(); buri | ppc64le (POWER9) | CentOS Linux | Mark Wong | HEAD | 2019-09-19 00:01:54 | Check | 2019-09-19 00:09:30.734 UTC [11775:67] DETAIL: Failed process was running: select infinite_recurse(); gadwall | ppc64le (POWER9) | Debian GNU/Linux | Mark Wong | HEAD | 2019-09-21 12:26:50 | Check | 2019-09-21 12:31:16.199 UTC [7119:70] DETAIL: Failed process was running: select infinite_recurse(); quokka | ppc64 | RHEL | Sandeep Thakkar | HEAD | 2019-09-24 14:00:11 | pg_upgradeCheck| 2019-09-24 15:04:49.272 BST [5d8a2276.cba9:5] DETAIL: Failed process was running: select infinite_recurse(); urocryon | ppc64le | debian | Mark Wong | HEAD | 2019-09-25 06:24:24 | Check | 2019-09-25 06:31:54.876 UTC [26608:76] DETAIL: Failed process was running: select infinite_recurse(); pintail | ppc64le (POWER9) | Debian GNU/Linux | Mark Wong | HEAD | 2019-09-26 19:33:59 | Check | 2019-09-26 19:39:25.850 UTC [6259:69] DETAIL: Failed process was running: select infinite_recurse(); shoveler | ppc64le (POWER8) | Debian GNU/Linux | Mark Wong | HEAD | 2019-09-28 13:54:16 | Check | 2019-09-28 13:59:02.354 UTC [7586:71] DETAIL: Failed process was running: select infinite_recurse(); buri | ppc64le (POWER9) | CentOS Linux | Mark Wong | REL_12_STABLE | 2019-09-28 23:14:23 | pg_upgradeCheck| 2019-09-28 23:22:13.987 UTC [20133:5] DETAIL: Failed process was running: select infinite_recurse(); gadwall | ppc64le (POWER9) | Debian GNU/Linux | Mark Wong | HEAD | 2019-10-02 12:44:49 | Check | 2019-10-02 12:50:17.823 UTC [10840:76] DETAIL: Failed process was running: select infinite_recurse(); cavefish | ppc64le (POWER9) | Ubuntu | Mark Wong | HEAD | 2019-10-04 04:37:58 | Check | 2019-10-04 04:46:03.804 UTC [27768:69] DETAIL: Failed process was running: select infinite_recurse(); cavefish | ppc64le (POWER9) | Ubuntu | Mark Wong | HEAD | 2019-10-07 03:22:37 | 
pg_upgradeCheck| 2019-10-07 03:28:05.031 UTC [2991:68] DETAIL: Failed process was running: select infinite_recurse(); bufflehead | ppc64le (POWER8) | openSUSE Leap | Mark Wong | HEAD | 2019-10-09 20:46:56 | pg_upgradeCheck| 2019-10-09 20:51:47.408 UTC [18136:86] DETAIL: Failed process was running: select infinite_recurse(); vulpes | ppc64le | fedora | Mark Wong | HEAD | 2019-10-11 08:53:50 | Check | 2019-10-11 08:57:59.370 UTC [14908:77] DETAIL: Failed process was running: select infinite_recurse(); shoveler | ppc64le (POWER8) | Debian GNU/Linux | Mark Wong | HEAD | 2019-10-11 13:54:38 | pg_upgradeCheck| 2019-10-11 14:01:53.903 UTC [5911:76] DETAIL: Failed process was running: select infinite_recurse(); (38 rows) diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c index 85f15a5..fff83b7 100644 --- a/src/backend/postmaster/postmaster.c +++ b/src/backend/postmaster/postmaster.c @@ -2640,8 +2640,6 @@ SIGHUP_handler(SIGNAL_ARGS) { int save_errno = errno; - PG_SETMASK(&BlockSig); - if (Shutdown <= SmartShutdown) { ereport(LOG, @@ -2700,8 +2698,6 @@ SIGHUP_handler(SIGNAL_ARGS) #endif } - PG_SETMASK(&UnBlockSig); - errno = save_errno; } @@ -2714,8 +2710,6 @@ pmdie(SIGNAL_ARGS) { int save_errno = errno; - PG_SETMASK(&BlockSig); - ereport(DEBUG2, (errmsg_internal("postmaster received signal %d", postgres_signal_arg))); @@ -2880,8 +2874,6 @@ pmdie(SIGNAL_ARGS) break; } - PG_SETMASK(&UnBlockSig); - errno = save_errno; } @@ -2895,8 +2887,6 @@ reaper(SIGNAL_ARGS) int pid; /* process id of dead child process */ int exitstatus; /* its exit status */ - PG_SETMASK(&BlockSig); - ereport(DEBUG4, (errmsg_internal("reaping dead processes"))); @@ -3212,8 +3202,6 @@ reaper(SIGNAL_ARGS) PostmasterStateMachine(); /* Done with signal handler */ - PG_SETMASK(&UnBlockSig); - errno = save_errno; } @@ -5114,8 +5102,6 @@ sigusr1_handler(SIGNAL_ARGS) { int save_errno = errno; - PG_SETMASK(&BlockSig); - /* Process background worker state change. */ if (CheckPostmasterSignal(PMSIGNAL_BACKGROUND_WORKER_CHANGE)) { @@ -5272,8 +5258,6 @@ sigusr1_handler(SIGNAL_ARGS) signal_child(StartupPID, SIGUSR2); } - PG_SETMASK(&UnBlockSig); - errno = save_errno; } diff --git a/src/port/pqsignal.c b/src/port/pqsignal.c index ecb9ca2..93a039b 100644 --- a/src/port/pqsignal.c +++ b/src/port/pqsignal.c @@ -65,7 +65,11 @@ pqsignal(int signo, pqsigfunc func) * * On Windows, this would be identical to pqsignal(), so don't bother. */ -#ifndef WIN32 +#ifndef FRONTEND + +extern sigset_t UnBlockSig, + BlockSig, + StartupBlockSig; pqsigfunc pqsignal_no_restart(int signo, pqsigfunc func) @@ -74,7 +78,7 @@ pqsignal_no_restart(int signo, pqsigfunc func) oact; act.sa_handler = func; - sigemptyset(&act.sa_mask); + act.sa_mask = BlockSig; act.sa_flags = 0; #ifdef SA_NOCLDSTOP if (signo == SIGCHLD)
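The attached patch is hard to read in the flattened form above, so here is a rough standalone sketch of the idea (a paraphrase for illustration, not the committed change; set_handler_blocking is an invented name): the PG_SETMASK(&BlockSig)/PG_SETMASK(&UnBlockSig) pairs come out of the postmaster's handlers, and sigaction() is asked to hold the relevant signals blocked for the duration of the handler instead.

#include <signal.h>

typedef void (*pqsigfunc) (int);

/* Install func for signo such that every signal in *blockset is blocked
 * while func executes, letting the kernel do the masking that the
 * handler bodies used to do by hand. */
static pqsigfunc
set_handler_blocking(int signo, pqsigfunc func, const sigset_t *blockset)
{
	struct sigaction act,
				oact;

	act.sa_handler = func;
	act.sa_mask = *blockset;	/* held blocked while func runs */
	act.sa_flags = 0;			/* no SA_RESTART, as in pqsignal_no_restart */
#ifdef SA_NOCLDSTOP
	if (signo == SIGCHLD)
		act.sa_flags |= SA_NOCLDSTOP;
#endif
	if (sigaction(signo, &act, &oact) < 0)
		return SIG_ERR;
	return oact.sa_handler;
}

Because the signal mask is saved and restored by the kernel as part of handler entry and exit, there is no window at the end of the handler where a pending signal can push another frame onto the stack.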
Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes: > On 10/11/19 11:45 AM, Tom Lane wrote: >> FWIW, I'm not excited about that as a permanent solution. It requires >> root privilege, and it affects the whole machine not only the buildfarm, >> and making it persist across reboots is even more invasive. > OK, but I'm not keen to have to tussle with coredumpctl. Right now our > logic says: for every core file in the data directory try to get a > backtrace. Use of systemd-coredump means that gets blown out of the > water, and we no longer even have a simple test to see if our program > caused a core dump. I haven't played that much with this software, but it seems you can do "coredumpctl list <path-to-executable>" to find out what it has for a particular executable. You would likely need a time-based filter too (to avoid regurgitating previous runs' failures), but that seems do-able. regards, tom lane
On Sat, Oct 12, 2019 at 7:56 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > This matches up with the intermittent infinite_recurse failures > we've been seeing in the buildfarm. Those are happening across > a range of systems, but they're (almost) all Linux-based ppc64, > suggesting that there's a longstanding arch-specific kernel bug > involved. For reference, I scraped the attached list of such > failures in the last three months. I wonder whether we can get > the attention of any kernel hackers about that. Yeah, I don't know anything about this stuff, but I was also beginning to wonder if something is busted in the arch-specific fault.c code that checks if stack expansion is valid[1], in a way that fails with a rapidly growing stack, well timed incoming signals, and perhaps Docker/LXC (that's on Mark's systems IIUC, not sure about the ARM boxes that failed or if it could be relevant here). Perhaps the arbitrary tolerances mentioned in that comment are relevant. [1] https://github.com/torvalds/linux/blob/master/arch/powerpc/mm/fault.c#L244
Thomas Munro <thomas.munro@gmail.com> writes: > Yeah, I don't know anything about this stuff, but I was also beginning > to wonder if something is busted in the arch-specific fault.c code > that checks if stack expansion is valid[1], in a way that fails with a > rapidly growing stack, well timed incoming signals, and perhaps > Docker/LXC (that's on Mark's systems IIUC, not sure about the ARM > boxes that failed or if it could be relevant here). Perhaps the > arbitrary tolerances mentioned in that comment are relevant. > [1] https://github.com/torvalds/linux/blob/master/arch/powerpc/mm/fault.c#L244 Hm, the bit about "we'll allow up to 1MB unconditionally" sure seems to match up with the observations here. I also wonder about the arbitrary definition of "a long way" as 2KB. Could it be that that misbehaves in the face of a userland function with more than 2KB of local variables? It's not very clear how those things would lead to an intermittent failure though. In the case of the postmaster crashes, we now see that timing of signal receipts is relevant. For infinite_recurse, maybe it only fails if an sinval interrupt happens at the wrong time? (This theory would predict that commit 798070ec0 made the problem way more prevalent than it had been ... need to go see if the buildfarm history supports that.) regards, tom lane
On Sat, Oct 12, 2019 at 08:41:12AM +1300, Thomas Munro wrote:
> On Sat, Oct 12, 2019 at 7:56 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > This matches up with the intermittent infinite_recurse failures
> > we've been seeing in the buildfarm. Those are happening across
> > a range of systems, but they're (almost) all Linux-based ppc64,
> > suggesting that there's a longstanding arch-specific kernel bug
> > involved. For reference, I scraped the attached list of such
> > failures in the last three months. I wonder whether we can get
> > the attention of any kernel hackers about that.
>
> Yeah, I don't know anything about this stuff, but I was also beginning
> to wonder if something is busted in the arch-specific fault.c code
> that checks if stack expansion is valid[1], in a way that fails with a
> rapidly growing stack, well timed incoming signals, and perhaps
> Docker/LXC (that's on Mark's systems IIUC, not sure about the ARM
> boxes that failed or if it could be relevant here). Perhaps the
> arbitrary tolerances mentioned in that comment are relevant.

This specific one (wobbegong) is OpenStack/KVM[2], for what it's worth...

"... cluster is an OpenStack based cluster offering POWER8 & POWER9 LE instances running on KVM ..."

But to keep you on your toes, some of my ppc animals are Docker within
another OpenStack/KVM instance...

Regards,
Mark

[1] https://github.com/torvalds/linux/blob/master/arch/powerpc/mm/fault.c#L244
[2] https://osuosl.org/services/powerdev/

--
Mark Wong
2ndQuadrant - PostgreSQL Solutions for the Enterprise
https://www.2ndQuadrant.com/
Hi,

On 2019-10-11 14:56:41 -0400, Tom Lane wrote:
> I still don't have a good explanation for why this only seems to
> happen in the pg_upgrade test sequence. However, I did notice
> something very interesting: the postmaster crashes after consuming
> only about 1MB of stack space. This is despite the prevailing
> setting of "ulimit -s" being 8192 (8MB). I also confirmed that
> the value of max_stack_depth within the crashed process is 2048,
> which implies that get_stack_depth_rlimit got some value larger
> than 2MB from getrlimit(RLIMIT_STACK). And yet, here we have
> a crash, and the process memory map confirms that only 1MB was
> allocated in the stack region. So it's really hard to explain
> that as anything except a kernel bug: sometimes, the kernel
> doesn't give us as much stack as it promised it would. And the
> machine is not loaded enough for there to be any rational
> resource-exhaustion excuse for that.

Linux expands stack space only on demand, thus it's possible to run out
of stack space while there ought to be stack space. Unfortunately that
happens during a stack expansion, which means there's no easy place to
report it. I've seen this be hit in production on busy machines.

I wonder if the machine is configured with overcommit_memory=2,
i.e. don't overcommit. cat /proc/sys/vm/overcommit_memory would tell.

What does
grep -E '^(Mem|Commit)' /proc/meminfo
show while it's happening?

What does the signal information say? You can see it with
p $_siginfo
after receiving the signal. A SIGSEGV here, I assume. IIRC si_code and
si_errno should indicate whether ENOMEM is the reason.

> This matches up with the intermittent infinite_recurse failures
> we've been seeing in the buildfarm. Those are happening across
> a range of systems, but they're (almost) all Linux-based ppc64,
> suggesting that there's a longstanding arch-specific kernel bug
> involved. For reference, I scraped the attached list of such
> failures in the last three months. I wonder whether we can get
> the attention of any kernel hackers about that.

Most of them are operated by Mark, right? So it could also just be high
memory pressure on those.

> Anyway, as to what to do about it --- it occurred to me to wonder
> why we are relying on having the signal handlers block and unblock
> signals manually, when we could tell sigaction() that we'd like
> signals blocked. It is reasonable to expect that the signal support
> is designed to not recursively consume stack space in the face of
> a series of signals, while the way we are doing it clearly opens
> us up to recursive space consumption. The stack trace I showed
> before proves that the recursion happens at the points where the
> signal handlers unblock signals.

Yea, that seems like it might be good. But we have to be careful too, as
there are some things we do want to be interruptible from within a signal
handler. We start some processes from within one after all...

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes: > On 2019-10-11 14:56:41 -0400, Tom Lane wrote: >> ... So it's really hard to explain >> that as anything except a kernel bug: sometimes, the kernel >> doesn't give us as much stack as it promised it would. And the >> machine is not loaded enough for there to be any rational >> resource-exhaustion excuse for that. > Linux expands stack space only on demand, thus it's possible to run out > of stack space while there ought to be stack space. Unfortunately that > during a stack expansion, which means there's no easy place to report > that. I've seen this be hit in production on busy machines. As I said, this machine doesn't seem busy enough for that to be a tenable excuse; there's nobody but me logged in, and the buildfarm critter isn't running. > I wonder if the machine is configured with overcommit_memory=2, > i.e. don't overcommit. cat /proc/sys/vm/overcommit_memory would tell. $ cat /proc/sys/vm/overcommit_memory 0 > What does grep -E '^(Mem|Commit)' /proc/meminfo show while it's > happening? idle: $ grep -E '^(Mem|Commit)' /proc/meminfo MemTotal: 2074816 kB MemFree: 36864 kB MemAvailable: 1779584 kB CommitLimit: 1037376 kB Committed_AS: 412480 kB a few captures while regression tests are running: $ grep -E '^(Mem|Commit)' /proc/meminfo MemTotal: 2074816 kB MemFree: 8512 kB MemAvailable: 1819264 kB CommitLimit: 1037376 kB Committed_AS: 371904 kB $ grep -E '^(Mem|Commit)' /proc/meminfo MemTotal: 2074816 kB MemFree: 32640 kB MemAvailable: 1753792 kB CommitLimit: 1037376 kB Committed_AS: 585984 kB $ grep -E '^(Mem|Commit)' /proc/meminfo MemTotal: 2074816 kB MemFree: 56640 kB MemAvailable: 1695744 kB CommitLimit: 1037376 kB Committed_AS: 568768 kB > What does the signal information say? You can see it with > p $_siginfo > after receiving the signal. A SIGSEGV here, I assume. (gdb) p $_siginfo $1 = {si_signo = 11, si_errno = 0, si_code = 128, _sifields = {_pad = {0 <repeats 28 times>}, _kill = {si_pid = 0, si_uid= 0}, _timer = {si_tid = 0, si_overrun = 0, si_sigval = {sival_int = 0, sival_ptr = 0x0}}, _rt = {si_pid = 0, si_uid = 0, si_sigval= { sival_int = 0, sival_ptr = 0x0}}, _sigchld = {si_pid = 0, si_uid = 0, si_status = 0, si_utime = 0, si_stime = 0},_sigfault = { si_addr = 0x0}, _sigpoll = {si_band = 0, si_fd = 0}}} > Yea, that seems like it might be good. But we have to be careful too, as > there's some thing were do want to be interruptable from within a signal > handler. We start some processes from within one after all... The proposed patch has zero effect on what the signal mask will be inside a signal handler, only on the transient state during handler entry/exit. regards, tom lane
On Sat, Oct 12, 2019 at 9:40 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2019-10-11 14:56:41 -0400, Tom Lane wrote:
> >> ... So it's really hard to explain
> >> that as anything except a kernel bug: sometimes, the kernel
> >> doesn't give us as much stack as it promised it would. And the
> >> machine is not loaded enough for there to be any rational
> >> resource-exhaustion excuse for that.
>
> > Linux expands stack space only on demand, thus it's possible to run out
> > of stack space while there ought to be stack space. Unfortunately that
> > happens during a stack expansion, which means there's no easy place to
> > report it. I've seen this be hit in production on busy machines.
>
> As I said, this machine doesn't seem busy enough for that to be a
> tenable excuse; there's nobody but me logged in, and the buildfarm
> critter isn't running.

Yeah. As I speculated in the other thread[1], the straightforward
can't-allocate-any-more-space-but-no-other-way-to-tell-you-that case,
i.e., the explanation that doesn't involve a bug in Linux or PostgreSQL,
seems unlikely unless we also see other more obvious signs of occasional
overcommit problems (i.e. not during stack expansion) on those hosts,
doesn't it? How likely is it that this 1-2MB of stack space is the straw
that breaks the camel's back, every time?

[1] https://www.postgresql.org/message-id/CA%2BhUKGJ_MkqdEH-LmmebhNLSFeyWwvYVXfPaz3A2_p27EQfZwA%40mail.gmail.com
I wrote: > It's not very clear how those things would lead to an intermittent > failure though. In the case of the postmaster crashes, we now see > that timing of signal receipts is relevant. For infinite_recurse, > maybe it only fails if an sinval interrupt happens at the wrong time? > (This theory would predict that commit 798070ec0 made the problem > way more prevalent than it had been ... need to go see if the > buildfarm history supports that.) That seems to fit, roughly: commit 798070ec0 moved errors.sql to execute as part of a parallel group on 2019-04-11, and the first failure of the infinite_recurse test happened on 2019-04-27. Since then we've averaged about one such failure every four days, which makes a sixteen-day gap a little more than you'd expect, but not a huge amount more. Anyway, I do not see any other commits in between that would plausibly have affected this. In other news, I reproduced the problem with gcc on wobbegong's host, and confirmed that the gcc build uses less stack space: one recursive cycle of reaper() and sigusr1_handler() consumes 14768 bytes with clang, but just 9296 bytes with gcc. So the evident difference in failure rate between wobbegong and vulpes is nicely explained by that. Still no theory about pg_upgrade versus vanilla "make check" though. I did manage to make it happen during "make check" by dint of reducing the "ulimit -s" setting, so it's *possible* for it to happen there, it just doesn't. Weird. regards, tom lane
I've now also been able to reproduce the "infinite_recurse" segfault on wobbegong's host (or, since I was using a gcc build, I guess I should say vulpes' host). The first-order result is that it's the same problem with the kernel not giving us as much stack space as we expect: there's only 1179648 bytes in the stack segment in the core dump, though we should certainly have been allowed at least 8MB. The next interesting thing is that looking closely at the identified spot of the SIGSEGV, there's nothing there that should be touching the stack at all: (gdb) x/4i $pc => 0x10201df0 <core_yylex+1072>: ld r9,0(r30) 0x10201df4 <core_yylex+1076>: ld r8,128(r30) 0x10201df8 <core_yylex+1080>: ld r10,152(r30) 0x10201dfc <core_yylex+1084>: ld r9,0(r9) (r30 is not pointing at the stack, but at a valid heap location.) This code is the start of the switch case at scan.l:1064, so the most recent successfully-executed instructions were the switch jump, and they don't involve the stack either. The reported sp, (gdb) i reg sp sp 0x7fffe6940890 140737061849232 is a good 2192 bytes above the bottom of the allocated stack space, which is 0x7fffe6940000 according to gdb. So we really ought to have plenty of margin here. What's going on? What I suspect, given the difficulty of reproducing this, is that what really happened is that the kernel tried to deliver a SIGUSR1 signal to us just at this point. The kernel source code that Thomas pointed to comments that * The kernel signal delivery code writes up to about 1.5kB * below the stack pointer (r1) before decrementing it. There's more than 1.5kB available below sp, but what if that comment is a lie? In particular, I'm wondering if that number dates to PPC32 and needs to be doubled, or nearly so, to describe PPC64 reality. If that were the case, then the signal code would not have been able to fit its requirement, and would probably have come here to ask for more stack space, and the hard-wired 2048 test a little further down would have decided that that was a wild stack access. In short, my current belief is that Linux PPC64 fails when trying to deliver a signal if there's right around 2KB of stack remaining, even though it should be able to expand the stack and press on. It may well be that the reason is just that this heuristic in bad_stack_expansion() is out of date. Or there might be a similarly bogus value somewhere in the signal-delivery code. regards, tom lane
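For a rough, illustrative cross-check of how much user stack a single signal delivery needs on a given platform, one can print the sizes of the structures the kernel has to deposit there (this little program is not from the thread and only gives a lower bound; the kernel-built frame also includes padding, the red zone, and arch-specific save areas that these userspace typedefs may not fully cover):

#include <signal.h>
#include <stdio.h>
#include <ucontext.h>

int
main(void)
{
	/* Rough lower bound on what rt signal delivery deposits on the
	 * user stack before the handler gets control. */
	printf("sizeof(siginfo_t)  = %zu\n", sizeof(siginfo_t));
	printf("sizeof(ucontext_t) = %zu\n", sizeof(ucontext_t));
	return 0;
}

On ppc64 the saved register state, including the vector registers, makes the frame considerably larger than on x86-64, which fits the suspicion that the kernel comment's "about 1.5kB" understates reality there.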
I wrote:
> In short, my current belief is that Linux PPC64 fails when trying
> to deliver a signal if there's right around 2KB of stack remaining,
> even though it should be able to expand the stack and press on.

I figured I should try to remove some variables from the equation by
demonstrating this claim without involving Postgres. The attached test
program eats some stack space and then waits to be sent SIGUSR1.
For some values of "some stack space", it dumps core:

[tgl@postgresql-fedora ~]$ gcc -g -Wall -O1 stacktest.c
[tgl@postgresql-fedora ~]$ ./a.out 1240000 &
[1] 11796
[tgl@postgresql-fedora ~]$ kill -USR1 11796
[tgl@postgresql-fedora ~]$ signal delivered, stack base 0x7fffdc160000 top 0x7fffdc031420 (1240032 used)
[1]+ Done ./a.out 1240000
[tgl@postgresql-fedora ~]$ ./a.out 1242000 &
[1] 11797
[tgl@postgresql-fedora ~]$ kill -USR1 11797
[tgl@postgresql-fedora ~]$
[1]+ Segmentation fault (core dumped) ./a.out 1242000
[tgl@postgresql-fedora ~]$ uname -a
Linux postgresql-fedora.novalocal 4.18.19-100.fc27.ppc64le #1 SMP Wed Nov 14 21:53:32 UTC 2018 ppc64le ppc64le ppc64le GNU/Linux

I don't think any further proof is required that this is
a kernel bug. Where would be a good place to file it?

regards, tom lane

#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <unistd.h>

static char *stack_base_ptr;
static char *stack_top_ptr;

static volatile sig_atomic_t sig_occurred = 0;

static void
sigusr1_handler(int signal_arg)
{
	sig_occurred = 1;
}

static void
consume_stack(long stackdepth)
{
	char		stack_cur;

	if ((stack_base_ptr - &stack_cur) < stackdepth)
		consume_stack(stackdepth);
	else
	{
		stack_top_ptr = &stack_cur;
		while (!sig_occurred)
			;
	}
}

int
main(int argc, char **argv)
{
	long		stackdepth = atol(argv[1]);
	char		stack_base;
	struct sigaction act;

	act.sa_handler = sigusr1_handler;
	sigemptyset(&act.sa_mask);
	act.sa_flags = 0;
	if (sigaction(SIGUSR1, &act, NULL) < 0)
		perror("sigaction");

	stack_base_ptr = (char *) (((size_t) &stack_base + 65535) & ~65535UL);

	consume_stack(stackdepth);

	if (sig_occurred)
		printf("signal delivered, stack base %p top %p (%zu used)\n",
			   stack_base_ptr, stack_top_ptr,
			   stack_base_ptr - stack_top_ptr);
	else
		printf("no signal delivered\n");

	return 0;
}
On Sun, Oct 13, 2019 at 1:06 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > I don't think any further proof is required that this is > a kernel bug. Where would be a good place to file it? linuxppc-dev@lists.ozlabs.org might be the right place. https://lists.ozlabs.org/listinfo/linuxppc-dev
Hi, On 2019-10-13 13:44:59 +1300, Thomas Munro wrote: > On Sun, Oct 13, 2019 at 1:06 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > I don't think any further proof is required that this is > > a kernel bug. Where would be a good place to file it? > > linuxppc-dev@lists.ozlabs.org might be the right place. > > https://lists.ozlabs.org/listinfo/linuxppc-dev Probably requires reproducing on a pretty recent kernel first, to have a decent chance of being investigated... Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > Probably requires reproducing on a pretty recent kernel first, to have a > decent chance of being investigated... How recent do you think it needs to be? The machine I was testing on yesterday is under a year old: uname -m = ppc64le uname -r = 4.18.19-100.fc27.ppc64le uname -s = Linux uname -v = #1 SMP Wed Nov 14 21:53:32 UTC 2018 The latest-by-version-number ppc64 kernel I can find in the buildfarm is bonito, uname -m = ppc64le uname -r = 4.19.15-300.fc29.ppc64le uname -s = Linux uname -v = #1 SMP Mon Jan 14 16:21:04 UTC 2019 and that's certainly shown it too. regards, tom lane
Hi, On 2019-10-13 10:29:45 -0400, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > Probably requires reproducing on a pretty recent kernel first, to have a > > decent chance of being investigated... > > How recent do you think it needs to be? The machine I was testing on > yesterday is under a year old: > > uname -r = 4.18.19-100.fc27.ppc64le > ... > uname -r = 4.19.15-300.fc29.ppc64le My experience reporting kernel bugs is that the latest released version, or even just the tip of the git tree, is your best bet :/. And that reports using distro kernels - with all their out of tree changes - are also less likely to receive a response. IIRC there's a fedora repo with upstream kernels. Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > On 2019-10-13 10:29:45 -0400, Tom Lane wrote: >> How recent do you think it needs to be? > My experience reporting kernel bugs is that the latest released version, > or even just the tip of the git tree, is your best bet :/. Considering that we're going to point them at chapter and verse in Torvald's own repo, I do not think they can claim that we're worried about obsolete code. regards, tom lane
Filed at https://bugzilla.kernel.org/show_bug.cgi?id=205183 We'll see what happens ... regards, tom lane
I wrote: > Filed at > https://bugzilla.kernel.org/show_bug.cgi?id=205183 > We'll see what happens ... Further to this --- I went back and looked at the outlier events where we saw an infinite_recurse failure on a non-Linux-PPC64 platform. There were only three: mereswine | ARMv7 | Linux debian-armhf | Clarence Ho | REL_11_STABLE | 2019-08-11 02:10:12 | InstallCheck-C | 2019-08-11 02:36:10.159 PDT [5004:4] DETAIL: Failed process was running: select infinite_recurse(); mereswine | ARMv7 | Linux debian-armhf | Clarence Ho | REL_12_STABLE | 2019-08-11 09:52:46 | pg_upgradeCheck| 2019-08-11 04:21:16.756 PDT [6804:5] DETAIL: Failed process was running: select infinite_recurse(); mereswine | ARMv7 | Linux debian-armhf | Clarence Ho | HEAD | 2019-08-11 11:29:27 | pg_upgradeCheck| 2019-08-11 07:15:28.454 PDT [9954:76] DETAIL: Failed process was running: select infinite_recurse(); Looking closer at these, though, they were *not* SIGSEGV failures, but SIGKILLs. Seeing that they were all on the same machine on the same day, I'm thinking we can write them off as a transiently misconfigured OOM killer. So, pending some other theory emerging from the kernel hackers, we're down to it's-a-PPC64-kernel-bug. That leaves me wondering what if anything we want to do about it. Even if it's fixed reasonably promptly in Linux upstream, and then we successfully nag assorted vendors to incorporate the fix quickly, that's still going to leave us with frequent buildfarm failures on Mark's flotilla of not-the-very-shiniest Linux versions. Should we move the infinite_recurse test to happen alone in a parallel group just to stop these failures? That's annoying from a parallelism standpoint, but I don't see any other way to avoid these failures. regards, tom lane
I want to give some conclusion to our occurrence of this, which now I think
was neither an instance of nor indicative of any bug.

Summary: postgres was being killed with kill -9 by a deployment script, after
it "timed out". Thus no log messages.

I initially experienced this while testing migration of a customer's DB using
an ansible script, which did "async: 2222, poll: 0". Which I guess ansible
takes to mean "let the process run for 2222 seconds, after which send it
SIGKILL". I probably made multiple attempts to migrate (for example to fix
tablespaces or dependencies on postgis), and the process was started during
one test and never stopped nor restarted during following tests, until it
finally hit 2222 seconds and was stopped ungracefully.

This just happened again, during a different migration, so I reproduced it
like:

ansible --sudo --sudo-user pryzbyj 'database.customer.*' --background=1 --poll=1 -m command -a '/usr/pgsql-12/bin/postmaster -D /home/pryzbyj/pg12.dat -c port=5678 -c unix-socket-directories=/tmp -c client_min_messages=debug -c log_temp_files=0 -c log_lock_waits=1'

Connect to the customer and verify it was killed uncleanly:

[pryzbyj@database ~]$ /usr/pgsql-12/bin/postmaster -D ./pg12.dat -c port=5678 -c unix-socket-directories=/tmp -c client_min_messages=debug -c log_temp_files=0 -c log_lock_waits=1 -c logging_collector=off
2019-10-22 17:57:58.251 EDT [5895] FATAL: lock file "postmaster.pid" already exists
2019-10-22 17:57:58.251 EDT [5895] HINT: Is another postmaster (PID 5608) running in data directory "/home/pryzbyj/./pg12.dat"?

[pryzbyj@database ~]$ /usr/pgsql-12/bin/postmaster -D ./pg12.dat -c port=5678 -c unix-socket-directories=/tmp -c client_min_messages=debug -c log_temp_files=0 -c log_lock_waits=1 -c logging_collector=off
2019-10-22 17:58:01.312 EDT [5962] LOG: starting PostgreSQL 12.0 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39), 64-bit
2019-10-22 17:58:01.312 EDT [5962] LOG: listening on IPv6 address "::1", port 5678
2019-10-22 17:58:01.312 EDT [5962] LOG: listening on IPv4 address "127.0.0.1", port 5678
2019-10-22 17:58:01.328 EDT [5962] LOG: listening on Unix socket "/tmp/.s.PGSQL.5678"
2019-10-22 17:58:01.353 EDT [5963] LOG: database system was interrupted; last known up at 2019-10-22 17:57:48 EDT
2019-10-22 17:58:01.460 EDT [5963] LOG: database system was not properly shut down; automatic recovery in progress
2019-10-22 17:58:01.478 EDT [5963] LOG: invalid record length at 6/E829D128: wanted 24, got 0
2019-10-22 17:58:01.478 EDT [5963] LOG: redo is not required
2019-10-22 17:58:01.526 EDT [5962] LOG: database system is ready to accept connections

On Tue, Jul 23, 2019 at 11:27:03AM -0500, Justin Pryzby wrote:
> Does anyone have a stress test for parallel workers ?
>
> On a customer's new VM, I got this several times while (trying to) migrate their DB:
>
> < 2019-07-23 10:33:51.552 CDT postgres >FATAL: postmaster exited during a parallel transaction
> < 2019-07-23 10:33:51.552 CDT postgres >STATEMENT: CREATE UNIQUE INDEX unused0_huawei_umts_nodeb_locell_201907_unique_idx ON child.unused0_huawei_umts_nodeb_locell_201907 USING btree ...
>
> There's nothing in dmesg nor in postgres logs.
>
> At first I thought it's maybe because of a network disconnection, then I
> thought it's because we ran out of space (wal), then they had a faulty blade.
> After that, I'd tried and failed to reproduce it a number of times, but it's
> just recurred during what was intended to be their final restore. I've set
> max_parallel_workers_per_gather=0, but I'm planning to try to diagnose an issue
> in another instance. Ideally a minimal test, since I'm apparently going to
> have to run under gdb to see how it's dying, or even what process is failing.
>
> DMI: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/21/2015
> CentOS release 6.9 (Final)
> Linux alextelsasrv01 2.6.32-754.17.1.el6.x86_64 #1 SMP Tue Jul 2 12:42:48 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
> version | PostgreSQL 11.4 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-23), 64-bit
On Tue, Oct 15, 2019 at 4:50 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Filed at > > https://bugzilla.kernel.org/show_bug.cgi?id=205183 For the curious-and-not-subscribed, there's now a kernel patch proposed for this. We guessed pretty close, but the problem wasn't those dodgy looking magic numbers, it was that the bad stack expansion check only allows for user space to expand the stack (FAULT_FLAG_USER), and here the kernel itself wants to build a stack frame.
On Wed, Dec 11, 2019 at 3:22 PM Thomas Munro <thomas.munro@gmail.com> wrote: > On Tue, Oct 15, 2019 at 4:50 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > Filed at > > > https://bugzilla.kernel.org/show_bug.cgi?id=205183 > > For the curious-and-not-subscribed, there's now a kernel patch > proposed for this. We guessed pretty close, but the problem wasn't > those dodgy looking magic numbers, it was that the bad stack expansion > check only allows for user space to expand the stack > (FAULT_FLAG_USER), and here the kernel itself wants to build a stack > frame. Hehe, the dodgy looking magic numbers *were* wrong: - * The kernel signal delivery code writes up to about 1.5kB + * The kernel signal delivery code writes a bit over 4KB https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20200724092528.1578671-2-mpe@ellerman.id.au/
Thomas Munro <thomas.munro@gmail.com> writes: > Hehe, the dodgy looking magic numbers *were* wrong: > - * The kernel signal delivery code writes up to about 1.5kB > + * The kernel signal delivery code writes a bit over 4KB > https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20200724092528.1578671-2-mpe@ellerman.id.au/ ... and, having seemingly not learned a thing, they just replaced them with new magic numbers. Mumble sizeof() mumble. Anyway, I guess the interesting question for us is how long it will take for this fix to propagate into real-world systems. I don't have much of a clue about the Linux kernel workflow, anybody want to venture a guess? regards, tom lane
On Tue, Jul 28, 2020 at 3:27 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Anyway, I guess the interesting question for us is how long it > will take for this fix to propagate into real-world systems. > I don't have much of a clue about the Linux kernel workflow, > anybody want to venture a guess? Me neither. It just hit Torvalds' tree[1] marked "Cc: stable@vger.kernel.org # v2.6.27+". I looked at the time for a couple of other PowerPC-related commits of similar complexity involving some of the same names to get from there to a Debian stable kernel package and it seemed to be under a couple of months. [1] https://github.com/torvalds/linux/commit/63dee5df43a31f3844efabc58972f0a206ca4534
Thomas Munro <thomas.munro@gmail.com> writes: > On Tue, Jul 28, 2020 at 3:27 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Anyway, I guess the interesting question for us is how long it >> will take for this fix to propagate into real-world systems. >> I don't have much of a clue about the Linux kernel workflow, >> anybody want to venture a guess? > Me neither. It just hit Torvalds' tree[1] marked "Cc: > stable@vger.kernel.org # v2.6.27+". I looked at the time for a couple > of other PowerPC-related commits of similar complexity involving some > of the same names to get from there to a Debian stable kernel package > and it seemed to be under a couple of months. > [1] https://github.com/torvalds/linux/commit/63dee5df43a31f3844efabc58972f0a206ca4534 For our archives' sake: today I got seemingly-automated mail informing me that this patch has been merged into the 4.19-stable, 5.4-stable, 5.7-stable, and 5.8-stable kernel branches; but not 4.4-stable, 4.9-stable, or 4.14-stable, because it failed to apply. regards, tom lane
I wrote: > For our archives' sake: today I got seemingly-automated mail informing me > that this patch has been merged into the 4.19-stable, 5.4-stable, > 5.7-stable, and 5.8-stable kernel branches; but not 4.4-stable, > 4.9-stable, or 4.14-stable, because it failed to apply. And this morning's mail brought news that the latter three branches are now patched as well. So I guess at this point it's down to platform vendors as to whether or how fast they absorb such changes. It might help for us to file platform-specific bug reports asking for the change to be merged. regards, tom lane
On Tue, Aug 25, 2020 at 1:43 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > I wrote: > > For our archives' sake: today I got seemingly-automated mail informing me > > that this patch has been merged into the 4.19-stable, 5.4-stable, > > 5.7-stable, and 5.8-stable kernel branches; but not 4.4-stable, > > 4.9-stable, or 4.14-stable, because it failed to apply. > > And this morning's mail brought news that the latter three branches > are now patched as well. So I guess at this point it's down to > platform vendors as to whether or how fast they absorb such changes. Today I upgraded a Debian buster box and saw a new kernel image roll in. Lo and behold: $ zgrep 'stack expansion' /usr/share/doc/linux-image-4.19.0-11-amd64/changelog.gz - [powerpc*] Allow 4224 bytes of stack expansion for the signal frame