Re: stress test for parallel workers - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: stress test for parallel workers
Date
Msg-id CA+hUKGLch1bNWdG-G8YaeJbyVsper6hG86Ugx9tSWG3=a1R89Q@mail.gmail.com
In response to Re: stress test for parallel workers  (Justin Pryzby <pryzby@telsasoft.com>)
Responses Re: stress test for parallel workers  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: stress test for parallel workers  (Thomas Munro <thomas.munro@gmail.com>)
Re: stress test for parallel workers  (Justin Pryzby <pryzby@telsasoft.com>)
List pgsql-hackers
On Wed, Jul 24, 2019 at 5:42 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> #2  0x000000000085ddff in errfinish (dummy=<value optimized out>) at elog.c:555
>         edata = <value optimized out>
>         elevel = 22
>         oldcontext = 0x27e15d0
>         econtext = 0x0
>         __func__ = "errfinish"
> #3  0x00000000006f7e94 in CheckPointReplicationOrigin () at origin.c:588
>         save_errno = <value optimized out>
>         tmppath = 0x9c4518 "pg_logical/replorigin_checkpoint.tmp"
>         path = 0x9c4300 "pg_logical/replorigin_checkpoint"
>         tmpfd = 64
>         i = <value optimized out>
>         magic = 307747550
>         crc = 4294967295
>         __func__ = "CheckPointReplicationOrigin"

> Supposedly it's trying to do this:
>
> |       ereport(PANIC,
> |                       (errcode_for_file_access(),
> |                        errmsg("could not write to file \"%s\": %m",
> |                                       tmppath)));
>
> And since there's consistently nothing in logs, I'm guessing there's a
> legitimate write error (legitimate from PG perspective).  Storage here is ext4
> plus zfs tablespace on top of LVM on top of vmware thin volume.

If you have that core, it might be interesting to go to frame 2 and
print *edata or edata->saved_errno.  If the errno is EIO, it's a bit
strange if that's not showing up in some form in kernel logs or dmesg
or something; if it's ENOSPC I guess it'd be normal that it doesn't
show up anywhere and there is nothing in the PostgreSQL logs if
they're on the same full filesystem, but then you would probably
already have mentioned that your filesystem was out of space.  Could
it have been fleetingly full due to some other thing happening on the
system that rapidly expands and contracts?
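
To make the errno reasoning above concrete, here is a minimal sketch in C.
The helper name where_to_look() and the strings it returns are hypothetical
and are not part of PostgreSQL; only the EIO and ENOSPC cases come from the
discussion. It maps a saved errno to the place where corroborating evidence
would most plausibly turn up.

```c
#include <errno.h>
#include <string.h>

/*
 * Hypothetical helper (not PostgreSQL code): given the errno saved from a
 * failed write(), say where supporting evidence is most likely to be found.
 */
static const char *
where_to_look(int saved_errno)
{
    switch (saved_errno)
    {
        case EIO:
            /* A genuine I/O error normally leaves a trace in kernel logs. */
            return "kernel log / dmesg";
        case ENOSPC:
            /* A full filesystem is silent and can also swallow server logs. */
            return "filesystem free space history";
        default:
            /* Anything else: start from the system's own description. */
            return strerror(saved_errno);
    }
}
```

With the value printed from edata->saved_errno in the core, calling
where_to_look(saved_errno) is just a compact way of stating the two cases
above; any other value falls through to its strerror() text.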

I'm confused by the evidence, though.  If this PANIC is the origin of
the problem, how do we get to a postmaster-death based exit in a
parallel leader*, rather than quickdie() (the kind of exit that
happens when the postmaster sends SIGQUIT to every process and they
say "terminating connection because of crash of another server
process", because some backend crashed or panicked)?  Perhaps it would
be clearer what's going on if you could put the PostgreSQL log onto a
different filesystem, so we get a better chance of collecting
evidence?  But then... the parallel leader process was apparently able
to log something -- maybe it was just lucky, but you said this
happened this way more than once.  I'm wondering how it could be that
you got some kind of IO failure and weren't able to log the PANIC
message AND your postmaster was killed, and you were able to log a
message about that.  Perhaps we're looking at evidence from two
unrelated failures.

*I suspect that the only thing implicating parallelism in this failure
is that parallel leaders happen to print out that message if the
postmaster dies while they are waiting for workers; most other places
(probably every other backend in your cluster) just quietly exit.
That tells us something about what's happening, but on its own doesn't
tell us that parallelism plays an important role in the failure mode.
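
For readers unfamiliar with how a child process can notice that its parent
has died, here is a minimal sketch in C of one common technique: the parent
keeps the write end of a pipe open and never writes to it, and children
probe the read end, where EOF means every writer is gone. The function
names make_nonblocking() and parent_is_alive() are hypothetical; this
illustrates the general idea only and is not the PostgreSQL implementation.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <unistd.h>

/*
 * Put the read end of the "life sign" pipe into non-blocking mode so that
 * probing it never stalls.  Returns 0 on success, -1 on failure.
 */
static int
make_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);

    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/*
 * Hypothetical liveness probe: returns true while some process still holds
 * the write end of the pipe open.
 */
static bool
parent_is_alive(int watch_fd)
{
    char    c;
    ssize_t rc = read(watch_fd, &c, 1);

    if (rc == 0)
        return false;       /* EOF: all writers have exited */
    if (rc > 0)
        return true;        /* unexpected byte, but a writer exists */
    /* No data yet (or interrupted): the write end is still open. */
    return errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR;
}
```

A process that checks such a probe inside a wait loop can report the
parent's death when it notices it, while a process that never checks while
waiting simply exits some other way. That difference is enough to explain
why only some processes log a message.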


--
Thomas Munro
https://enterprisedb.com


