Re: stress test for parallel workers - Mailing list pgsql-hackers
From | Thomas Munro
---|---
Subject | Re: stress test for parallel workers
Date |
Msg-id | CA+hUKGLch1bNWdG-G8YaeJbyVsper6hG86Ugx9tSWG3=a1R89Q@mail.gmail.com
In response to | Re: stress test for parallel workers (Justin Pryzby <pryzby@telsasoft.com>)
Responses | Re: stress test for parallel workers
List | pgsql-hackers
On Wed, Jul 24, 2019 at 5:42 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> #2 0x000000000085ddff in errfinish (dummy=<value optimized out>) at elog.c:555
>         edata = <value optimized out>
>         elevel = 22
>         oldcontext = 0x27e15d0
>         econtext = 0x0
>         __func__ = "errfinish"
> #3 0x00000000006f7e94 in CheckPointReplicationOrigin () at origin.c:588
>         save_errno = <value optimized out>
>         tmppath = 0x9c4518 "pg_logical/replorigin_checkpoint.tmp"
>         path = 0x9c4300 "pg_logical/replorigin_checkpoint"
>         tmpfd = 64
>         i = <value optimized out>
>         magic = 307747550
>         crc = 4294967295
>         __func__ = "CheckPointReplicationOrigin"
>
> Supposedly it's trying to do this:
>
> |        ereport(PANIC,
> |                (errcode_for_file_access(),
> |                 errmsg("could not write to file \"%s\": %m",
> |                        tmppath)));
>
> And since there's consistently nothing in logs, I'm guessing there's a
> legitimate write error (legitimate from PG perspective).  Storage here is
> ext4 plus zfs tablespace on top of LVM on top of vmware thin volume.

If you have that core, it might be interesting to go to frame 2 and print
*edata or edata->saved_errno (see the gdb sketch at the end of this mail).
If the errno is EIO, it's a bit strange that it isn't showing up in some
form in the kernel logs, dmesg or somewhere similar; if it's ENOSPC, I
guess it'd be normal for it not to show up anywhere, and there'd be nothing
in the PostgreSQL logs if they're on the same full filesystem, but then you
would probably already have mentioned that your filesystem was out of
space.  Could it have been fleetingly full due to some other thing
happening on the system that rapidly expands and contracts?

I'm confused by the evidence, though.  If this PANIC is the origin of the
problem, how do we get to a postmaster-death-based exit in a parallel
leader*, rather than quickdie() (the kind of exit that happens when the
postmaster sends SIGQUIT to every process and they all say "terminating
connection because of crash of another server process", because some
backend crashed or panicked)?  Perhaps it would be clearer what's going on
if you could put the PostgreSQL log onto a different filesystem, so we get
a better chance of collecting evidence.

But then... the parallel leader process was apparently able to log
something -- maybe it was just lucky, but you said this happened this way
more than once.  I'm wondering how you could hit some kind of I/O failure,
be unable to log the PANIC message, AND have your postmaster killed, and
yet still be able to log a message about that.  Perhaps we're looking at
evidence from two unrelated failures.

*I suspect that the only thing implicating parallelism in this failure is
that parallel leaders happen to print out that message if the postmaster
dies while they are waiting for workers (see the second sketch below);
most other places (probably every other backend in your cluster) just
quietly exit.  That tells us something about what's happening, but on its
own it doesn't tell us that parallelism plays an important role in the
failure mode.
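To illustrate the frame 2 suggestion above, here's a minimal gdb sketch
(the binary and core file paths are hypothetical, and note that the
backtrace showed edata as <value optimized out> in that frame, so these
may or may not print anything useful):

    $ gdb /usr/pgsql-11/bin/postgres /path/to/core
    (gdb) frame 2
    (gdb) print *edata
    (gdb) print edata->saved_errno

saved_errno is the errno that errstart() captured when the ereport was
raised, so it should tell us whether we're looking at EIO, ENOSPC or
something else entirely.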
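As for the second sketch: the leader's "postmaster exited during a
parallel transaction" exit comes from roughly this pattern in parallel.c
(paraphrased from memory, so treat it as a sketch of the mechanism rather
than the exact code):

    /*
     * The leader waits for each worker to shut down, and learns about
     * postmaster death from the background worker machinery rather than
     * from the usual SIGQUIT-driven quickdie() path.
     */
    status = WaitForBackgroundWorkerShutdown(pcxt->worker[i].bgwhandle);
    if (status == BGWH_POSTMASTER_DIED)
        ereport(FATAL,
                (errcode(ERRCODE_ADMIN_SHUTDOWN),
                 errmsg("postmaster exited during a parallel transaction")));

--
Thomas Munro
https://enterprisedb.com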