Thread: shm_mq_set_sender() crash
Latest from lorikeet: http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2016-08-26%2008%3A37%3A27 TRAP: FailedAssertion("!(vmq->mq_sender == ((void *)0))", File: "/home/andrew/bf64/root/REL9_6_STABLE/pgsql.build/../pgsql/src/backend/storage/ipc/shm_mq.c",Line: 220) Thoughts? regards, tom lane
On Fri, Aug 26, 2016 at 6:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Latest from lorikeet: > http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2016-08-26%2008%3A37%3A27 > > TRAP: FailedAssertion("!(vmq->mq_sender == ((void *)0))", File: "/home/andrew/bf64/root/REL9_6_STABLE/pgsql.build/../pgsql/src/backend/storage/ipc/shm_mq.c",Line: 220) > Do you think, it is due to some recent change or we are just seeing now as it could be timing specific issue? So here what seems to be happening is that during worker startup, we are trying to set the sender for a shared memory queue and the same is already set. Now, one theoretical possibility of the same could be that the two workers get the same ParallelWorkerNumber which is then used to access the shm queue (refer ParallelWorkerMain/ExecParallelGetReceiver). We are setting the ParallelWorkerNumber in below code which seems to be doing what it is suppose to do: LaunchParallelWorkers() { .. for (i = 0; i < pcxt->nworkers; ++i) { memcpy(worker.bgw_extra, &i, sizeof(int)); if (!any_registrations_failed && RegisterDynamicBackgroundWorker(&worker, &pcxt->worker[i].bgwhandle)) .. } Can some reordering impact the above code? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Sat, Aug 27, 2016 at 3:43 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Fri, Aug 26, 2016 at 6:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Latest from lorikeet: >> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2016-08-26%2008%3A37%3A27 >> >> TRAP: FailedAssertion("!(vmq->mq_sender == ((void *)0))", File: "/home/andrew/bf64/root/REL9_6_STABLE/pgsql.build/../pgsql/src/backend/storage/ipc/shm_mq.c",Line: 220) >> > > Do you think, it is due to some recent change or we are just seeing > now as it could be timing specific issue? > > So here what seems to be happening is that during worker startup, we > are trying to set the sender for a shared memory queue and the same is > already set. Now, one theoretical possibility of the same could be > that the two workers get the same ParallelWorkerNumber which is then > used to access the shm queue (refer > ParallelWorkerMain/ExecParallelGetReceiver). We are setting the > ParallelWorkerNumber in below code which seems to be doing what it is > suppose to do: > > LaunchParallelWorkers() > { > .. > for (i = 0; i < pcxt->nworkers; ++i) > { > memcpy(worker.bgw_extra, &i, sizeof(int)); > if (!any_registrations_failed && > RegisterDynamicBackgroundWorker(&worker, > &pcxt->worker[i].bgwhandle)) > .. > } > > Can some reordering impact the above code? I don't think so. Your guess that ParallelWorkerNumber is getting messed up somehow seems like a good one, but I don't see anything wrong with that code. There's actually a pretty long chain here. That code copies the value of the local variable i into worker.bgw_extra. Then, RegisterDynamicBackgroundWorker copies the whole structure into shared memory. Then, running inside the postmaster, BackgroundWorkerStateChange copies it into the postmaster address space. But, since this is Windows, that copy doesn't actually passed to the worker; instead, BackgroundWorkerEntry() copies the data from shared memory into the new worker processes' MyBgworkerEntry. Then BackgroundWorkerMain() copies the data from there to ParallelWorkerNumber. In theory any of those places could be going wrong somehow, though none of them can be completely busted because they all work at least most of the time. Of course, it's also possible that the ParallelWorkerNumber code is entirely correct and something overwrote the null bytes that were supposed to be found at that location. It would be very useful to see (a) the value of ParallelWorkerNumber and (b) the contents of vmq->mq_sender, and in particular whether that's actually a valid pointer to a PGPROC in the ProcArray. But unless we can reproduce this I don't see how to manage that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > Of course, it's also possible that the ParallelWorkerNumber code is > entirely correct and something overwrote the null bytes that were > supposed to be found at that location. It would be very useful to see > (a) the value of ParallelWorkerNumber and (b) the contents of > vmq->mq_sender, and in particular whether that's actually a valid > pointer to a PGPROC in the ProcArray. But unless we can reproduce > this I don't see how to manage that. Is it worth replacing that Assert with a test-and-elog that would print those values? Given that we've seen only one such instance in the buildfarm, this might've been just a cosmic ray bit-flip. So one part of me says not to worry too much until we see it again. OTOH, if it is real but rare, missing an opportunity to diagnose would be bad. regards, tom lane
On Thu, Sep 15, 2016 at 5:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> Of course, it's also possible that the ParallelWorkerNumber code is >> entirely correct and something overwrote the null bytes that were >> supposed to be found at that location. It would be very useful to see >> (a) the value of ParallelWorkerNumber and (b) the contents of >> vmq->mq_sender, and in particular whether that's actually a valid >> pointer to a PGPROC in the ProcArray. But unless we can reproduce >> this I don't see how to manage that. > > Is it worth replacing that Assert with a test-and-elog that would > print those values? > > Given that we've seen only one such instance in the buildfarm, this > might've been just a cosmic ray bit-flip. So one part of me says > not to worry too much until we see it again. OTOH, if it is real > but rare, missing an opportunity to diagnose would be bad. I wonder if we could persuade somebody to run pgbench on a Windows machine with a similar environment, at least some concurrency, and force_parallel_mode=on. Assuming this is a generic initialize-the-parallel-stuff bug and not something specific to a particular query, that might well trigger it a lot quicker than waiting for it to recur in the BF. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Sep 15, 2016 at 06:21:30PM -0400, Robert Haas wrote: > On Thu, Sep 15, 2016 at 5:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Robert Haas <robertmhaas@gmail.com> writes: > >> Of course, it's also possible that the ParallelWorkerNumber code is > >> entirely correct and something overwrote the null bytes that were > >> supposed to be found at that location. It would be very useful to see > >> (a) the value of ParallelWorkerNumber and (b) the contents of > >> vmq->mq_sender, and in particular whether that's actually a valid > >> pointer to a PGPROC in the ProcArray. But unless we can reproduce > >> this I don't see how to manage that. > > > > Is it worth replacing that Assert with a test-and-elog that would > > print those values? > > > > Given that we've seen only one such instance in the buildfarm, this > > might've been just a cosmic ray bit-flip. So one part of me says > > not to worry too much until we see it again. OTOH, if it is real > > but rare, missing an opportunity to diagnose would be bad. > > I wonder if we could persuade somebody to run pgbench on a Windows > machine with a similar environment, at least some concurrency, and > force_parallel_mode=on. Assuming this is a generic > initialize-the-parallel-stuff bug and not something specific to a > particular query, that might well trigger it a lot quicker than > waiting for it to recur in the BF. For the sake of this thread in the archives, the cause was almost surely a Cygwin signal handling bug: https://postgr.es/m/20170803034740.GA2641942@rfd.leadboat.com https://marc.info/?t=150183296400001 https://marc.info/?t=150291861700011 -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers