On Sat, Aug 27, 2016 at 3:43 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, Aug 26, 2016 at 6:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Latest from lorikeet:
>> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2016-08-26%2008%3A37%3A27
>>
>> TRAP: FailedAssertion("!(vmq->mq_sender == ((void *)0))", File:
>> "/home/andrew/bf64/root/REL9_6_STABLE/pgsql.build/../pgsql/src/backend/storage/ipc/shm_mq.c", Line: 220)
>>
>
> Do you think it is due to some recent change, or are we just seeing
> it now because it could be a timing-specific issue?
>
> So here what seems to be happening is that during worker startup, we
> are trying to set the sender for a shared memory queue and the same is
> already set. Now, one theoretical possibility of the same could be
> that the two workers get the same ParallelWorkerNumber which is then
> used to access the shm queue (refer
> ParallelWorkerMain/ExecParallelGetReceiver). We are setting the
> ParallelWorkerNumber in the code below, which seems to be doing what
> it is supposed to do:
>
> LaunchParallelWorkers()
> {
> ..
> for (i = 0; i < pcxt->nworkers; ++i)
> {
> memcpy(worker.bgw_extra, &i, sizeof(int));
> if (!any_registrations_failed &&
> RegisterDynamicBackgroundWorker(&worker,
> &pcxt->worker[i].bgwhandle))
> ..
> }
>
> Can some reordering impact the above code?

I don't think so. Your guess that ParallelWorkerNumber is getting
messed up somehow seems like a good one, but I don't see anything
wrong with that code. There's actually a pretty long chain here.
That code copies the value of the local variable i into
worker.bgw_extra. Then, RegisterDynamicBackgroundWorker copies the
whole structure into shared memory. Then, running inside the
postmaster, BackgroundWorkerStateChange copies it into the postmaster
address space. But, since this is Windows, that copy doesn't actually
get passed to the worker; instead, BackgroundWorkerEntry() copies the
data from shared memory into the new worker process's MyBgworkerEntry.
Then ParallelWorkerMain() copies the data from there to
ParallelWorkerNumber. In theory any of those places could be going
wrong somehow, though none of them can be completely busted because
they all work at least most of the time.
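
To make that chain concrete, here is a rough standalone sketch of the
hops the worker number takes. It is not the PostgreSQL code: DemoWorker,
DEMO_EXTRALEN and the demo_* functions are hypothetical stand-ins, and
plain structs stand in for shared memory and for MyBgworkerEntry.

```c
#include <assert.h>
#include <string.h>

/* Arbitrary size for the demo; an assumption, not taken from the source. */
#define DEMO_EXTRALEN 128

/* Hypothetical stand-in for BackgroundWorker; only the field we care about. */
typedef struct DemoWorker
{
    char bgw_extra[DEMO_EXTRALEN];
} DemoWorker;

/* Hop 1: the launching backend stores the loop index in bgw_extra. */
void
demo_set_worker_number(DemoWorker *worker, int i)
{
    memset(worker->bgw_extra, 0, DEMO_EXTRALEN);
    memcpy(worker->bgw_extra, &i, sizeof(int));
}

/* Middle hops: each one is a whole-structure copy. */
void
demo_copy_worker(DemoWorker *dst, const DemoWorker *src)
{
    memcpy(dst, src, sizeof(DemoWorker));
}

/* Last hop: the worker reads the number back out of its own entry. */
int
demo_get_worker_number(const DemoWorker *entry)
{
    int n;

    memcpy(&n, entry->bgw_extra, sizeof(int));
    return n;
}

/* Run the whole chain for one index; returns what the worker would see. */
int
demo_round_trip(int i)
{
    DemoWorker local;
    DemoWorker shared_slot;     /* stands in for the shared memory slot */
    DemoWorker worker_entry;    /* stands in for the worker's own copy */

    assert(sizeof(int) <= DEMO_EXTRALEN);

    demo_set_worker_number(&local, i);
    demo_copy_worker(&shared_slot, &local);         /* registration */
    demo_copy_worker(&worker_entry, &shared_slot);  /* worker startup */
    return demo_get_worker_number(&worker_entry);
}
```

If every hop behaves, demo_round_trip(i) hands back i unchanged, so two
workers launched with different indexes could only end up with the same
number if one of the copies, or the memory in between, went wrong.
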

Of course, it's also possible that the ParallelWorkerNumber code is
entirely correct and something overwrote the null bytes that were
supposed to be found at that location. It would be very useful to see
(a) the value of ParallelWorkerNumber and (b) the contents of
vmq->mq_sender, and in particular whether that's actually a valid
pointer to a PGPROC in the ProcArray. But unless we can reproduce
this I don't see how to manage that.

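
For what it's worth, the "is this a valid pointer into the array" test
I have in mind would look something like the sketch below. DemoProc and
demo_points_into_array are hypothetical names made up for illustration;
the real check would have to run against the actual PGPROC array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for PGPROC; the layout is made up for the demo. */
typedef struct DemoProc
{
    int  pid;
    char pad[60];
} DemoProc;

/*
 * Return true if ptr is the address of one of the nelems elements that
 * start at base: non-NULL, inside the array's bounds, and aligned on an
 * element boundary.
 */
bool
demo_points_into_array(const void *ptr, const DemoProc *base, size_t nelems)
{
    uintptr_t p = (uintptr_t) ptr;
    uintptr_t start = (uintptr_t) base;
    uintptr_t end = start + nelems * sizeof(DemoProc);

    assert(base != NULL);

    if (ptr == NULL || p < start || p >= end)
        return false;

    /* In range; make sure it does not point into the middle of an element. */
    return (p - start) % sizeof(DemoProc) == 0;
}
```

A value that fails this test would point toward stray bytes having been
written over the field, while one that passes would look more like a
second worker having legitimately attached to the same queue.
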
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company