Thread: shm_mq_set_sender() crash

shm_mq_set_sender() crash

From
Tom Lane
Date:
Latest from lorikeet:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2016-08-26%2008%3A37%3A27

TRAP: FailedAssertion("!(vmq->mq_sender == ((void *)0))", File:
"/home/andrew/bf64/root/REL9_6_STABLE/pgsql.build/../pgsql/src/backend/storage/ipc/shm_mq.c",Line: 220)
 

Thoughts?
        regards, tom lane



Re: shm_mq_set_sender() crash

From
Amit Kapila
Date:
On Fri, Aug 26, 2016 at 6:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Latest from lorikeet:
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2016-08-26%2008%3A37%3A27
>
> TRAP: FailedAssertion("!(vmq->mq_sender == ((void *)0))", File:
"/home/andrew/bf64/root/REL9_6_STABLE/pgsql.build/../pgsql/src/backend/storage/ipc/shm_mq.c",Line: 220)
 
>

Do you think, it is due to some recent change or we are just seeing
now as it could be timing specific issue?

So here what seems to be happening is that during worker startup, we
are trying to set the sender for a shared memory queue and the same is
already set.  Now, one theoretical possibility of the same could be
that the two workers get the same ParallelWorkerNumber which is then
used to access the shm queue (refer
ParallelWorkerMain/ExecParallelGetReceiver).  We are setting the
ParallelWorkerNumber in below code which seems to be doing what it is
suppose to do:

LaunchParallelWorkers()
{
..
for (i = 0; i < pcxt->nworkers; ++i)
{
memcpy(worker.bgw_extra, &i, sizeof(int));
if (!any_registrations_failed &&
RegisterDynamicBackgroundWorker(&worker,
&pcxt->worker[i].bgwhandle))
..
}

Can some reordering impact the above code?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: shm_mq_set_sender() crash

From
Robert Haas
Date:
On Sat, Aug 27, 2016 at 3:43 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, Aug 26, 2016 at 6:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Latest from lorikeet:
>> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2016-08-26%2008%3A37%3A27
>>
>> TRAP: FailedAssertion("!(vmq->mq_sender == ((void *)0))", File:
"/home/andrew/bf64/root/REL9_6_STABLE/pgsql.build/../pgsql/src/backend/storage/ipc/shm_mq.c",Line: 220)
 
>>
>
> Do you think, it is due to some recent change or we are just seeing
> now as it could be timing specific issue?
>
> So here what seems to be happening is that during worker startup, we
> are trying to set the sender for a shared memory queue and the same is
> already set.  Now, one theoretical possibility of the same could be
> that the two workers get the same ParallelWorkerNumber which is then
> used to access the shm queue (refer
> ParallelWorkerMain/ExecParallelGetReceiver).  We are setting the
> ParallelWorkerNumber in below code which seems to be doing what it is
> suppose to do:
>
> LaunchParallelWorkers()
> {
> ..
> for (i = 0; i < pcxt->nworkers; ++i)
> {
> memcpy(worker.bgw_extra, &i, sizeof(int));
> if (!any_registrations_failed &&
> RegisterDynamicBackgroundWorker(&worker,
> &pcxt->worker[i].bgwhandle))
> ..
> }
>
> Can some reordering impact the above code?

I don't think so.  Your guess that ParallelWorkerNumber is getting
messed up somehow seems like a good one, but I don't see anything
wrong with that code.  There's actually a pretty long chain here.
That code copies the value of the local variable i into
worker.bgw_extra.  Then, RegisterDynamicBackgroundWorker copies the
whole structure into shared memory.  Then, running inside the
postmaster, BackgroundWorkerStateChange copies it into the postmaster
address space.  But, since this is Windows, that copy doesn't actually
passed to the worker; instead, BackgroundWorkerEntry() copies the data
from shared memory into the new worker processes' MyBgworkerEntry.
Then BackgroundWorkerMain() copies the data from there to
ParallelWorkerNumber.  In theory any of those places could be going
wrong somehow, though none of them can be completely busted because
they all work at least most of the time.

Of course, it's also possible that the ParallelWorkerNumber code is
entirely correct and something overwrote the null bytes that were
supposed to be found at that location.  It would be very useful to see
(a) the value of ParallelWorkerNumber and (b) the contents of
vmq->mq_sender, and in particular whether that's actually a valid
pointer to a PGPROC in the ProcArray.  But unless we can reproduce
this I don't see how to manage that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: shm_mq_set_sender() crash

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> Of course, it's also possible that the ParallelWorkerNumber code is
> entirely correct and something overwrote the null bytes that were
> supposed to be found at that location.  It would be very useful to see
> (a) the value of ParallelWorkerNumber and (b) the contents of
> vmq->mq_sender, and in particular whether that's actually a valid
> pointer to a PGPROC in the ProcArray.  But unless we can reproduce
> this I don't see how to manage that.

Is it worth replacing that Assert with a test-and-elog that would
print those values?

Given that we've seen only one such instance in the buildfarm, this
might've been just a cosmic ray bit-flip.  So one part of me says
not to worry too much until we see it again.  OTOH, if it is real
but rare, missing an opportunity to diagnose would be bad.
        regards, tom lane



Re: shm_mq_set_sender() crash

From
Robert Haas
Date:
On Thu, Sep 15, 2016 at 5:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> Of course, it's also possible that the ParallelWorkerNumber code is
>> entirely correct and something overwrote the null bytes that were
>> supposed to be found at that location.  It would be very useful to see
>> (a) the value of ParallelWorkerNumber and (b) the contents of
>> vmq->mq_sender, and in particular whether that's actually a valid
>> pointer to a PGPROC in the ProcArray.  But unless we can reproduce
>> this I don't see how to manage that.
>
> Is it worth replacing that Assert with a test-and-elog that would
> print those values?
>
> Given that we've seen only one such instance in the buildfarm, this
> might've been just a cosmic ray bit-flip.  So one part of me says
> not to worry too much until we see it again.  OTOH, if it is real
> but rare, missing an opportunity to diagnose would be bad.

I wonder if we could persuade somebody to run pgbench on a Windows
machine with a similar environment, at least some concurrency, and
force_parallel_mode=on.  Assuming this is a generic
initialize-the-parallel-stuff bug and not something specific to a
particular query, that might well trigger it a lot quicker than
waiting for it to recur in the BF.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] shm_mq_set_sender() crash

From
Noah Misch
Date:
On Thu, Sep 15, 2016 at 06:21:30PM -0400, Robert Haas wrote:
> On Thu, Sep 15, 2016 at 5:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Robert Haas <robertmhaas@gmail.com> writes:
> >> Of course, it's also possible that the ParallelWorkerNumber code is
> >> entirely correct and something overwrote the null bytes that were
> >> supposed to be found at that location.  It would be very useful to see
> >> (a) the value of ParallelWorkerNumber and (b) the contents of
> >> vmq->mq_sender, and in particular whether that's actually a valid
> >> pointer to a PGPROC in the ProcArray.  But unless we can reproduce
> >> this I don't see how to manage that.
> >
> > Is it worth replacing that Assert with a test-and-elog that would
> > print those values?
> >
> > Given that we've seen only one such instance in the buildfarm, this
> > might've been just a cosmic ray bit-flip.  So one part of me says
> > not to worry too much until we see it again.  OTOH, if it is real
> > but rare, missing an opportunity to diagnose would be bad.
> 
> I wonder if we could persuade somebody to run pgbench on a Windows
> machine with a similar environment, at least some concurrency, and
> force_parallel_mode=on.  Assuming this is a generic
> initialize-the-parallel-stuff bug and not something specific to a
> particular query, that might well trigger it a lot quicker than
> waiting for it to recur in the BF.

For the sake of this thread in the archives, the cause was almost surely a
Cygwin signal handling bug:
 https://postgr.es/m/20170803034740.GA2641942@rfd.leadboat.com https://marc.info/?t=150183296400001
https://marc.info/?t=150291861700011


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers