Re: DSM robustness failure (was Re: Peripatus/failures) - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: DSM robustness failure (was Re: Peripatus/failures)
Date
Msg-id CAA4eK1JrKXyhRVWJeUY1XcdCsNFZEfMPvbPUjTk+F6BN2uvuRw@mail.gmail.com
In response to Re: DSM robustness failure (was Re: Peripatus/failures)  (Thomas Munro <thomas.munro@enterprisedb.com>)
List pgsql-hackers
On Thu, Oct 18, 2018 at 2:33 PM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
>
> On Thu, Oct 18, 2018 at 5:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > The code below seems to be problematic:
> > dsm_cleanup_using_control_segment()
> > {
> > ..
> > if (!dsm_control_segment_sane(old_control, mapped_size))
> > {
> > dsm_impl_op(DSM_OP_DETACH, old_control_handle, 0, &impl_private,
> > &mapped_address, &mapped_size, LOG);
> > ..
> > }
> >
> > Here, don't we need to use the dsm_control_* variables instead of the
> > local mapped_* variables?
>
> I was a little fuzzy on when exactly
> dsm_cleanup_using_control_segment() and dsm_postmaster_shutdown() run,
> but after some more testing I think I have this straight now.  You can
> test by setting dsm_control->magic to 42 in a debugger and trying
> three cases:
>
> 1.  Happy shutdown: dsm_postmaster_shutdown() complains on shutdown.
> 2.  kill -9 a non-postmaster process: dsm_postmaster_shutdown()
> complains during auto-restart.
> 3.  kill -9 the postmaster, manually start up again:
> dsm_cleanup_using_control_segment() runs.  It ignores the old segment
> quietly if it doesn't pass the sanity test.
>
> So to answer your question: no, dsm_cleanup_using_control_segment() is
> case 3.  This entirely new postmaster process has never had the
> segment mapped in, so the dsm_control_* variables are not relevant
> here.
>
> Hmm.... but if you're running N other independent clusters on the same
> machine that started up after this cluster crashed in case 3, I think
> there is an N-in-four-billion chance that the segment with that ID now
> belongs to another cluster and happens to be its DSM control segment,
> and therefore passes the magic-number sanity test, and then we'll nuke
> it and all the segments it references.  Am I missing something?
>

Unless the previous cluster (which crashed) has removed the segment,
how will the new cluster succeed in getting the same segment?  Won't it
get EEXIST and retry with another ID?
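
To illustrate the behaviour I have in mind, here is a minimal sketch of
an exclusive create with retry.  It is not the actual dsm.c code: a
plain file opened with O_CREAT | O_EXCL stands in for a shared memory
segment, and segment_path(), try_create_segment() and
create_with_retry() are hypothetical names made up for this example.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Hypothetical helper, not PostgreSQL code: build the name of the
 * stand-in "segment" (a plain file) for a given handle.
 */
static void
segment_path(char *buf, size_t len, const char *dir, uint32_t handle)
{
    snprintf(buf, len, "%s/sketch_dsm.%u", dir, (unsigned) handle);
}

/*
 * Try to create the segment for exactly one handle.  O_EXCL makes the
 * call fail with errno == EEXIST if something with that name exists.
 */
static int
try_create_segment(const char *dir, uint32_t handle)
{
    char        path[4096];

    segment_path(path, sizeof(path), dir, handle);
    return open(path, O_RDWR | O_CREAT | O_EXCL, 0600);
}

/*
 * Keep trying handles until an exclusive create succeeds.  Returns the
 * file descriptor and stores the winning handle, or returns -1 on any
 * error other than EEXIST.
 */
static int
create_with_retry(const char *dir, uint32_t first_handle,
                  uint32_t *handle_out)
{
    uint32_t    handle = first_handle;

    assert(handle_out != NULL);
    for (;;)
    {
        int         fd = try_create_segment(dir, handle);

        if (fd >= 0)
        {
            *handle_out = handle;
            return fd;
        }
        if (errno != EEXIST)
            return -1;          /* a real failure, give up */
        handle = (uint32_t) rand(); /* name is taken, pick another */
    }
}
```

With this pattern a second creator can never end up holding a handle
whose name still exists.  As far as I remember, the segment creation
path in dsm.c loops over random handles in a similar way, but please
treat that as my recollection rather than a statement about the code.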

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

