Re: DSM robustness failure (was Re: Peripatus/failures) - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: DSM robustness failure (was Re: Peripatus/failures)
Msg-id CAEepm=2dyAcmZOUv8VsgWKiSRjjF1X0oRNecna94+nwTbyoGTQ@mail.gmail.com
In response to Re: DSM robustness failure (was Re: Peripatus/failures)  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On Thu, Oct 18, 2018 at 2:36 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Larry's REL_10_STABLE failure logs are interesting:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=peripatus&dt=2018-10-17%2020%3A42%3A17
>
> 2018-10-17 15:48:08.849 CDT [55240:7] LOG:  dynamic shared memory control segment is corrupt
> 2018-10-17 15:48:08.849 CDT [55240:8] LOG:  sem_destroy failed: Invalid argument
> 2018-10-17 15:48:08.850 CDT [55240:9] LOG:  sem_destroy failed: Invalid argument
> 2018-10-17 15:48:08.850 CDT [55240:10] LOG:  sem_destroy failed: Invalid argument
> 2018-10-17 15:48:08.850 CDT [55240:11] LOG:  sem_destroy failed: Invalid argument
> ... lots more ...
> 2018-10-17 15:48:08.862 CDT [55240:122] LOG:  sem_destroy failed: Invalid argument
> 2018-10-17 15:48:08.862 CDT [55240:123] LOG:  sem_destroy failed: Invalid argument
> TRAP: FailedAssertion("!(dsm_control_mapped_size == 0)", File: "dsm.c", Line: 182)
>
> So at least in this case, not only did we lose the DSM segment but also
> all of our semaphores.  Is it conceivable that Python somehow destroyed
> those objects, rather than stomping on the contents of the DSM segment?
> If not, how do we explain this log?

One idea:  In the backend I'm looking at, there is a contiguous run of
read/write mappings from the location of the semaphore array
through to the DSM control segment.  That means that a single runaway
loop/memcpy/memset etc. could overwrite both of those.  Eventually it
would run off the end of the contiguously mapped space and SEGV, and we do
indeed see a segfault from that Python code before the trouble begins.
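
To make that concrete, here is a toy model of the idea, not PostgreSQL code.
It assumes a single contiguous read/write region holding a semaphore array,
some unrelated data, and then a control segment header.  All names, sizes
and magic values (SEM_MAGIC, CTL_MAGIC, GAP, ...) are invented for
illustration, and an IndexError stands in for the SEGV at the end of the
mapped space.

```python
import struct

# Everything below is hypothetical: sizes and magic numbers are made up
# for the sketch and do not match PostgreSQL or any libc.
SEM_SIZE = 16            # assumed size of one semaphore slot
NUM_SEMS = 4             # assumed number of semaphores in the array
GAP = 32                 # other read/write data mapped in between
CTL_SIZE = 16            # assumed size of the control segment header
SEM_MAGIC = 0x5E3A0001   # invented marker for an initialized semaphore
CTL_MAGIC = 0xD5C00001   # invented marker for a sane control segment
CTL_OFFSET = NUM_SEMS * SEM_SIZE + GAP


def build_region():
    """Lay out semaphores, a gap, and the control header contiguously."""
    region = bytearray(CTL_OFFSET + CTL_SIZE)
    for i in range(NUM_SEMS):
        struct.pack_into("<I", region, i * SEM_SIZE, SEM_MAGIC)
    struct.pack_into("<I", region, CTL_OFFSET, CTL_MAGIC)
    return region


def sem_ok(region, i):
    """True if semaphore slot i still carries its magic marker."""
    return struct.unpack_from("<I", region, i * SEM_SIZE)[0] == SEM_MAGIC


def control_ok(region):
    """True if the control header still carries its magic marker."""
    return struct.unpack_from("<I", region, CTL_OFFSET)[0] == CTL_MAGIC


def runaway_memset(region, start, value=0):
    """Keep writing until we fall off the end of the mapped region.

    Returns the offset at which the write faulted.
    """
    offset = start
    try:
        while True:
            region[offset] = value  # IndexError models the eventual SEGV
            offset += 1
    except IndexError:
        return offset
```

In this model, one runaway write that starts at or before the semaphore
array leaves every sem_ok() check and the control_ok() check failing by the
time it faults, which is the same shape as the log above: a corrupt control
segment plus a long run of semaphore failures.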

> Also, why is there branch-specific variation?  The fact that v11 and HEAD
> aren't whinging about lost semaphores is not hard to understand --- we
> stopped using SysV semas.  But why don't the older branches look like v10
> here?

I think v10 is where we switched to POSIX unnamed semaphores (hence
sem_destroy()), so it's 10, 11 and master that should be the same in
this respect, no?

-- 
Thomas Munro
http://www.enterprisedb.com

