Re: DSM robustness failure (was Re: Peripatus/failures) - Mailing list pgsql-hackers

From Tom Lane
Subject Re: DSM robustness failure (was Re: Peripatus/failures)
Msg-id 23944.1539826604@sss.pgh.pa.us
In response to Re: DSM robustness failure (was Re: Peripatus/failures)  (Thomas Munro <thomas.munro@enterprisedb.com>)
Responses Re: DSM robustness failure (was Re: Peripatus/failures)  (Thomas Munro <thomas.munro@enterprisedb.com>)
List pgsql-hackers
Thomas Munro <thomas.munro@enterprisedb.com> writes:
> On Thu, Oct 18, 2018 at 1:10 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> ... However, I'm still slightly interested in how it
>> was that that broke DSM so thoroughly ...

> Me too.  Frustratingly, that vm object might still exist on Larry's
> machine if it hasn't been rebooted (since we failed to shm_unlink()
> it), so if we knew its name we could write a program to shm_open(),
> mmap(), dump out to a file for analysis and then we could work out
> which of the sanity tests it failed and maybe get some clues.
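For illustration, here is a minimal sketch of such a dump program. It assumes a POSIX system on which the leaked object can still be opened by name. The helper `dump_shm_object()` and any paths passed to it are hypothetical, and the object's real name is still unknown. PostgreSQL's POSIX DSM implementation is assumed here to use names of the form `/PostgreSQL.<handle>`, but the handle of the lost control segment would still have to be found. The object is mapped read-only and never shm_unlink()ed, so it survives for another look.

```c
#include <fcntl.h>     /* O_RDONLY */
#include <stdio.h>     /* fopen, fwrite, fclose, perror */
#include <sys/mman.h>  /* shm_open, mmap, munmap */
#include <sys/stat.h>  /* fstat */
#include <unistd.h>    /* close */

/*
 * Hypothetical helper: copy the whole POSIX shared memory object shm_name
 * into the file out_path.  Returns the number of bytes written, or -1 on
 * error.  It only reads the object and never calls shm_unlink().
 */
long
dump_shm_object(const char *shm_name, const char *out_path)
{
    int fd = shm_open(shm_name, O_RDONLY, 0);
    if (fd < 0)
    {
        perror("shm_open");
        return -1;
    }

    /* The object's current size tells us how much to map. */
    struct stat st;
    if (fstat(fd, &st) != 0)
    {
        perror("fstat");
        close(fd);
        return -1;
    }
    size_t size = (size_t) st.st_size;
    if (size == 0)
    {
        close(fd);
        return 0;           /* nothing to dump; mmap() rejects length 0 */
    }

    void *addr = mmap(NULL, size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);              /* the mapping remains valid without the fd */
    if (addr == MAP_FAILED)
    {
        perror("mmap");
        return -1;
    }

    FILE *out = fopen(out_path, "wb");
    if (out == NULL)
    {
        perror("fopen");
        munmap(addr, size);
        return -1;
    }
    size_t written = fwrite(addr, 1, size, out);
    int close_failed = (fclose(out) != 0);
    munmap(addr, size);

    return (written == size && !close_failed) ? (long) written : -1;
}
```

A small main() would pass the object name from argv[1]. On Linux with glibc older than 2.34, shm_open() also needs -lrt at link time.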

Larry's REL_10_STABLE failure logs are interesting:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=peripatus&dt=2018-10-17%2020%3A42%3A17

2018-10-17 15:48:08.849 CDT [55240:7] LOG:  dynamic shared memory control segment is corrupt
2018-10-17 15:48:08.849 CDT [55240:8] LOG:  sem_destroy failed: Invalid argument
2018-10-17 15:48:08.850 CDT [55240:9] LOG:  sem_destroy failed: Invalid argument
2018-10-17 15:48:08.850 CDT [55240:10] LOG:  sem_destroy failed: Invalid argument
2018-10-17 15:48:08.850 CDT [55240:11] LOG:  sem_destroy failed: Invalid argument
... lots more ...
2018-10-17 15:48:08.862 CDT [55240:122] LOG:  sem_destroy failed: Invalid argument
2018-10-17 15:48:08.862 CDT [55240:123] LOG:  sem_destroy failed: Invalid argument
TRAP: FailedAssertion("!(dsm_control_mapped_size == 0)", File: "dsm.c", Line: 182)
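The "control segment is corrupt" line indicates that the control segment failed a sanity check. The sketch below mirrors those checks as I understand them from dsm_control_segment_sane() in dsm.c, and it reports which check a dumped buffer fails first. Several details are assumptions: the magic value, the three-uint32 header layout, and the hypothetical `items_offset`/`item_size` parameters. The size and alignment of dsm_control_item differ between branches and platforms, so take those values from the build that wrote the segment.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* ASSUMPTION: magic value as defined in dsm.c; verify against the branch. */
#define ASSUMED_DSM_CONTROL_MAGIC 0x9a503d32u

/* ASSUMPTION: the control header starts with three uint32 fields. */
typedef struct
{
    uint32_t magic;
    uint32_t nitems;
    uint32_t maxitems;
} assumed_dsm_control_header;

/*
 * Hypothetical helper: return NULL if buf passes every check, else a string
 * naming the first check it fails, in the order dsm.c is assumed to test them.
 * items_offset: offset of the item array (offsetof(dsm_control_header, item)).
 * item_size: sizeof(dsm_control_item) on the machine that wrote the segment.
 */
const char *
first_failed_dsm_check(const unsigned char *buf, size_t mapped_size,
                       size_t items_offset, size_t item_size)
{
    assumed_dsm_control_header hdr;

    if (mapped_size < items_offset || mapped_size < sizeof hdr)
        return "mapped size too short to read header";

    /* Copy out rather than cast, so buf needs no particular alignment. */
    memcpy(&hdr, buf, sizeof hdr);

    if (hdr.magic != ASSUMED_DSM_CONTROL_MAGIC)
        return "magic number doesn't match";
    /* 64-bit arithmetic so a garbage maxitems cannot overflow the sum. */
    if ((uint64_t) items_offset + (uint64_t) item_size * hdr.maxitems > mapped_size)
        return "maxitems won't fit in mapped size";
    if (hdr.nitems > hdr.maxitems)
        return "nitems exceeds maxitems";
    return NULL;
}
```

To apply it, pass the bytes of a dumped segment as buf and the dump's length as mapped_size.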

So at least in this case, not only did we lose the DSM segment but also
all of our semaphores.  Is it conceivable that Python somehow destroyed
those objects, rather than stomping on the contents of the DSM segment?
If not, how do we explain this log?

Also, why is there branch-specific variation?  The fact that v11 and HEAD
aren't whinging about lost semaphores is not hard to understand --- we
stopped using POSIX semas there.  But why don't the older branches look
like v10 here?

            regards, tom lane

