DSM robustness failure (was Re: Peripatus/failures) - Mailing list pgsql-hackers

From Tom Lane
Subject DSM robustness failure (was Re: Peripatus/failures)
Msg-id 6153.1539806400@sss.pgh.pa.us
In response to Re: Peripatus/failures  (Larry Rosenman <ler@lerctr.org>)
Responses Re: DSM robustness failure (was Re: Peripatus/failures)  (Thomas Munro <thomas.munro@enterprisedb.com>)
List pgsql-hackers
Larry Rosenman <ler@lerctr.org> writes:
> That got it further, but still fails at PLCheck-C (at least on 9.3).
> It's still running the other branches.

Hmm.  I'm not sure why plpython is crashing for you, but this is exposing
a robustness problem in the DSM logic:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=peripatus&dt=2018-10-17%2018%3A22%3A50

The postmaster is suffering an Assert failure while trying to clean up
after the crash:

2018-10-17 13:43:23.203 CDT [51974:8] pg_regress LOG:  statement: SELECT import_succeed();
2018-10-17 13:43:24.228 CDT [46467:2] LOG:  server process (PID 51974) was terminated by signal 11: Segmentation fault
2018-10-17 13:43:24.228 CDT [46467:3] DETAIL:  Failed process was running: SELECT import_succeed();
2018-10-17 13:43:24.229 CDT [46467:4] LOG:  terminating any other active server processes
2018-10-17 13:43:24.229 CDT [46778:2] WARNING:  terminating connection because of crash of another server process
2018-10-17 13:43:24.229 CDT [46778:3] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2018-10-17 13:43:24.229 CDT [46778:4] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2018-10-17 13:43:24.235 CDT [46467:5] LOG:  all server processes terminated; reinitializing
2018-10-17 13:43:24.235 CDT [46467:6] LOG:  dynamic shared memory control segment is corrupt
TRAP: FailedAssertion("!(dsm_control_mapped_size == 0)", File: "dsm.c", Line: 181)

It looks to me like what's happening is

(1) crashing process corrupts the DSM control segment somehow.

(2) dsm_postmaster_shutdown notices that, bleats to the log, and
figures its job is done.

(3) dsm_postmaster_startup crashes on Assert because
dsm_control_mapped_size isn't 0, because the old seg is still mapped.

I would argue that both dsm_postmaster_shutdown and dsm_postmaster_startup
are broken here; the former because it makes no attempt to unmap
the old control segment (which it oughta be able to do no matter how badly
broken the contents are), and the latter because it should not let
garbage old state prevent it from establishing a valid new segment.
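
As a rough illustration of the sequence above, here is a toy model in C.
It is only a sketch, not the code in dsm.c: every toy_* name and the
TOY_MAGIC value are invented for the example, malloc/free stand in for
mapping and unmapping the control segment, and the strict startup returns
false where the real code would hit the Assert.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Arbitrary marker for a "sane" control segment (hypothetical value). */
#define TOY_MAGIC 0x544f5921u

typedef struct
{
    uint32_t magic;
} toy_control;

static toy_control *toy_mapped = NULL; /* models the mapped control segment */
static size_t toy_mapped_size = 0;     /* models dsm_control_mapped_size */

static bool
toy_control_is_sane(const toy_control *c)
{
    return c->magic == TOY_MAGIC;
}

/* Unmapping needs only the address and size, never the segment contents. */
static void
toy_unmap(void)
{
    free(toy_mapped);
    toy_mapped = NULL;
    toy_mapped_size = 0;
}

/* Simulates step (1): a crashing backend scribbling on the segment. */
static void
toy_corrupt(void)
{
    if (toy_mapped != NULL)
        toy_mapped->magic = 0;
}

/* Startup that insists on clean prior state; false models the Assert trap. */
static bool
toy_startup_strict(void)
{
    if (toy_mapped_size != 0)
        return false;
    toy_mapped = malloc(sizeof(*toy_mapped));
    if (toy_mapped == NULL)
        return false;
    toy_mapped->magic = TOY_MAGIC;
    toy_mapped_size = sizeof(*toy_mapped);
    return true;
}

/* Shutdown as in step (2): complain about corruption and stop there. */
static void
toy_shutdown_as_described(void)
{
    if (toy_mapped == NULL)
        return;
    if (!toy_control_is_sane(toy_mapped))
    {
        fprintf(stderr, "LOG:  toy control segment is corrupt\n");
        return;                 /* old segment stays mapped */
    }
    toy_unmap();
}

/* Shutdown that still complains, but unmaps no matter what it found. */
static void
toy_shutdown_robust(void)
{
    if (toy_mapped == NULL)
        return;
    if (!toy_control_is_sane(toy_mapped))
        fprintf(stderr, "LOG:  toy control segment is corrupt\n");
    toy_unmap();
}

/* Startup that discards stale state instead of tripping over it. */
static bool
toy_startup_robust(void)
{
    if (toy_mapped_size != 0)
        toy_unmap();
    return toy_startup_strict();
}
```

Calling toy_startup_strict(), toy_corrupt(), toy_shutdown_as_described()
and then toy_startup_strict() again walks through steps (1) to (3): the
second startup fails because the stale mapping was never released.
Swapping in either toy_shutdown_robust() or toy_startup_robust() is
enough to get past that point in this model.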

BTW, the header comment on dsm_postmaster_startup is a lie, which
is probably not unrelated to its failure to consider this situation.

            regards, tom lane

