Re: "could not reattach to shared memory" on buildfarm member dory - Mailing list pgsql-hackers

From Noah Misch
Subject Re: "could not reattach to shared memory" on buildfarm member dory
Date
Msg-id 20180819020007.GD3795674@rfd.leadboat.com
Whole thread Raw
In response to Re: "could not reattach to shared memory" on buildfarm member dory  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: "could not reattach to shared memory" on buildfarm member dory
List pgsql-hackers
On Tue, May 01, 2018 at 11:31:50AM -0400, Tom Lane wrote:
> Well, at this point the only thing that's entirely clear is that none
> of the ideas I had work.  I think we are going to be forced to pursue
> Noah's idea of doing an end-to-end retry.  Somebody else will need to
> take point on that; I lack a Windows environment and have already done
> a lot more blind patch-pushing than I like in this effort.

Having tried this, I find a choice between performance and complexity.  Both
of my designs use proc_exit(4) to indicate failure to reattach.  The simpler,
slower design has WIN32 internal_forkexec() block until the child reports (via
SetEvent()) that it reattached to shared memory.  This caused a fivefold
reduction in process creation performance[1].  The less-simple, faster design
stashes the Port structure and retry count in the BackendList entry, which
reaper() uses to retry the fork upon seeing status 4.  Notably, this requires
new code for regular backends, for bgworkers, and for others.  It's currently
showing a 30% performance _increase_ on the same benchmark; I can't explain
that increase and doubt it will last, but I think it's plausible for the
less-simple design to be performance-neutral.

I see these options:

1. Use the simpler design with a GUC, disabled by default, to control whether
   the new code is active.  Mention the GUC in a new errhint() for the "could
   not reattach to shared memory" error.

2. Like (1), but enable the GUC by default.

3. Like (1), but follow up with a patch to enable the GUC by default in v12
   only.

4. In addition to (1), enable retries if the GUC is set _or_ this postmaster
   has seen at least one child fail to reattach.

5. Use the less-simple design, with retries enabled unconditionally.

I think I prefer (3), with (1) being a close second.  My hesitation on (3) is
that parallel query has made startup time count even if you use a connection
pool, and all the Windows users not needing these retries will see parallel
query become that much slower.  I dislike (5) for its impact on
platform-independent postmaster code.  Other opinions?

I'm attaching a mostly-finished patch for the slower design.  I tested
correctness with -DREATTACH_FORCE_FAIL_PERCENT=99.  I'm also attaching a
proof-of-concept patch for the faster design.  In this proof of concept, the
postmaster does not close its copy of a backend socket until the backend
exits.  Also, bgworkers can change from BGWH_STARTED back to
BGWH_NOT_YET_STARTED; core code tolerates this, but external code may not.
Those would justify paying some performance to fix.  The proof of concept
handles bgworkers and regular backends, but it does not handle the startup
process, checkpointer, etc.  That doesn't affect benchmarking, of course.

nm

[1] This (2 forks per transaction) dropped from 139tps to 27tps:
    echo 'select 1' >script
    env PGOPTIONS="--default_transaction_isolation=repeatable\\ read --force_parallel_mode=on" pgbench -T15 -j30 -c30
--connect-n -fscript 

Attachment

pgsql-hackers by date:

Previous
From: "Jonathan S. Katz"
Date:
Subject: Re: Fix for REFRESH MATERIALIZED VIEW ownership error message
Next
From: "Jonathan S. Katz"
Date:
Subject: Re: Fix for REFRESH MATERIALIZED VIEW ownership error message