Home > mailing lists

Re: [HACKERS] strange parallel query behavior after OOM crashes - Mailing list pgsql-hackers

From	Robert Haas
Subject	Re: [HACKERS] strange parallel query behavior after OOM crashes
Date	April 10, 2017 21:16:24
Msg-id	CA+TgmoaUY6AdfMUWYH0TmfpsW5J3m=wd22OjziVbayFrYuRhKQ@mail.gmail.com Whole thread Raw
Responses	Re: [HACKERS] strange parallel query behavior after OOM crashes (Neha Khatri <nehakhatri5@gmail.com>)
List	pgsql-hackers

Tree view

[ Adding Julien, whose patch this was. ]

On Thu, Apr 6, 2017 at 5:34 AM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
> While performing StartupDatabase, both master and standby server
> behave in similar way till postmaster spawns startup process.
> In master, startup process completes its job and dies. As a result,
> reaper is called which in turn calls maybe_start_bgworker().
> In standby, after getting a valid snapshot, startup process sends
> postmaster a signal to enable connections. Signal handler in
> postmaster calls maybe_start_bgworker().
> In maybe_start_bgworker(), if we find a crashed bgworker(crashed_at !=
> 0) with a NEVER RESTART flag, we call ForgetBackgroundWorker().to
> forget the bgworker process.
>
> I've attached the patch for adding an argument in
> ForgetBackgroundWorker() to indicate a crashed situation. Based on
> that, we can take the necessary actions. I've not included the Assert
> statement in this patch.

This doesn't seem right, because this will potentially pass wasCrashed
= true even if there has been no crash-and-restart cycle.  The place
where you're passing "true" only knows that rw->rw_crashed_at != 0 &&
rw->rw_worker.bgw_restart_time == BGW_NEVER_RESTART, but that doesn't
prove much of anything.  Some *other* background worker could have
crashed, rather than the one at issue.  Even there's only one worker
involved, the test for whether to call HandleChildCrash() involves
other factors:

        if ((rw->rw_worker.bgw_flags & BGWORKER_SHMEM_ACCESS) != 0)
        {
            if (!EXIT_STATUS_0(exitstatus) && !EXIT_STATUS_1(exitstatus))
            {
                HandleChildCrash(pid, exitstatus, namebuf);
                return true;
            }
        }

        if (!ReleasePostmasterChildSlot(rw->rw_child_slot) &&
            (rw->rw_worker.bgw_flags & BGWORKER_SHMEM_ACCESS) != 0)
        {
            HandleChildCrash(pid, exitstatus, namebuf);
            return true;
        }

After some study, I think the solution here is as follows:

1. Forget BGW_NEVER_RESTART workers in
ResetBackgroundWorkerCrashTimes() rather than leaving them around to
be cleaned up after the conclusion of the restart, so that they go
away before rather than after shared memory is reset.

2. Don't allow BGWORKER_CLASS_PARALLEL workers to be anything other
than BGW_NEVER_RESTART, so that (1) suffices to guarantee that, after
recovering from a crash, 0 is the correct starting value for
parallel_register_count.  (Alternatively, we could iterate through the
worker list and compute the correct starting value for
parallel_register_count, but that seems silly since we only ever
expect BGW_NEVER_RESTART here anyway.)

The attached patch seems to fix the problem for me.  I'll commit it
tomorrow unless somebody sees a problem with it; if that happens, I'll
provide some other kind of update tomorrow.  [For clarity, for
official status update purposes, I am setting a next-update date of
tomorrow, 2017-04-11.]

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

forget-before-crash-v1.patch

pgsql-hackers by date:

From: Peter Eisentraut
Date: 10 April 2017, 21:14:03
Subject: [HACKERS] contrib/bloom wal-check not run by default

From: "Hans Buschmann"
Date: 10 April 2017, 21:21:39
Subject: [HACKERS] RMT: Use Visual Studio 2015 for Compiling and linking the Windows version in PG10

Re: [HACKERS] strange parallel query behavior after OOM crashes - Mailing list pgsql-hackers

Attachment

Previous

Next