Re: strange parallel query behavior after OOM crashes - Mailing list pgsql-hackers

From Kuntal Ghosh
Subject Re: strange parallel query behavior after OOM crashes
Date
Msg-id CAGz5QCL6h-cZS9v=yrbd3FZDDGpXdyMw4icgbx3eE6F2P_eOVA@mail.gmail.com
In response to Re: strange parallel query behavior after OOM crashes  (Thomas Munro <thomas.munro@enterprisedb.com>)
List pgsql-hackers
On Fri, Mar 31, 2017 at 12:32 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Fri, Mar 31, 2017 at 7:38 AM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> Hi,
>>
>> While doing some benchmarking, I've run into a fairly strange issue with OOM
>> breaking LaunchParallelWorkers() after the restart. What I see happening is
>> this:
>>
>> 1) a query is executed, and at the end of LaunchParallelWorkers we get
>>
>>     nworkers=8 nworkers_launched=8
>>
>> 2) the query does a Hash Aggregate, but ends up eating much more memory due
>> to n_distinct underestimate (see [1] from 2015 for details), and gets killed
>> by OOM
>>
>> 3) the server restarts, the query is executed again, but this time we get in
>> LaunchParallelWorkers
>>
>>     nworkers=8 nworkers_launched=0
>>
>> There's nothing else running on the server, and there definitely should be
>> free parallel workers.
>>
>> 4) The query gets killed again, and on the next execution we get
>>
>>     nworkers=8 nworkers_launched=8
>>
>> again, although not always. I wonder whether the exact impact depends on OOM
>> killing the leader or worker, for example.
>
> I don't know what's going on but I think I have seen this once or
> twice myself while hacking on test code that crashed.  I wonder if the
> DSM_CREATE_NULL_IF_MAXSEGMENTS case could be being triggered because
> the DSM control is somehow confused?
>
I think I've run into the same problem while working on parallelizing
plans containing InitPlans. You can reproduce that scenario with the
following steps:

1. Put an Assert(0) in ParallelQueryMain(), start the server, and execute
any parallel query. In LaunchParallelWorkers, you can see

    nworkers = n nworkers_launched = n (n > 0)

But all the workers will crash because of the assert statement.

2. The server restarts automatically and initializes
BackgroundWorkerData->parallel_register_count and
BackgroundWorkerData->parallel_terminate_count in shared memory.
After that, it calls ForgetBackgroundWorker, which increments
parallel_terminate_count. In LaunchParallelWorkers, we have the
following condition:

    if ((BackgroundWorkerData->parallel_register_count -
         BackgroundWorkerData->parallel_terminate_count) >=
        max_parallel_workers)
        DO NOT launch any parallel worker.

Hence, nworkers = n nworkers_launched = 0.
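
To illustrate why that condition can block every launch, here is a minimal
standalone sketch of the arithmetic. It is not PostgreSQL code. It assumes
the two counters are unsigned 32-bit integers, so that a terminate count
larger than the register count makes the difference wrap around to a huge
value. The struct WorkerCounters and both helper functions are hypothetical
names invented for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two counters kept in BackgroundWorkerData.
 * Assumption: both are unsigned 32-bit integers. */
typedef struct
{
    uint32_t parallel_register_count;
    uint32_t parallel_terminate_count;
} WorkerCounters;

/* Mirrors the check quoted above: true means "do not launch workers". */
static bool
parallel_limit_reached(const WorkerCounters *c, int max_parallel_workers)
{
    assert(max_parallel_workers >= 0);

    /* Unsigned subtraction wraps modulo 2^32 instead of going negative. */
    uint32_t active = (uint32_t) (c->parallel_register_count -
                                  c->parallel_terminate_count);

    return active >= (uint32_t) max_parallel_workers;
}

/* Simulates the state described in step 2: shared memory is reinitialized
 * (both counters zero), then the terminate count is bumped once for each
 * worker slot forgotten after the crash. */
static WorkerCounters
counters_after_crash_restart(uint32_t crashed_workers)
{
    WorkerCounters c = {0, 0};

    c.parallel_terminate_count += crashed_workers;
    return c;
}
```

With 8 crashed workers the difference is 0 - 8, which under the unsigned
assumption wraps to 4294967288, so the check is true for any realistic
max_parallel_workers and no worker is launched. That would match the
nworkers_launched = 0 observation.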

I thought the parallel workers were crashing because of my own stupid
mistake, so I assumed this was supposed to happen. Sorry for that.

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com


