Thread: Re: strange parallel query behavior after OOM crashes
On Fri, Mar 31, 2017 at 6:50 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Mar 30, 2017 at 4:35 PM, Kuntal Ghosh
> <kuntalghosh.2007@gmail.com> wrote:
>> 2. the server restarts automatically, initialize
>> BackgroundWorkerData->parallel_register_count and
>> BackgroundWorkerData->parallel_terminate_count in the shared memory.
>> After that, it calls ForgetBackgroundWorker and it increments
>> parallel_terminate_count.
>
> Hmm. So this seems like the root of the problem. Presumably those
> things need to be reset AFTER forgetting any background workers from
> before the crash.
>
IMHO, the fix would be not to increase the terminated parallel worker
count whenever ForgetBackgroundWorker is called due to a bgworker
crash. I've attached a patch for the same. PFA.

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com
Looking further in this context, the number of active parallel workers is:
parallel_register_count - parallel_terminate_count
Can active workers ever be greater than max_parallel_workers? I think not. Then why should there be a greater-than check in the following condition:
if (parallel && (BackgroundWorkerData->parallel_register_count -
BackgroundWorkerData->parallel_terminate_count) >= max_parallel_workers)
I feel there should be an assert if
(BackgroundWorkerData->parallel_register_count - BackgroundWorkerData->parallel_terminate_count) > max_parallel_workers
And the check could be
if (parallel && (active_parallel_workers == max_parallel_workers))
then do not register new parallel workers and return.
There should be no tolerance for the case when active_parallel_workers > max_parallel_workers. After all, that is the purpose of max_parallel_workers.
Or is it that multiple backends may try to register parallel workers at the same time, and that is why the greater-than check is present?
Thoughts?
Regards,
Neha
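The registration gate being discussed can be sketched as follows. This is illustrative only: the function name, parameters, and constants are simplified stand-ins for the logic in bgworker.c, not the actual code, and the real check would of course run under the appropriate lock.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical sketch of the check in RegisterDynamicBackgroundWorker
 * discussed above.  Refusing whenever active >= max_parallel_workers
 * (rather than only when active == max_parallel_workers) also covers
 * transient states in which "active" exceeds the limit, e.g. after
 * max_parallel_workers was lowered in one session while workers
 * started under a higher limit are still running.
 */
static bool
can_register_parallel_worker(unsigned int register_count,
                             unsigned int terminate_count,
                             unsigned int max_parallel_workers)
{
    unsigned int active = register_count - terminate_count;

    return active < max_parallel_workers;
}
```

For example, a backend that lowered max_parallel_workers to 3 while 4 workers are still active is simply refused a new worker rather than tripping an assert: can_register_parallel_worker(4, 0, 3) returns false.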
On Mon, Apr 3, 2017 at 6:08 AM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
> On Fri, Mar 31, 2017 at 6:50 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Thu, Mar 30, 2017 at 4:35 PM, Kuntal Ghosh
>> <kuntalghosh.2007@gmail.com> wrote:
>>> 2. the server restarts automatically, initialize
>>> BackgroundWorkerData->parallel_register_count and
>>> BackgroundWorkerData->parallel_terminate_count in the shared memory.
>>> After that, it calls ForgetBackgroundWorker and it increments
>>> parallel_terminate_count.
>>
>> Hmm. So this seems like the root of the problem. Presumably those
>> things need to be reset AFTER forgetting any background workers from
>> before the crash.
>>
> IMHO, the fix would be not to increase the terminated parallel worker
> count whenever ForgetBackgroundWorker is called due to a bgworker
> crash. I've attached a patch for the same. PFA.

While I'm not opposed to that approach, I don't think this is a good
way to implement it. If you want to pass an explicit flag to
ForgetBackgroundWorker telling it whether or not it should perform
the increment, fine. But with what you've got here, you're
essentially relying on "spooky action at a distance". It would be
easy for future code changes to break this, not realizing that
somebody's got a hard dependency on 0 having a specific meaning.

BTW, if this isn't on the open items list, it should be. It's
presumably the fault of b460f5d6693103076dc554aa7cbb96e1e53074f9.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 04/04/2017 06:52 PM, Robert Haas wrote:
> On Mon, Apr 3, 2017 at 6:08 AM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
>> [...]
>> IMHO, the fix would be not to increase the terminated parallel worker
>> count whenever ForgetBackgroundWorker is called due to a bgworker
>> crash. I've attached a patch for the same. PFA.
>
> While I'm not opposed to that approach, I don't think this is a good
> way to implement it. If you want to pass an explicit flag to
> ForgetBackgroundWorker telling it whether or not it should perform
> the increment, fine. But with what you've got here, you're
> essentially relying on "spooky action at a distance". It would be
> easy for future code changes to break this, not realizing that
> somebody's got a hard dependency on 0 having a specific meaning.
>

I'm probably missing something, but I don't quite understand how these
values actually survive the crash. I mean, what I observed is OOM
followed by a restart, so shouldn't BackgroundWorkerShmemInit() simply
reset the values back to 0? Or do we call ForgetBackgroundWorker()
after the crash for some reason?

In any case, the comment right before BackgroundWorkerArray says this:

 * These counters can of course overflow, but it's not important here
 * since the subtraction will still give the right number.

which means that this assert

+ Assert(BackgroundWorkerData->parallel_register_count >=
+        BackgroundWorkerData->parallel_terminate_count);

is outright broken, just like any other attempt to rely on simple
comparisons of these two counters, no?

> BTW, if this isn't on the open items list, it should be. It's
> presumably the fault of b460f5d6693103076dc554aa7cbb96e1e53074f9.
>

Unrelated, but perhaps we should also add this to open items:

https://www.postgresql.org/message-id/flat/57bbc52c-5d40-bb80-2992-7e1027d0b67c%402ndquadrant.com

(although that's more 9.6 material).

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Apr 4, 2017 at 11:22 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> On 04/04/2017 06:52 PM, Robert Haas wrote:
>> [...]
>
> I'm probably missing something, but I don't quite understand how these
> values actually survive the crash. I mean, what I observed is OOM
> followed by a restart, so shouldn't BackgroundWorkerShmemInit() simply
> reset the values back to 0? Or do we call ForgetBackgroundWorker()
> after the crash for some reason?

AFAICU, during crash recovery, we wait for all non-syslogger children
to exit, then reset shmem (call BackgroundWorkerShmemInit) and perform
StartupDataBase. While starting the startup process, we check whether
any bgworker is scheduled for a restart. If a bgworker has crashed and
is not meant for a restart (a parallel worker, in our case), we call
ForgetBackgroundWorker() to discard it.

> In any case, the comment right before BackgroundWorkerArray says this:
>
> * These counters can of course overflow, but it's not important here
> * since the subtraction will still give the right number.
>
> which means that this assert
>
> + Assert(BackgroundWorkerData->parallel_register_count >=
> +        BackgroundWorkerData->parallel_terminate_count);
>
> is outright broken, just like any other attempts to rely on simple
> comparisons of these two counters, no?
>

IIUC, the assert statement ensures that the register count should
always be greater than or equal to the terminate count, whereas the
comment refers to the fact that the register count and terminate count
indicate the total number of parallel workers spawned and terminated,
respectively, since the server was started. We don't reset the counts
at every session, hence they can overflow. But their subtraction
should give the number of active parallel workers at any instant. So,
I'm not able to see how the assert is broken w.r.t. the aforesaid
comment. Am I missing something here?

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com
On Tue, Apr 4, 2017 at 12:16 PM, Neha Khatri <nehakhatri5@gmail.com> wrote:
> I feel there should be an assert if
> (BackgroundWorkerData->parallel_register_count -
> BackgroundWorkerData->parallel_terminate_count) > max_parallel_workers
>

Backend 1 > SET max_parallel_workers = 8;
Backend 1 > Execute a long-running parallel query q1 that spawns, say,
4 parallel workers.
Backend 2 > SET max_parallel_workers = 3;
Backend 2 > Execute a long-running parallel query q2 that spawns > 0
parallel workers.

The above assert statement will bring down the server unnecessarily
while executing q2.

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com
On 04/05/2017 09:05 AM, Kuntal Ghosh wrote:
> On Tue, Apr 4, 2017 at 11:22 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> [...]
>> I'm probably missing something, but I don't quite understand how these
>> values actually survive the crash. I mean, what I observed is OOM
>> followed by a restart, so shouldn't BackgroundWorkerShmemInit() simply
>> reset the values back to 0? Or do we call ForgetBackgroundWorker()
>> after the crash for some reason?
>
> AFAICU, during crash recovery, we wait for all non-syslogger children
> to exit, then reset shmem (call BackgroundWorkerShmemInit) and perform
> StartupDataBase. While starting the startup process we check if any
> bgworker is scheduled for a restart. If a bgworker is crashed and not
> meant for a restart (parallel worker in our case), we call
> ForgetBackgroundWorker() to discard it.
>

OK, so essentially we reset the counters, getting

parallel_register_count = 0
parallel_terminate_count = 0

and then ForgetBackgroundWorker() runs and increments the
terminate_count, breaking the logic?

And you're using (parallel_register_count > 0) to detect whether we're
still in the init phase or not. Hmmm.

>> In any case, the comment right before BackgroundWorkerArray says this:
>>
>> * These counters can of course overflow, but it's not important here
>> * since the subtraction will still give the right number.
>>
>> which means that this assert
>>
>> + Assert(BackgroundWorkerData->parallel_register_count >=
>> +        BackgroundWorkerData->parallel_terminate_count);
>>
>> is outright broken, just like any other attempts to rely on simple
>> comparisons of these two counters, no?
>>
> IIUC, the assert statement ensures that register count should always
> be greater than or equal to the terminate count. [...]
> So, I'm not able to see how the assert is broken w.r.t the aforesaid
> comment. Am I missing something here?
>

The comment says that the counters are allowed to overflow, i.e. after
a long uptime you might get these values

parallel_register_count = UINT_MAX + 2, which wraps to 1
parallel_terminate_count = UINT_MAX

which is fine, because C handles the wraparound during subtraction, so

parallel_register_count - parallel_terminate_count = 2

But the assert is not doing subtraction, it's comparing the values
directly:

Assert(parallel_register_count >= parallel_terminate_count);

and the (perfectly valid) values trivially violate this comparison.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
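Tomas's point can be checked mechanically. Below is a tiny sketch with hypothetical helper names (nothing from bgworker.c): after register_count wraps past UINT_MAX, the unsigned subtraction still yields the right number of active workers, while the direct >= comparison is violated by perfectly valid counter values.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical helpers mirroring the two checks under discussion. */

static unsigned int
active_workers(unsigned int register_count, unsigned int terminate_count)
{
    /*
     * Unsigned subtraction is performed modulo UINT_MAX + 1, so the
     * difference stays correct even after register_count has wrapped.
     */
    return register_count - terminate_count;
}

static bool
naive_assert_holds(unsigned int register_count, unsigned int terminate_count)
{
    /* The direct comparison from the proposed Assert. */
    return register_count >= terminate_count;
}
```

With register_count conceptually at UINT_MAX + 2 (stored as 1) and terminate_count at UINT_MAX, active_workers(1, UINT_MAX) correctly yields 2, yet naive_assert_holds(1, UINT_MAX) is false, i.e. valid counters would trip the Assert.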
On Wed, Apr 5, 2017 at 3:07 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> On 04/05/2017 09:05 AM, Kuntal Ghosh wrote:
>> [...]
>
> OK, so essentially we reset the counters, getting
>
> parallel_register_count = 0
> parallel_terminate_count = 0
>
> and then ForgetBackgroundWorker() runs and increments the
> terminate_count, breaking the logic?
>
> And you're using (parallel_register_count > 0) to detect whether we're
> still in the init phase or not. Hmmm.
>

Yes. But, as Robert suggested up in the thread, we should not use
(parallel_register_count = 0) as an alternative way to detect a
bgworker crash. Hence, I've added an argument named 'wasCrashed' to
ForgetBackgroundWorker to indicate a bgworker crash.

> The comment says that the counters are allowed to overflow, i.e. after
> a long uptime you might get these values
> [...]
> and the (perfectly valid) values trivially violate this comparison.
>

Thanks for the explanation. So, we can't use the above assert
statement. Even the following assert statement will not be helpful:

Assert(parallel_register_count - parallel_terminate_count >= 0);

because it will fail to catch the scenario where
parallel_register_count has not overflowed but is still less than
parallel_terminate_count. :(

It seems to me the only thing we can make sure of is
(parallel_register_count == parallel_terminate_count == 0) in
ForgetBackgroundWorker in case of a crash.

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com
On 04/05/2017 12:36 PM, Kuntal Ghosh wrote:
> On Wed, Apr 5, 2017 at 3:07 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> [...]
>> And you're using (parallel_register_count > 0) to detect whether we're
>> still in the init phase or not. Hmmm.
>>
> Yes. But, as Robert suggested up in the thread, we should not use
> (parallel_register_count = 0) as an alternative to define a bgworker
> crash. Hence, I've added an argument named 'wasCrashed' in
> ForgetBackgroundWorker to indicate a bgworker crash.
>

Sure, and I agree that having an explicit flag is the right solution.
I'm just trying to make sure I understand what's happening.

> Thanks for the explanation. So, we can't use the above assert
> statement. Even the following assert statement will not be helpful:
>
> Assert(parallel_register_count - parallel_terminate_count >= 0);
>
> Because, it'll fail to track the scenario when parallel_register_count
> is not overflowed, still less than parallel_terminate_count. :(
>

Actually, that assert would work, because C does handle overflows on
uint values during subtraction just fine. That is,

(UINT_MAX + x) - UINT_MAX = x

Also, the comment about overflows before BackgroundWorkerArray claims
this is the case.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Apr 5, 2017 at 4:13 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> [...]
>
> Actually, that assert would work, because C does handle overflows on
> uint values during subtraction just fine. That is,
>
> (UINT_MAX + x) - UINT_MAX = x
>
> Also, the comment about overflows before BackgroundWorkerArray claims
> this is the case.
>

Agreed on the overflowed case. But my concern is when an overflow has
not yet happened. Suppose:

uint parallel_register_count = 1;  /* didn't overflow */
uint parallel_terminate_count = 2; /* didn't overflow */

Assert(parallel_register_count - parallel_terminate_count >= 0);

We want the assert statement to fail here, but I think it won't, since
-1 wraps to a valid unsigned int representation that is greater than
0, no?

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com
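Kuntal's worry is easy to confirm: the result of subtracting two unsigned ints is itself unsigned, so a ">= 0" test on it can never fail. A minimal sketch (hypothetical helper name, not PostgreSQL code):

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical helper: the condition of the proposed Assert. */
static bool
diff_is_nonnegative(unsigned int register_count, unsigned int terminate_count)
{
    unsigned int diff = register_count - terminate_count;

    /*
     * diff has unsigned type, so this comparison is vacuously true:
     * 1u - 2u does not yield -1, it wraps around to UINT_MAX.
     */
    return diff >= 0;
}
```

So diff_is_nonnegative(1, 2) returns true even though terminate_count exceeds register_count, which is exactly why the proposed Assert cannot catch that state.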
On 04/05/2017 01:09 PM, Kuntal Ghosh wrote:
> On Wed, Apr 5, 2017 at 4:13 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> [...]
> Agreed on the overflowed case. But my concern is when an overflow has
> not yet happened. Suppose:
>
> uint parallel_register_count = 1;  /* didn't overflow */
> uint parallel_terminate_count = 2; /* didn't overflow */
>
> Assert(parallel_register_count - parallel_terminate_count >= 0);
>
> We want the assert statement to fail here, but I think it won't, since
> -1 wraps to a valid unsigned int representation that is greater than
> 0, no?

I don't follow. How exactly do you get into this situation, assuming
you actually enforce the (register_count > terminate_count) invariant
consistently? In the modulo arithmetic, of course.

But now that I'm thinking about it, the subtraction actually happens
in unsigned ints, so the result will be an unsigned int too, i.e.
always >= 0. Whether it makes sense depends on the invariant being
true. But I think this would work:

Assert(parallel_register_count - parallel_terminate_count <=
       max_parallel_workers);

If the difference gets 'negative' and wraps around, it'll be close to
UINT_MAX.

But man, this unsigned int arithmetic makes my brain hurt ...

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Apr 5, 2017 at 12:35 PM, Kuntal Ghosh
<kuntalghosh.2007@gmail.com> wrote:
> [...]
> AFAICU, during crash recovery, we wait for all non-syslogger children
> to exit, then reset shmem (call BackgroundWorkerShmemInit) and perform
> StartupDataBase. While starting the startup process we check if any
> bgworker is scheduled for a restart.
>

In general, your theory appears right, but can you check how it
behaves on a standby server? There is a difference in how the startup
process behaves during master and standby startup: on a master, it
stops after recovery, whereas on a standby it keeps running to receive
WAL.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Tue, Apr 4, 2017 at 1:52 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> In any case, the comment right before BackgroundWorkerArray says this:
>
> * These counters can of course overflow, but it's not important here
> * since the subtraction will still give the right number.
>
> which means that this assert
>
> + Assert(BackgroundWorkerData->parallel_register_count >=
> +        BackgroundWorkerData->parallel_terminate_count);
>
> is outright broken, just like any other attempts to rely on simple
> comparisons of these two counters, no?

Yeah, that's busted. Good catch.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Apr 5, 2017 at 6:36 AM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
> Yes. But, as Robert suggested up in the thread, we should not use
> (parallel_register_count = 0) as an alternative to define a bgworker
> crash. Hence, I've added an argument named 'wasCrashed' in
> ForgetBackgroundWorker to indicate a bgworker crash.

Did you intend to attach that patch to this email?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Apr 5, 2017 at 7:31 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Apr 5, 2017 at 6:36 AM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
>> Yes. But, as Robert suggested up in the thread, we should not use
>> (parallel_register_count = 0) as an alternative to define a bgworker
>> crash. Hence, I've added an argument named 'wasCrashed' in
>> ForgetBackgroundWorker to indicate a bgworker crash.
>
> Did you intend to attach that patch to this email?
>

Actually, I'm confused about how we should ensure the (register_count
>= terminate_count) invariant. I think the assert Tomas suggested up
in the thread can bring down the system:

Assert(parallel_register_count - parallel_terminate_count <=
       max_parallel_workers);

Backend 1 > SET max_parallel_workers = 8;
Backend 1 > Execute a long-running parallel query q1 that spawns, say,
4 parallel workers.
Backend 2 > SET max_parallel_workers = 3;
Backend 2 > Try to execute any parallel query q2 that spawns > 0
parallel workers.

The above assert statement will bring down the server unnecessarily
while executing q2. If the assert statement were not there, it could
have gone ahead without launching any workers.

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com
On Wed, Apr 5, 2017 at 10:09 AM, Kuntal Ghosh
<kuntalghosh.2007@gmail.com> wrote:
>> Did you intend to attach that patch to this email?
>>
> Actually, I'm confused about how we should ensure the (register_count
> >= terminate_count) invariant. I think the assert Tomas suggested up
> in the thread can bring down the system:
>
> Assert(parallel_register_count - parallel_terminate_count <=
>        max_parallel_workers);
>
> Backend 1 > SET max_parallel_workers = 8;
> Backend 1 > Execute a long-running parallel query q1 that spawns, say,
> 4 parallel workers.

At this point, parallel_register_count should be equal to
parallel_terminate_count. 4 workers were started, and 4 have
terminated.

> Backend 2 > SET max_parallel_workers = 3;
> Backend 2 > Try to execute any parallel query q2 that spawns > 0
> parallel workers.

Now here parallel_register_count should get bumped up to 4 + (# of
workers now launched) at the beginning, and then
parallel_terminate_count at the end. No problem.

What's going wrong, here?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 04/05/2017 04:09 PM, Kuntal Ghosh wrote:
> [...]
> The above assert statement will bring down the server unnecessarily
> while executing q2. If the assert statement were not there, it could
> have gone ahead without launching any workers.
>

Ah, right. I forgot max_parallel_workers may be changed per session. I
think we can use max_worker_processes instead:

Assert(parallel_register_count - parallel_terminate_count <=
       max_worker_processes);

The whole point is that if parallel_terminate_count exceeds
parallel_register_count, the subtraction wraps around to a value close
to UINT_MAX. All we need is a maximum possible delta between the two
values to detect this, and max_worker_processes seems fine. We could
also use UINT_MAX/2, for example.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
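The bound-based check described above can be sketched as follows. The macro is an illustrative stand-in for the max_worker_processes setting, and the helper name is hypothetical: an excess of terminations wraps the difference to a value near UINT_MAX, far above any plausible worker limit, while a legitimately wrapped register_count still passes.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Illustrative stand-in for the max_worker_processes setting. */
#define MAX_WORKER_PROCESSES 8u

/* Hypothetical form of the wraparound-safe sanity check. */
static bool
active_count_sane(unsigned int register_count, unsigned int terminate_count)
{
    /*
     * If terminate_count ever exceeded register_count, the unsigned
     * difference wraps to a value close to UINT_MAX, far above any
     * possible number of worker processes, so this bound catches the
     * violation that a ">= 0" test cannot.
     */
    return register_count - terminate_count <= MAX_WORKER_PROCESSES;
}
```

For instance, counters of (4, 0) pass, an overflowed-but-valid pair like (1, UINT_MAX) also passes since the difference is 2, while an invalid pair like (1, 2) fails because the difference wraps to UINT_MAX.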
On 04/05/2017 04:15 PM, Robert Haas wrote:
> On Wed, Apr 5, 2017 at 10:09 AM, Kuntal Ghosh
> <kuntalghosh.2007@gmail.com> wrote:
>>> Did you intend to attach that patch to this email?
>>>
>> Actually, I'm confused how we should ensure (register_count >
>> terminate_count) invariant. I think there can be a system crash what
>> Tomas has suggested up in the thread.
>>
>> Assert(parallel_register_count - parallel_terminate_count <=
>> max_parallel_workers);
>> Backend 1 > SET max_parallel_worker = 8;
>> Backend 1 > Execute a long running parallel query q1 with number of
>> parallel worker spawned is say, 4.
>
> At this point, parallel_register_count should be equal to
> parallel_terminate_count. 4 workers were started, and 4 have
> terminated.
>

No, the workers were started, but are still running, so

    register_count = 4
    terminate_count = 0

>> Backend 2> SET max_parallel_worker = 3;
>> Backend 2 > Try to execute any parallel query q2 with number of
>> parallel worker spawned > 0.
>
> Now here parallel_register_count should get bumped up to 4+(# of
> workers now launched) at the beginning and then
> parallel_terminate_count at the end. No problem.
>
> What's going wrong, here?
>

Well, assuming the other workers are still running, you get

    register_count = 4
    terminate_count = 0

and the difference is greater than max_parallel_workers = 3. Which
violates the assert.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Apr 5, 2017 at 7:45 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Apr 5, 2017 at 10:09 AM, Kuntal Ghosh
> <kuntalghosh.2007@gmail.com> wrote:
>>> Did you intend to attach that patch to this email?
>>>
>> Actually, I'm confused how we should ensure (register_count >
>> terminate_count) invariant. I think there can be a system crash what
>> Tomas has suggested up in the thread.
>>
>> Assert(parallel_register_count - parallel_terminate_count <=
>> max_parallel_workers);
>> Backend 1 > SET max_parallel_worker = 8;
>> Backend 1 > Execute a long running parallel query q1 with number of
>> parallel worker spawned is say, 4.
>
> At this point, parallel_register_count should be equal to
> parallel_terminate_count. 4 workers were started, and 4 have
> terminated.
>
Actually, I'm referring to the case when q1 is still running. In that
case, parallel_register_count = 4 and parallel_terminate_count = 0.

>> Backend 2> SET max_parallel_worker = 3;
Now, parallel_register_count - parallel_terminate_count = 4 >
max_parallel_worker.

>> Backend 2 > Try to execute any parallel query q2 with number of
>> parallel worker spawned > 0.
>
Hence, the assert will fail here.

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com
On 04/05/2017 04:26 PM, Kuntal Ghosh wrote:
> On Wed, Apr 5, 2017 at 7:45 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Wed, Apr 5, 2017 at 10:09 AM, Kuntal Ghosh
>> <kuntalghosh.2007@gmail.com> wrote:
>>>> Did you intend to attach that patch to this email?
>>>>
>>> Actually, I'm confused how we should ensure (register_count >
>>> terminate_count) invariant. I think there can be a system crash what
>>> Tomas has suggested up in the thread.
>>>
>>> Assert(parallel_register_count - parallel_terminate_count <=
>>> max_parallel_workers);
>>> Backend 1 > SET max_parallel_worker = 8;
>>> Backend 1 > Execute a long running parallel query q1 with number of
>>> parallel worker spawned is say, 4.
>>
>> At this point, parallel_register_count should be equal to
>> parallel_terminate_count. 4 workers were started, and 4 have
>> terminated.
>>
> Actually, I'm referring to the case when q1 is still running. In that
> case, parallel_register_count = 4 and parallel_terminate_count = 0.
>
>>> Backend 2> SET max_parallel_worker = 3;
> Now, parallel_register_count - parallel_terminate_count = 4 >
> max_parallel_worker.
>
>>> Backend 2 > Try to execute any parallel query q2 with number of
>>> parallel worker spawned > 0.
>>
> Hence, the assert will fail here.
>

Actually, you probably don't even need two sessions to trigger the
assert. All you need is to tweak max_parallel_workers and then reload
the config while the parallel query is running. Then
ForgetBackgroundWorker() will see the new value.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Apr 5, 2017 at 5:34 PM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
> On Tue, Apr 4, 2017 at 12:16 PM, Neha Khatri <nehakhatri5@gmail.com> wrote:
>> I feel there should be an assert if
>> (BackgroundWorkerData->parallel_register_count -
>> BackgroundWorkerData->parallel_terminate_count) > max_parallel_workers)
>>
> Backend 1 > SET max_parallel_worker = 8;
> Backend 1 > Execute a long running parallel query q1 with number of
> parallel worker spawned is say, 4.
> Backend 2> SET max_parallel_worker = 3;
> Backend 2 > Execute a long running parallel query q2 with number of
> parallel worker spawned > 0.
> The above assert statement will bring down the server unnecessarily
> while executing q2.
Right, with multiple backends trying to fiddle with max_parallel_workers, that might bring the server down with the said assert:
Assert(parallel_register_count - parallel_terminate_count <= max_parallel_workers)
The problem here seems to be the change in the max_parallel_workers value while the parallel workers are still under execution. So this poses two questions:
1. From a use-case point of view, why would there be a need to tweak max_parallel_workers exactly when parallel workers are at play?
2. Could there be a restriction on tweaking max_parallel_workers while parallel workers are at play? At least, do not allow setting max_parallel_workers to less than the current # of active parallel workers.
Regards,
Neha
On Wed, Apr 5, 2017 at 8:17 PM, Neha Khatri <nehakhatri5@gmail.com> wrote:
> The problem here seem to be the change in the max_parallel_workers value
> while the parallel workers are still under execution. So this poses two
> questions:
>
> 1. From usecase point of view, why could there be a need to tweak the
> max_parallel_workers exactly at the time when the parallel workers are at
> play.
> 2. Could there be a restriction on tweaking of max_parallel_workers while
> the parallel workers are at play? At least do not allow setting the
> max_parallel_workers less than the current # of active parallel workers.

Well, that would be letting the tail wag the dog. The maximum value of
max_parallel_workers is only 1024, and what we're really worried about
here is seeing a value near PG_UINT32_MAX, which leaves a lot of
daylight. How about just creating a #define that's used by guc.c as
the maximum for the GUC, and here we assert that we're <= that value?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Apr 5, 2017 at 6:49 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, Apr 5, 2017 at 12:35 PM, Kuntal Ghosh
> <kuntalghosh.2007@gmail.com> wrote:
>> On Tue, Apr 4, 2017 at 11:22 PM, Tomas Vondra
>>> I'm probably missing something, but I don't quite understand how these
>>> values actually survive the crash. I mean, what I observed is OOM followed
>>> by a restart, so shouldn't BackgroundWorkerShmemInit() simply reset the
>>> values back to 0? Or do we call ForgetBackgroundWorker() after the crash for
>>> some reason?
>> AFAICU, during crash recovery, we wait for all non-syslogger children
>> to exit, then reset shmem (call BackgroundWorkerShmemInit) and perform
>> StartupDataBase. While starting the startup process, we check if any
>> bgworker is scheduled for a restart.
>>
>
> In general, your theory appears right, but can you check how it
> behaves in standby server because there is a difference in how the
> startup process behaves during master and standby startup? In master,
> it stops after recovery whereas in standby it will keep on running to
> receive WAL.
>

While performing StartupDatabase, both master and standby servers
behave in a similar way until postmaster spawns the startup process.
In master, the startup process completes its job and dies. As a
result, reaper is called, which in turn calls maybe_start_bgworker().
In standby, after getting a valid snapshot, the startup process sends
postmaster a signal to enable connections. The signal handler in
postmaster calls maybe_start_bgworker().

In maybe_start_bgworker(), if we find a crashed bgworker (crashed_at
!= 0) with a NEVER RESTART flag, we call ForgetBackgroundWorker() to
forget the bgworker process.

I've attached the patch for adding an argument in
ForgetBackgroundWorker() to indicate a crashed situation. Based on
that, we can take the necessary actions. I've not included the Assert
statement in this patch.

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com