Thread: [HACKERS] intermittent failures in Cygwin from select_parallel tests
Buildfarm member lorikeet is failing occasionally with a failed assertion during the select_parallel regression tests like this:

2017-06-03 05:12:37.382 EDT [59327d84.1160:38] LOG: statement: select count(*) from tenk1, tenk2 where tenk1.hundred > 1 and tenk2.thousand=0;
TRAP: FailedAssertion("!(vmq->mq_sender == ((void *)0))", File: "/home/andrew/bf64/root/HEAD/pgsql.build/../pgsql/src/backend/storage/ipc/shm_mq.c", Line: 221)

I'll see if I can find out why, but if anyone has any ideas why this might be happening (started about 3 weeks ago) that would be helpful. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes: > Buildfarm member lorikeet is failing occasionally with a failed > assertion during the select_parallel regression tests like this:

> 2017-06-03 05:12:37.382 EDT [59327d84.1160:38] LOG: statement: select count(*) from tenk1, tenk2 where tenk1.hundred > 1 and tenk2.thousand=0;
> TRAP: FailedAssertion("!(vmq->mq_sender == ((void *)0))", File: "/home/andrew/bf64/root/HEAD/pgsql.build/../pgsql/src/backend/storage/ipc/shm_mq.c", Line: 221)
> I'll see if I can find out why, but if anyone has any ideas why this might be happening (started about 3 weeks ago) that would be helpful.

Well, this seems like it has to indicate an incorrect call of shm_mq_set_sender. I have no great insights as to what might be causing that, but I sure find it to be pretty unacceptable coding practice that the call sites are not checking for failure returns from shm_toc_lookup. regards, tom lane
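For illustration, here is a minimal sketch of the defensive pattern being asked for. The helper name lookup_or_die is hypothetical and the two-argument shm_toc_lookup() signature is an assumption; this is not code from the thread.

#include "postgres.h"
#include "storage/shm_toc.h"

/*
 * Hypothetical helper: error out when a TOC key is missing, instead of
 * handing a NULL pointer to shm_mq_set_sender() and friends downstream.
 */
static void *
lookup_or_die(shm_toc *toc, uint64 key)
{
    void       *ptr = shm_toc_lookup(toc, key);

    if (ptr == NULL)
        ereport(ERROR,
                (errcode(ERRCODE_INTERNAL_ERROR),
                 errmsg("could not find key " UINT64_FORMAT " in shared memory TOC",
                        key)));
    return ptr;
}

A call site that routed its lookups through something like this would fail with a clear message rather than tripping the assertion inside shm_mq_set_sender() later.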
On Mon, Jun 5, 2017 at 10:40 AM, Andrew Dunstan <andrew.dunstan@2ndquadrant.com> wrote: > Buildfarm member lorikeet is failing occasionally with a failed > assertion during the select_parallel regression tests like this:

> 2017-06-03 05:12:37.382 EDT [59327d84.1160:38] LOG: statement: select count(*) from tenk1, tenk2 where tenk1.hundred > 1 and tenk2.thousand=0;
> TRAP: FailedAssertion("!(vmq->mq_sender == ((void *)0))", File: "/home/andrew/bf64/root/HEAD/pgsql.build/../pgsql/src/backend/storage/ipc/shm_mq.c", Line: 221)
>
> I'll see if I can find out why, but if anyone has any ideas why this might be happening (started about 3 weeks ago) that would be helpful.

I don't *think* we've made any relevant code changes lately. The only thing that I can see as looking at all relevant is b6dd1271281ce856ab774fc0b491a92878e3b501, but that doesn't really seem like it can be to blame.

One thought is that the only places where shm_mq_set_sender() should be getting invoked during the main regression tests are ParallelWorkerMain() and ExecParallelGetReceiver, and both of those places using ParallelWorkerNumber to figure out what address to pass. So if ParallelWorkerNumber were getting set to the same value in two different parallel workers - e.g. because the postmaster went nuts and launched two processes instead of only one - or if ParallelWorkerNumber were not getting initialized at all or were getting initialized to some completely bogus value, it could cause this symptom. What ought to be happening, if there are N workers launched by a parallel query, is that ParallelWorkerNumber should be different in every worker, over the range 0 to N-1. I think if I were you my first step would be to verify that ParallelWorkerNumber is in that range in the crashed backend, and if it is, my second step would be to add some debugging code to ParallelWorkerMain() to print it out in every worker that gets launched and make sure they're all in range and different.

All of the above might be going in the wrong direction entirely, but it's the first thing that comes to mind for me. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Mon, Jun 5, 2017 at 10:40 AM, Andrew Dunstan > <andrew.dunstan@2ndquadrant.com> wrote: >> Buildfarm member lorikeet is failing occasionally with a failed >> assertion during the select_parallel regression tests like this: > I don't *think* we've made any relevant code changes lately. The only > thing that I can see as looking at all relevant is > b6dd1271281ce856ab774fc0b491a92878e3b501, but that doesn't really seem > like it can be to blame. Yeah, I don't believe that either. That could have introduced a hard failure (if something were relying on initializing a field before where I put the memsets) but it's hard to see how it could produce an intermittent and platform-specific one. > One thought is that the only places where shm_mq_set_sender() should > be getting invoked during the main regression tests are > ParallelWorkerMain() and ExecParallelGetReceiver, and both of those > places using ParallelWorkerNumber to figure out what address to pass. > So if ParallelWorkerNumber were getting set to the same value in two > different parallel workers - e.g. because the postmaster went nuts and > launched two processes instead of only one - or if > ParallelWorkerNumber were not getting initialized at all or were > getting initialized to some completely bogus value, it could cause > this symptom. Hmm. With some generous assumptions it'd be possible to think that aa1351f1eec4adae39be59ce9a21410f9dd42118 triggered this. That commit was present in 20 successful lorikeet runs before the first of these failures, which is a bit more than the MTBF after that, but not a huge amount more. That commit in itself looks innocent enough, but could it have exposed some latent bug in bgworker launching? regards, tom lane
On Tue, Jun 6, 2017 at 2:21 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> One thought is that the only places where shm_mq_set_sender() should >> be getting invoked during the main regression tests are >> ParallelWorkerMain() and ExecParallelGetReceiver, and both of those >> places using ParallelWorkerNumber to figure out what address to pass. >> So if ParallelWorkerNumber were getting set to the same value in two >> different parallel workers - e.g. because the postmaster went nuts and >> launched two processes instead of only one - or if >> ParallelWorkerNumber were not getting initialized at all or were >> getting initialized to some completely bogus value, it could cause >> this symptom. > > Hmm. With some generous assumptions it'd be possible to think that > aa1351f1eec4adae39be59ce9a21410f9dd42118 triggered this. That commit was > present in 20 successful lorikeet runs before the first of these failures, > which is a bit more than the MTBF after that, but not a huge amount more. > > That commit in itself looks innocent enough, but could it have exposed > some latent bug in bgworker launching? Hmm, that's a really interesting idea, but I can't quite put together a plausible theory around it. I mean, it seems like that commit could make launching bgworkers faster, which could conceivably tickle some heretofore-latent timing-related bug. But it wouldn't, IIUC, make the first worker start any faster than before - it would just make them more closely-spaced thereafter, and it's not very obvious how that would cause a problem. Another idea is that the commit in question is managing to corrupt BackgroundWorkerList somehow. maybe_start_bgworkers() is using slist_foreach_modify(), but previously it always returned after calling do_start_bgworker, and now it doesn't. So if do_start_bgworker() did something that could modify the list structure, then perhaps maybe_start_bgworkers() would get confused. I don't really think that this theory has any legs, though. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
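As background for that theory, slist_foreach_modify() only promises to cope with deletion of the node the iterator is currently standing on; a toy example of the safe pattern (hypothetical code, not taken from the postmaster) is:

#include "postgres.h"
#include "lib/ilist.h"

typedef struct Item
{
    slist_node  node;
    int         value;
} Item;

/* Remove negative entries from an intrusive list of palloc'd items. */
static void
prune_negative(slist_head *list)
{
    slist_mutable_iter iter;

    slist_foreach_modify(iter, list)
    {
        Item       *it = slist_container(Item, node, iter.cur);

        if (it->value < 0)
        {
            slist_delete_current(&iter);    /* safe: only the current node */
            pfree(it);
        }
        /* Adding or removing *other* nodes here could confuse the iterator. */
    }
}

If do_start_bgworker() somehow added or removed other entries of BackgroundWorkerList while maybe_start_bgworkers() was mid-iteration, the iterator's saved next pointer could reference a freed or relinked node; that is the sort of corruption being speculated about (and then dismissed) above.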
Robert Haas <robertmhaas@gmail.com> writes: > On Tue, Jun 6, 2017 at 2:21 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Hmm. With some generous assumptions it'd be possible to think that >> aa1351f1eec4adae39be59ce9a21410f9dd42118 triggered this. That commit was >> present in 20 successful lorikeet runs before the first of these failures, >> which is a bit more than the MTBF after that, but not a huge amount more. >> That commit in itself looks innocent enough, but could it have exposed >> some latent bug in bgworker launching? > Hmm, that's a really interesting idea, but I can't quite put together > a plausible theory around it.

Yeah, me either, but we're really theorizing in advance of the data here. Andrew, could you apply the attached patch on lorikeet and run the regression tests enough times to get a couple of failures? Then grepping the postmaster log for 'parallel worker' should give you results like

2017-06-06 16:20:12.393 EDT [31216] LOG: starting PID 31216, parallel worker for PID 31215, worker number 0
2017-06-06 16:20:12.400 EDT [31216] LOG: stopping PID 31216, parallel worker for PID 31215, worker number 0
2017-06-06 16:20:12.406 EDT [31217] LOG: starting PID 31217, parallel worker for PID 31215, worker number 3
2017-06-06 16:20:12.406 EDT [31218] LOG: starting PID 31218, parallel worker for PID 31215, worker number 2
2017-06-06 16:20:12.406 EDT [31219] LOG: starting PID 31219, parallel worker for PID 31215, worker number 1
2017-06-06 16:20:12.406 EDT [31220] LOG: starting PID 31220, parallel worker for PID 31215, worker number 0
2017-06-06 16:20:12.412 EDT [31218] LOG: stopping PID 31218, parallel worker for PID 31215, worker number 2
2017-06-06 16:20:12.412 EDT [31219] LOG: stopping PID 31219, parallel worker for PID 31215, worker number 1
2017-06-06 16:20:12.412 EDT [31220] LOG: stopping PID 31220, parallel worker for PID 31215, worker number 0
2017-06-06 16:20:12.412 EDT [31217] LOG: stopping PID 31217, parallel worker for PID 31215, worker number 3
... etc etc ...

If it looks different from that in a crash case, we'll have something to go on.

(I'm tempted to add something like this permanently, at DEBUG1 or DEBUG2 or so.)

regards, tom lane

diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index cb22174..d3cb26c 100644
*** a/src/backend/access/transam/parallel.c
--- b/src/backend/access/transam/parallel.c
*************** ParallelWorkerMain(Datum main_arg)
*** 950,955 ****
--- 950,961 ----
  	Assert(ParallelWorkerNumber == -1);
  	memcpy(&ParallelWorkerNumber, MyBgworkerEntry->bgw_extra, sizeof(int));
  
+ 	/* Log parallel worker startup. */
+ 	ereport(LOG,
+ 			(errmsg("starting PID %d, %s, worker number %d",
+ 					MyProcPid, MyBgworkerEntry->bgw_name,
+ 					ParallelWorkerNumber)));
+ 
  	/* Set up a memory context and resource owner. */
  	Assert(CurrentResourceOwner == NULL);
  	CurrentResourceOwner = ResourceOwnerCreate(NULL, "parallel toplevel");
*************** ParallelWorkerMain(Datum main_arg)
*** 1112,1117 ****
--- 1118,1129 ----
  
  	/* Report success. */
  	pq_putmessage('X', NULL, 0);
+ 
+ 	/* Log parallel worker shutdown. */
+ 	ereport(LOG,
+ 			(errmsg("stopping PID %d, %s, worker number %d",
+ 					MyProcPid, MyBgworkerEntry->bgw_name,
+ 					ParallelWorkerNumber)));
  }
  
  /*
On Tue, Jun 6, 2017 at 4:25 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > (I'm tempted to add something like this permanently, at DEBUG1 or DEBUG2 > or so.) I don't mind adding it permanently, but I think that's too high. Somebody running a lot of parallel queries could easily get enough messages to drown out the stuff they actually want to see at that level. I wouldn't object to DEBUG3 though. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Jun 7, 2017 at 12:37 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Jun 6, 2017 at 2:21 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> One thought is that the only places where shm_mq_set_sender() should >>> be getting invoked during the main regression tests are >>> ParallelWorkerMain() and ExecParallelGetReceiver, and both of those >>> places using ParallelWorkerNumber to figure out what address to pass. >>> So if ParallelWorkerNumber were getting set to the same value in two >>> different parallel workers - e.g. because the postmaster went nuts and >>> launched two processes instead of only one - or if >>> ParallelWorkerNumber were not getting initialized at all or were >>> getting initialized to some completely bogus value, it could cause >>> this symptom. >> >> Hmm. With some generous assumptions it'd be possible to think that >> aa1351f1eec4adae39be59ce9a21410f9dd42118 triggered this. That commit was >> present in 20 successful lorikeet runs before the first of these failures, >> which is a bit more than the MTBF after that, but not a huge amount more. >> >> That commit in itself looks innocent enough, but could it have exposed >> some latent bug in bgworker launching? > > Hmm, that's a really interesting idea, but I can't quite put together > a plausible theory around it. I mean, it seems like that commit could > make launching bgworkers faster, which could conceivably tickle some > heretofore-latent timing-related bug. But it wouldn't, IIUC, make the > first worker start any faster than before - it would just make them > more closely-spaced thereafter, and it's not very obvious how that > would cause a problem. > > Another idea is that the commit in question is managing to corrupt > BackgroundWorkerList somehow. > I don't think so because this problem has been reported previously as well [1][2] even before the commit in question. [1] - https://www.postgresql.org/message-id/1ce5a19f-3b1d-bb1c-4561-0158176f65f1%40dunslane.net [2] - https://www.postgresql.org/message-id/25861.1472215822%40sss.pgh.pa.us -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Wed, Jun 7, 2017 at 6:36 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > I don't think so because this problem has been reported previously as > well [1][2] even before the commit in question. > > [1] - https://www.postgresql.org/message-id/1ce5a19f-3b1d-bb1c-4561-0158176f65f1%40dunslane.net > [2] - https://www.postgresql.org/message-id/25861.1472215822%40sss.pgh.pa.us Oh, good point. So this is a longstanding bug that has possibly just gotten easier to hit. I still think figuring out what's going on with the ParallelWorkerNumbers is a good plan. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Wed, Jun 7, 2017 at 6:36 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> I don't think so because this problem has been reported previously as >> well [1][2] even before the commit in question. >> >> [1] - https://www.postgresql.org/message-id/1ce5a19f-3b1d-bb1c-4561-0158176f65f1%40dunslane.net >> [2] - https://www.postgresql.org/message-id/25861.1472215822%40sss.pgh.pa.us > Oh, good point. So this is a longstanding bug that has possibly just > gotten easier to hit. > I still think figuring out what's going on with the > ParallelWorkerNumbers is a good plan. Also, your suggestion in https://www.postgresql.org/message-id/CA%2BTgmob29v0zASBNfgO1Mq9yJ7_TRoAjL%3DO%2B2rXS0gBZezv%2BrQ%40mail.gmail.com for a quicker path to reproducing it might still be valid. Although at this point, it seems like something we've changed recently has made it occur often enough in the buildfarm that repeating the standard regression tests should be good enough. regards, tom lane
Yesterday lorikeet failed the select_parallel test in a new way:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2017-06-13%2020%3A28%3A33

2017-06-13 16:44:57.247 EDT [59404ec9.2e78:1] ERROR: could not map dynamic shared memory segment
2017-06-13 16:44:57.248 EDT [59404dec.2d9c:5] LOG: worker process: parallel worker for PID 10072 (PID 11896) exited with exit code 1
2017-06-13 16:44:57.273 EDT [59404ec6.2758:64] LOG: statement: select stringu1::int2 from tenk1 where unique1 = 1;
TRAP: FailedAssertion("!(BackgroundWorkerData->parallel_register_count - BackgroundWorkerData->parallel_terminate_count <= 1024)", File: "/home/andrew/bf64/root/HEAD/pgsql.build/../pgsql/src/backend/postmaster/bgworker.c", Line: 974)
2017-06-13 16:45:02.652 EDT [59404dec.2d9c:6] LOG: server process (PID 10072) was terminated by signal 6: Aborted
2017-06-13 16:45:02.652 EDT [59404dec.2d9c:7] DETAIL: Failed process was running: select stringu1::int2 from tenk1 where unique1 = 1;
2017-06-13 16:45:02.652 EDT [59404dec.2d9c:8] LOG: terminating any other active server processes

So the first problem here is the lack of supporting information for the 'could not map' failure. AFAICS, this must mean either that dsm_attach() returned without ever calling dsm_impl_op() at all, or that dsm_impl_op()'s Windows codepath encountered ERROR_ALREADY_EXISTS or ERROR_ACCESS_DENIED. It's far from clear why those cases should be treated as a silent fail. It's even less clear why dsm_attach's early exit cases don't produce any messages. But since we're left not knowing what happened, the messaging design here is clearly inadequate.

The subsequent assertion crash came from here:

    /*
     * If this is a parallel worker, check whether there are already too many
     * parallel workers; if so, don't register another one.  Our view of
     * parallel_terminate_count may be slightly stale, but that doesn't really
     * matter: we would have gotten the same result if we'd arrived here
     * slightly earlier anyway.  There's no help for it, either, since the
     * postmaster must not take locks; a memory barrier wouldn't guarantee
     * anything useful.
     */
    if (parallel && (BackgroundWorkerData->parallel_register_count -
                     BackgroundWorkerData->parallel_terminate_count) >=
        max_parallel_workers)
    {
        Assert(BackgroundWorkerData->parallel_register_count -
               BackgroundWorkerData->parallel_terminate_count <=
               MAX_PARALLEL_WORKER_LIMIT);
        LWLockRelease(BackgroundWorkerLock);
        return false;
    }

What I suspect happened is that parallel_register_count was less than parallel_terminate_count; since those counters are declared as uint32, the negative difference would have been treated as a large unsigned value, triggering the assertion. It's not very clear how that happened, but my bet is that the postmaster incremented parallel_terminate_count more than once while cleaning up after the crashed worker. It looks to me like there's nothing stopping BackgroundWorkerStateChange from incrementing it and then the eventual ForgetBackgroundWorker call from incrementing it again. I haven't traced through things to identify why this might only occur in a worker-failure scenario, but surely we want to make sure that the counter increment happens once and only once per worker.

I'm also noting that ForgetBackgroundWorker is lacking a memory barrier between incrementing parallel_terminate_count and marking the slot free. Maybe it doesn't need one, but since there is one in BackgroundWorkerStateChange, it looks weird to not have it here.

regards, tom lane
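To make the wraparound mechanics concrete, here is a standalone illustration (plain C, not PostgreSQL source) of what the assertion sees if parallel_terminate_count ever runs one ahead of parallel_register_count:

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
    uint32_t    register_count = 5;
    uint32_t    terminate_count = 6;    /* hypothetical double-counted termination */
    uint32_t    active = register_count - terminate_count;

    /* Prints 4294967295, far above MAX_PARALLEL_WORKER_LIMIT (1024). */
    printf("%u\n", (unsigned) active);
    return 0;
}

The same huge value also satisfies the ">= max_parallel_workers" test, which is how control reaches the Assert in the first place.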
On Wed, Jun 14, 2017 at 3:33 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > So the first problem here is the lack of supporting information for the > 'could not map' failure.

Hmm. I think I believed at the time I wrote dsm_attach() that somebody might want to try to soldier on after failing to map a DSM, but that doesn't seem very likely any more. That theory is supported by this comment:

    /*
     * If we didn't find the handle we're looking for in the control segment,
     * it probably means that everyone else who had it mapped, including the
     * original creator, died before we got to this point.  It's up to the
     * caller to decide what to do about that.
     */

But in practice, every current caller handles a failure here by bailing out.

> AFAICS, this must mean either that dsm_attach() > returned without ever calling dsm_impl_op() at all, or that > dsm_impl_op()'s Windows codepath encountered ERROR_ALREADY_EXISTS or > ERROR_ACCESS_DENIED. It's far from clear why those cases should be > treated as a silent fail.

There's a good reason for that, though. See 419113dfdc4c729f6c763cc30a9b02ee68a7da94.

> It's even less clear why dsm_attach's early > exit cases don't produce any messages. But since we're left not knowing > what happened, the messaging design here is clearly inadequate.

I don't know what you mean by this. The function only has one early-exit case, the comment for which I quoted above.

> The subsequent assertion crash came from here:
>
> /*
>  * If this is a parallel worker, check whether there are already too many
>  * parallel workers; if so, don't register another one.  Our view of
>  * parallel_terminate_count may be slightly stale, but that doesn't really
>  * matter: we would have gotten the same result if we'd arrived here
>  * slightly earlier anyway.  There's no help for it, either, since the
>  * postmaster must not take locks; a memory barrier wouldn't guarantee
>  * anything useful.
>  */
> if (parallel && (BackgroundWorkerData->parallel_register_count -
>                  BackgroundWorkerData->parallel_terminate_count) >=
>     max_parallel_workers)
> {
>     Assert(BackgroundWorkerData->parallel_register_count -
>            BackgroundWorkerData->parallel_terminate_count <=
>            MAX_PARALLEL_WORKER_LIMIT);
>     LWLockRelease(BackgroundWorkerLock);
>     return false;
> }

> What I suspect happened is that parallel_register_count was less than > parallel_terminate_count; since those counters are declared as uint32, > the negative difference would have been treated as a large unsigned value, > triggering the assertion.

Right, and that's exactly the point of having that assertion.

> It's not very clear how that happened, but my > bet is that the postmaster incremented parallel_terminate_count more than > once while cleaning up after the crashed worker. It looks to me like > there's nothing stopping BackgroundWorkerStateChange from incrementing it > and then the eventual ForgetBackgroundWorker call from incrementing it > again. I haven't traced through things to identify why this might only > occur in a worker-failure scenario, but surely we want to make sure that > the counter increment happens once and only once per worker.

Yeah -- if that can happen, it's definitely a bug.

> I'm also noting that ForgetBackgroundWorker is lacking a memory barrier > between incrementing parallel_terminate_count and marking the slot free. > Maybe it doesn't need one, but since there is one in > BackgroundWorkerStateChange, it looks weird to not have it here.

I noticed that the other day and thought about mentioning it, but didn't quite muster up the energy. 
It seems unlikely to me to be the cause of any of the problems we're seeing, but I think it can't hurt to fix the omission. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Wed, Jun 14, 2017 at 3:33 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> So the first problem here is the lack of supporting information for the >> 'could not map' failure. > Hmm. I think I believed at the time I wrote dsm_attach() that > somebody might want to try to soldier on after failing to map a DSM, > but that doesn't seem very likely any more. Well, if they do, they shouldn't be passing elevel == ERROR. >> AFAICS, this must mean either that dsm_attach() >> returned without ever calling dsm_impl_op() at all, or that >> dsm_impl_op()'s Windows codepath encountered ERROR_ALREADY_EXISTS or >> ERROR_ACCESS_DENIED. It's far from clear why those cases should be >> treated as a silent fail. > There's a good reason for that, though. See > 419113dfdc4c729f6c763cc30a9b02ee68a7da94. But surely the silent treatment should only apply to DSM_OP_CREATE? We're not going to retry anything else. >> It's even less clear why dsm_attach's early >> exit cases don't produce any messages. But since we're left not knowing >> what happened, the messaging design here is clearly inadequate. > I don't know what you mean by this. The function only has one > early-exit case, the comment for which I quoted above. OK, s/cases/case/, but the problem remains. We don't know what happened. We cannot have more than one case in this code where nothing gets logged. >> It's not very clear how that happened, but my >> bet is that the postmaster incremented parallel_terminate_count more than >> once while cleaning up after the crashed worker. It looks to me like >> there's nothing stopping BackgroundWorkerStateChange from incrementing it >> and then the eventual ForgetBackgroundWorker call from incrementing it >> again. I haven't traced through things to identify why this might only >> occur in a worker-failure scenario, but surely we want to make sure that >> the counter increment happens once and only once per worker. > Yeah -- if that can happen, it's definitely a bug. My first thought about fixing it is that we should remove that code from BackgroundWorkerStateChange altogether. The parallel_terminate_count increment should happen in, and only in, ForgetBackgroundWorker. There seems little reason to risk bugs by trying to do it a bit earlier. regards, tom lane
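A rough sketch of that arrangement, with the barrier discussed above, might look like the following inside bgworker.c. Field and function names follow the existing code, but this is an assumption-laden paraphrase written for illustration, not an actual patch.

static void
ForgetBackgroundWorker_sketch(slist_mutable_iter *cur)
{
    RegisteredBgWorker *rw;
    BackgroundWorkerSlot *slot;

    rw = slist_container(RegisteredBgWorker, rw_lnode, cur->cur);

    Assert(rw->rw_shmem_slot < max_worker_processes);
    slot = &BackgroundWorkerData->slot[rw->rw_shmem_slot];

    /* The one and only place the terminate counter gets bumped. */
    if ((rw->rw_worker.bgw_flags & BGWORKER_CLASS_PARALLEL) != 0)
        BackgroundWorkerData->parallel_terminate_count++;

    /* Make the counter update visible before the slot can be reused. */
    pg_memory_barrier();
    slot->in_use = false;

    slist_delete_current(cur);
    free(rw);
}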
I wrote: > But surely the silent treatment should only apply to DSM_OP_CREATE? Oh ... scratch that, it *does* only apply to DSM_OP_CREATE. The lack of any other message before the 'could not map' failure must, then, mean that dsm_attach() couldn't find an entry in shared memory that it wanted to attach to. But how could that happen? regards, tom lane
On Thu, Jun 15, 2017 at 3:31 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I wrote: >> But surely the silent treatment should only apply to DSM_OP_CREATE? > > Oh ... scratch that, it *does* only apply to DSM_OP_CREATE. > > The lack of any other message before the 'could not map' failure must, > then, mean that dsm_attach() couldn't find an entry in shared memory > that it wanted to attach to. >

Yes, I also think so.

> But how could that happen? >

I can think of a few reasons why it could happen:

(a) the value of the segment handle passed by the master backend to the worker backend somehow got messed up.
(b) all other workers, along with the master backend, exited before one of the workers tried to attach.
(c) the master backend never actually created a dsm segment (e.g. when the maximum number of segments has been reached) but still invoked some workers.
(d) some corner-case bug in the dsm code due to which it can't attach to a valid segment handle.

Now, of these, I have checked that (c) can't happen because we ensure that if the segment is not created then we set the number of workers to zero. I think (b) shouldn't happen because we wait for all workers to exit before the query is finished. I think (a) and (d) are slightly related; I have looked around in the relevant code but didn't find any obvious problem. However, it seems to me that it might be something that behaves differently in the Cygwin environment. For example, I think the value of seg->handle can be different from what we expect in dsm_create. Basically, random() returns long and we are storing it in dsm_handle (uint32); since long is 8 bytes on Cygwin [1] and 4 bytes on Windows, the value could wrap, but even if that happens it is not clear how that can cause what we are seeing in this case.

[1] - https://stackoverflow.com/questions/384502/what-is-the-bit-size-of-long-on-64-bit-windows

-- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
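For reference, the truncation being described looks like this in isolation. This is a standalone illustration, not PostgreSQL code; the only assumption carried over is that dsm_handle is a uint32, per dsm_impl.h.

#include <stdint.h>
#include <stdio.h>

typedef uint32_t dsm_handle;    /* mirrors PostgreSQL's typedef */

int
main(void)
{
    /* A value this large only fits where long is 64 bits, as on Cygwin. */
    long        r = 0x123456789AL;
    dsm_handle  h = (dsm_handle) r;     /* silently keeps the low 32 bits */

    printf("long value: %lx, stored handle: %x\n", r, (unsigned) h);
    return 0;
}

Even so, as the message above says, it is not clear how a wrapped handle alone would produce the observed symptoms.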
On Wed, Jun 14, 2017 at 6:01 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I wrote: >> But surely the silent treatment should only apply to DSM_OP_CREATE? > > Oh ... scratch that, it *does* only apply to DSM_OP_CREATE. > > The lack of any other message before the 'could not map' failure must, > then, mean that dsm_attach() couldn't find an entry in shared memory > that it wanted to attach to. But how could that happen? Well, as Amit points out, there are entirely legitimate ways for that to happen. If the leader finishes the whole query itself before the worker reaches the dsm_attach() call, it will call dsm_detach(), destroying the segment, and the worker will hit this ERROR. That shouldn't happen very often in the real world, because we ought not to select a parallel plan in the first place unless the query is going to take a while to run, but the select_parallel test quite deliberately disarms all of the guards that would tend to discourage such plans. Of course, as Amit also points out, it could also be the result of some bug, but I'm not sure we have any reason to think so. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Wed, Jun 14, 2017 at 6:01 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> The lack of any other message before the 'could not map' failure must, >> then, mean that dsm_attach() couldn't find an entry in shared memory >> that it wanted to attach to. But how could that happen? > Well, as Amit points out, there are entirely legitimate ways for that > to happen. If the leader finishes the whole query itself before the > worker reaches the dsm_attach() call, it will call dsm_detach(), > destroying the segment, and the worker will hit this ERROR. That > shouldn't happen very often in the real world, because we ought not to > select a parallel plan in the first place unless the query is going to > take a while to run, but the select_parallel test quite deliberately > disarms all of the guards that would tend to discourage such plans. But we know, from the subsequent failed assertion, that the leader was still trying to launch parallel workers. So that particular theory doesn't hold water. > Of course, as Amit also points out, it could also be the result of > some bug, but I'm not sure we have any reason to think so. The fact that we've only seen this on cygwin leads the mind in the direction of platform-specific problems. Both this case and lorikeet's earlier symptoms could be explained if the parameters passed from leader to workers somehow got corrupted occasionally; so that's what I've been thinking about, but I'm not seeing anything. Would someone confirm my recollection that the cygwin build does *not* use EXEC_BACKEND, but relies on a cygwin-provided emulation of fork()? How does that emulation work, anyway? regards, tom lane
On Thu, Jun 15, 2017 at 10:05 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Well, as Amit points out, there are entirely legitimate ways for that >> to happen. If the leader finishes the whole query itself before the >> worker reaches the dsm_attach() call, it will call dsm_detach(), >> destroying the segment, and the worker will hit this ERROR. That >> shouldn't happen very often in the real world, because we ought not to >> select a parallel plan in the first place unless the query is going to >> take a while to run, but the select_parallel test quite deliberately >> disarms all of the guards that would tend to discourage such plans. > > But we know, from the subsequent failed assertion, that the leader was > still trying to launch parallel workers. So that particular theory > doesn't hold water. Is there any chance that it's already trying to launch parallel workers for the *next* query? >> Of course, as Amit also points out, it could also be the result of >> some bug, but I'm not sure we have any reason to think so. > > The fact that we've only seen this on cygwin leads the mind in the > direction of platform-specific problems. Both this case and lorikeet's > earlier symptoms could be explained if the parameters passed from leader > to workers somehow got corrupted occasionally; so that's what I've been > thinking about, but I'm not seeing anything. Could be -- but it could also be timing-related. If we are in fact using cygwin's fork emulation, the documentation for it explains that it's slow: https://www.cygwin.com/faq.html#faq.api.fork Interestingly, it also mentions that making it work requires suspending the parent while the child is starting up, which probably does not happen on any other platform. Of course it also makes my theory that the child doesn't reach dsm_attach() before the parent finishes the query pretty unlikely. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Jun 15, 2017 at 7:42 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Jun 15, 2017 at 10:05 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> Well, as Amit points out, there are entirely legitimate ways for that >>> to happen. If the leader finishes the whole query itself before the >>> worker reaches the dsm_attach() call, it will call dsm_detach(), >>> destroying the segment, and the worker will hit this ERROR. That >>> shouldn't happen very often in the real world, because we ought not to >>> select a parallel plan in the first place unless the query is going to >>> take a while to run, but the select_parallel test quite deliberately >>> disarms all of the guards that would tend to discourage such plans. >> >> But we know, from the subsequent failed assertion, that the leader was >> still trying to launch parallel workers. So that particular theory >> doesn't hold water. > > Is there any chance that it's already trying to launch parallel > workers for the *next* query? >

Yes, I think it is for the next query. If you refer to the log below from lorikeet:

2017-06-13 16:44:57.179 EDT [59404ec6.2758:63] LOG: statement: EXPLAIN (analyze, timing off, summary off, costs off) SELECT * FROM tenk1;
2017-06-13 16:44:57.247 EDT [59404ec9.2e78:1] ERROR: could not map dynamic shared memory segment
2017-06-13 16:44:57.248 EDT [59404dec.2d9c:5] LOG: worker process: parallel worker for PID 10072 (PID 11896) exited with exit code 1
2017-06-13 16:44:57.273 EDT [59404ec6.2758:64] LOG: statement: select stringu1::int2 from tenk1 where unique1 = 1;
TRAP: FailedAssertion("!(BackgroundWorkerData->parallel_register_count - BackgroundWorkerData->parallel_terminate_count <= 1024)", File: "/home/andrew/bf64/root/HEAD/pgsql.build/../pgsql/src/backend/postmaster/bgworker.c", Line: 974)
2017-06-13 16:45:02.652 EDT [59404dec.2d9c:6] LOG: server process (PID 10072) was terminated by signal 6: Aborted
2017-06-13 16:45:02.652 EDT [59404dec.2d9c:7] DETAIL: Failed process was running: select stringu1::int2 from tenk1 where unique1 = 1;
2017-06-13 16:45:02.652 EDT [59404dec.2d9c:8] LOG: terminating any other active server processes

The error "could not map dynamic shared memory segment" is due to the query "EXPLAIN .. SELECT * FROM tenk1", and the assertion failure is due to another statement, "select stringu1::int2 from tenk1 where unique1 = 1;".

-- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes: > On Thu, Jun 15, 2017 at 10:05 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> But we know, from the subsequent failed assertion, that the leader was >> still trying to launch parallel workers. So that particular theory >> doesn't hold water. > Is there any chance that it's already trying to launch parallel > workers for the *next* query?

Oh! Yeah, you might be right, because the trace includes a statement LOG entry from the leader in between:

2017-06-13 16:44:57.179 EDT [59404ec6.2758:63] LOG: statement: EXPLAIN (analyze, timing off, summary off, costs off) SELECT * FROM tenk1;
2017-06-13 16:44:57.247 EDT [59404ec9.2e78:1] ERROR: could not map dynamic shared memory segment
2017-06-13 16:44:57.248 EDT [59404dec.2d9c:5] LOG: worker process: parallel worker for PID 10072 (PID 11896) exited with exit code 1
2017-06-13 16:44:57.273 EDT [59404ec6.2758:64] LOG: statement: select stringu1::int2 from tenk1 where unique1 = 1;
TRAP: FailedAssertion("!(BackgroundWorkerData->parallel_register_count - BackgroundWorkerData->parallel_terminate_count <= 1024)", File: "/home/andrew/bf64/root/HEAD/pgsql.build/../pgsql/src/backend/postmaster/bgworker.c", Line: 974)
2017-06-13 16:45:02.652 EDT [59404dec.2d9c:6] LOG: server process (PID 10072) was terminated by signal 6: Aborted

It's fairly hard to read this other than as telling us that the worker was launched for the EXPLAIN (although really? why aren't we skipping that if EXEC_FLAG_EXPLAIN_ONLY?), and then we see a new LOG entry for the next statement before the leader hits its assertion failure.

> Could be -- but it could also be timing-related. If we are in fact > using cygwin's fork emulation, the documentation for it explains that > it's slow: https://www.cygwin.com/faq.html#faq.api.fork > Interestingly, it also mentions that making it work requires > suspending the parent while the child is starting up, which probably > does not happen on any other platform. Of course it also makes my > theory that the child doesn't reach dsm_attach() before the parent > finishes the query pretty unlikely.

Well, if this was a worker launched during InitPlan() for an EXPLAIN, the leader would have shut down the query almost immediately after launching the worker. So it does fit pretty well as long as you're willing to believe that the leader got to run before the child. But what this theory doesn't explain is: why haven't we seen this before? It now seems like it ought to come up often, since there are several EXPLAINs for parallel queries in that test.

regards, tom lane
On Thu, Jun 15, 2017 at 10:32 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Thu, Jun 15, 2017 at 10:05 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> But we know, from the subsequent failed assertion, that the leader was >>> still trying to launch parallel workers. So that particular theory >>> doesn't hold water. > >> Is there any chance that it's already trying to launch parallel >> workers for the *next* query? > > Oh! Yeah, you might be right, because the trace includes a statement > LOG entry from the leader in between:
>
> 2017-06-13 16:44:57.179 EDT [59404ec6.2758:63] LOG: statement: EXPLAIN (analyze, timing off, summary off, costs off) SELECT * FROM tenk1;
> 2017-06-13 16:44:57.247 EDT [59404ec9.2e78:1] ERROR: could not map dynamic shared memory segment
> 2017-06-13 16:44:57.248 EDT [59404dec.2d9c:5] LOG: worker process: parallel worker for PID 10072 (PID 11896) exited with exit code 1
> 2017-06-13 16:44:57.273 EDT [59404ec6.2758:64] LOG: statement: select stringu1::int2 from tenk1 where unique1 = 1;
> TRAP: FailedAssertion("!(BackgroundWorkerData->parallel_register_count - BackgroundWorkerData->parallel_terminate_count <= 1024)", File: "/home/andrew/bf64/root/HEAD/pgsql.build/../pgsql/src/backend/postmaster/bgworker.c", Line: 974)
> 2017-06-13 16:45:02.652 EDT [59404dec.2d9c:6] LOG: server process (PID 10072) was terminated by signal 6: Aborted
>
> It's fairly hard to read this other than as telling us that the worker was > launched for the EXPLAIN (although really? why aren't we skipping that if > EXEC_FLAG_EXPLAIN_ONLY?), ...

Uh, because ANALYZE was used? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Thu, Jun 15, 2017 at 10:32 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> It's fairly hard to read this other than as telling us that the worker was >> launched for the EXPLAIN (although really? why aren't we skipping that if >> EXEC_FLAG_EXPLAIN_ONLY?), ... > Uh, because ANALYZE was used? ... er, -ENOCAFFEINE. Nonetheless, there are no checks of EXEC_FLAG_EXPLAIN_ONLY in any parallel-query code, so I think a bet is being missed somewhere. regards, tom lane
On Thu, Jun 15, 2017 at 10:38 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Thu, Jun 15, 2017 at 10:32 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> It's fairly hard to read this other than as telling us that the worker was >>> launched for the EXPLAIN (although really? why aren't we skipping that if >>> EXEC_FLAG_EXPLAIN_ONLY?), ... > >> Uh, because ANALYZE was used? > > ... er, -ENOCAFFEINE. Nonetheless, there are no checks of > EXEC_FLAG_EXPLAIN_ONLY in any parallel-query code, so I think > a bet is being missed somewhere. ExecGather() is where workers get launched, and that ain't happening if EXEC_FLAG_EXPLAIN_ONLY is set, unless I am *very* confused about how this works. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Thu, Jun 15, 2017 at 10:38 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> ... er, -ENOCAFFEINE. Nonetheless, there are no checks of >> EXEC_FLAG_EXPLAIN_ONLY in any parallel-query code, so I think >> a bet is being missed somewhere. > ExecGather() is where workers get launched, and that ain't happening > if EXEC_FLAG_EXPLAIN_ONLY is set, unless I am *very* confused about > how this works. Ah, you're right, it's done during first execution not during InitPlan. Nevermind ... regards, tom lane
On Thu, Jun 15, 2017 at 10:21 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > Yes, I think it is for the next query. If you refer to the log below from lorikeet:
>
> 2017-06-13 16:44:57.179 EDT [59404ec6.2758:63] LOG: statement: EXPLAIN (analyze, timing off, summary off, costs off) SELECT * FROM tenk1;
> 2017-06-13 16:44:57.247 EDT [59404ec9.2e78:1] ERROR: could not map dynamic shared memory segment
> 2017-06-13 16:44:57.248 EDT [59404dec.2d9c:5] LOG: worker process: parallel worker for PID 10072 (PID 11896) exited with exit code 1
> 2017-06-13 16:44:57.273 EDT [59404ec6.2758:64] LOG: statement: select stringu1::int2 from tenk1 where unique1 = 1;
> TRAP: FailedAssertion("!(BackgroundWorkerData->parallel_register_count - BackgroundWorkerData->parallel_terminate_count <= 1024)", File: "/home/andrew/bf64/root/HEAD/pgsql.build/../pgsql/src/backend/postmaster/bgworker.c", Line: 974)
> 2017-06-13 16:45:02.652 EDT [59404dec.2d9c:6] LOG: server process (PID 10072) was terminated by signal 6: Aborted
> 2017-06-13 16:45:02.652 EDT [59404dec.2d9c:7] DETAIL: Failed process was running: select stringu1::int2 from tenk1 where unique1 = 1;
> 2017-06-13 16:45:02.652 EDT [59404dec.2d9c:8] LOG: terminating any other active server processes
>
> The error "could not map dynamic shared memory segment" is due to the query > "EXPLAIN .. SELECT * FROM tenk1", and the assertion failure is due to > another statement, "select stringu1::int2 from tenk1 where unique1 = 1;".

I think you're right. So here's a theory:

1. The ERROR mapping the DSM segment is just a case of the worker losing a race, and isn't a bug.
2. But when that happens, parallel_terminate_count is getting bumped twice for some reason.
3. So then the leader process fails that assertion when it tries to launch the parallel workers for the next query.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > I think you're right. So here's a theory: > 1. The ERROR mapping the DSM segment is just a case of the worker > losing a race, and isn't a bug.

I concur that this is a possibility, but if we expect this to happen, seems like there should be other occurrences in the buildfarm logs. I trolled the last three months worth of check/installcheck logs (all runs not just failures), and could find exactly two cases of "could not map dynamic shared memory segment":

 sysname  |    branch     |      snapshot       |     stage      | l
----------+---------------+---------------------+----------------+---------------------------------------------------------------------------------------------------
 lorikeet | REL9_6_STABLE | 2017-05-03 10:21:31 | Check          | 2017-05-03 06:27:32.626 EDT [5909b094.1e28:1] ERROR: could not map dynamic shared memory segment
 lorikeet | HEAD          | 2017-06-13 20:28:33 | InstallCheck-C | 2017-06-13 16:44:57.247 EDT [59404ec9.2e78:1] ERROR: could not map dynamic shared memory segment
(2 rows)

Now maybe this can be explained away by saying that the worker never loses the race unless it's subject to cygwin's unusually slow fork() emulation, but somehow I doubt that. For one thing, it's not clear why that path would be slower than EXEC_BACKEND, which would also involve populating a new process image from scratch.

BTW, that 9.6 failure is worth studying because it looks quite a bit different from the one on HEAD. It looks like the worker failed to launch and then the leader got hung up waiting for the worker. Eventually other stuff started failing because the select_parallel test is holding an exclusive lock on tenk1 throughout its session. (Does it really need to do that ALTER TABLE?)

> 2. But when that happens, parallel_terminate_count is getting bumped > twice for some reason. > 3. So then the leader process fails that assertion when it tries to > launch the parallel workers for the next query.

It seems like this has to trace to some sort of logic error in the postmaster that's allowing it to mess up parallel_terminate_count, but I'm not managing to construct a plausible flow of control that would cause that.

regards, tom lane
I wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> I think you're right. So here's a theory: >> 1. The ERROR mapping the DSM segment is just a case of the worker >> losing a race, and isn't a bug. > I concur that this is a possibility, Actually, no, it isn't. I tried to reproduce the problem by inserting a sleep into ParallelWorkerMain, and could not. After digging around in the code, I realize that the leader process *can not* exit the parallel query before the workers start, at least not without hitting an error first, which is not happening in these examples. The reason is that nodeGather cannot deem the query done until it's seen EOF on each tuple queue, which it cannot see until each worker has attached to and then detached from the associated shm_mq. (BTW, this also means that the leader is frozen solid if a worker process fails to start, but we knew that already.) So we still don't know why lorikeet is sometimes reporting "could not map dynamic shared memory segment". It's clear though that once that happens, the current code has no prayer of recovering cleanly. It looks from lorikeet's logs like there is something that is forcing a timeout via crash after ~150 seconds, but I do not know what that is. regards, tom lane
On Thu, Jun 15, 2017 at 5:06 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I wrote: >> Robert Haas <robertmhaas@gmail.com> writes: >>> I think you're right. So here's a theory: > >>> 1. The ERROR mapping the DSM segment is just a case of the worker the >>> losing a race, and isn't a bug. > >> I concur that this is a possibility, > > Actually, no, it isn't. I tried to reproduce the problem by inserting > a sleep into ParallelWorkerMain, and could not. After digging around > in the code, I realize that the leader process *can not* exit the > parallel query before the workers start, at least not without hitting > an error first, which is not happening in these examples. The reason > is that nodeGather cannot deem the query done until it's seen EOF on > each tuple queue, which it cannot see until each worker has attached > to and then detached from the associated shm_mq. Oh. That's sad. It definitely has to wait for any tuple queues that have been attached to be detached, but it would be better if it didn't have to wait for processes that haven't even attached yet. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Thu, Jun 15, 2017 at 5:06 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> ... nodeGather cannot deem the query done until it's seen EOF on >> each tuple queue, which it cannot see until each worker has attached >> to and then detached from the associated shm_mq. > Oh. That's sad. It definitely has to wait for any tuple queues that > have been attached to be detached, but it would be better if it didn't > have to wait for processes that haven't even attached yet. Hm. We assume they attach before they start taking any of the query work? Seems reasonable, and this would give us some chance of recovering from worker fork failure. regards, tom lane
On Thu, Jun 15, 2017 at 5:16 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Thu, Jun 15, 2017 at 5:06 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> ... nodeGather cannot deem the query done until it's seen EOF on >>> each tuple queue, which it cannot see until each worker has attached >>> to and then detached from the associated shm_mq. > >> Oh. That's sad. It definitely has to wait for any tuple queues that >> have been attached to be detached, but it would be better if it didn't >> have to wait for processes that haven't even attached yet. > > Hm. We assume they attach before they start taking any of the query > work? Seems reasonable, and this would give us some chance of recovering > from worker fork failure. Yeah, something like that. I'm not sure exactly how to implement it, though. I think I intended for it to work that way all along, but the code's not there. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
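One possible shape for that, sketched purely for illustration (the helper is hypothetical and assumes the leader still has the BackgroundWorkerHandle for each launched worker; it is not a description of what was eventually committed):

#include "postgres.h"
#include "postmaster/bgworker.h"
#include "storage/shm_mq.h"

/*
 * Hypothetical check: has this worker either attached to its tuple queue,
 * or died/never started so that it never will?
 */
static bool
worker_attached_or_gone(BackgroundWorkerHandle *handle, shm_mq *mq)
{
    pid_t       pid;
    BgwHandleStatus status = GetBackgroundWorkerPid(handle, &pid);

    if (shm_mq_get_sender(mq) != NULL)
        return true;            /* attached; EOF will arrive eventually */
    if (status == BGWH_STOPPED || status == BGWH_POSTMASTER_DIED)
        return true;            /* will never attach; stop waiting for it */
    return false;               /* still starting up; keep waiting */
}

With something along these lines, the Gather shutdown path could stop waiting on a queue whose worker died before attaching, instead of waiting for an EOF that will never come.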
Today, lorikeet failed with a new variant on the bgworker start crash: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2017-06-21%2020%3A29%3A10 This one is even more exciting than the last one, because it sure looks like the crashing bgworker took the postmaster down with it. That is Not Supposed To Happen. Wondering if we broke something here recently, I tried to reproduce it on a Linux machine by adding a randomized Assert failure in shm_mq_set_sender. I don't see any such problem, even with EXEC_BACKEND; we recover from the crash as-expected. So I'm starting to get a distinct feeling that there's something wrong with the cygwin port. But I dunno what. regards, tom lane
On 06/21/2017 06:44 PM, Tom Lane wrote: > Today, lorikeet failed with a new variant on the bgworker start crash: > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2017-06-21%2020%3A29%3A10 > > This one is even more exciting than the last one, because it sure looks > like the crashing bgworker took the postmaster down with it. That is > Not Supposed To Happen. > > Wondering if we broke something here recently, I tried to reproduce it > on a Linux machine by adding a randomized Assert failure in > shm_mq_set_sender. I don't see any such problem, even with EXEC_BACKEND; > we recover from the crash as-expected. > > So I'm starting to get a distinct feeling that there's something wrong > with the cygwin port. But I dunno what. > > :-( Please let me know if there are tests I can run. I missed your earlier request in this thread, sorry about that. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes: > Please let me know if there are tests I can run. I missed your earlier > request in this thread, sorry about that. That earlier request is still valid. Also, if you can reproduce the symptom that lorikeet just showed and get a stack trace from the (hypothetical) postmaster core dump, that would be hugely valuable. regards, tom lane
On Thu, Jun 22, 2017 at 7:54 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes: >> Please let me know if there are tests I can run. I missed your earlier >> request in this thread, sorry about that. > > That earlier request is still valid. > Yeah, that makes sense, and also maybe we can try to print the dsm_segment handle value both before launching the worker and in the parallel worker (ParallelWorkerMain). That might help us in identifying the reason for the error "could not map dynamic shared memory segment". -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On 06/22/2017 10:24 AM, Tom Lane wrote: > Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes: >> Please let me know if there are tests I can run. I missed your earlier >> request in this thread, sorry about that. > That earlier request is still valid. Also, if you can reproduce the > symptom that lorikeet just showed and get a stack trace from the > (hypothetical) postmaster core dump, that would be hugely valuable. > > See attached log and stacktrace cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes: > On 06/22/2017 10:24 AM, Tom Lane wrote: >> That earlier request is still valid. Also, if you can reproduce the >> symptom that lorikeet just showed and get a stack trace from the >> (hypothetical) postmaster core dump, that would be hugely valuable. > See attached log and stacktrace The stacktrace seems to be from the parallel-query-leader backend. Was there another core file? The lack of any indication in the postmaster log that the postmaster saw the parallel worker's crash sure looks the same as lorikeet's failure. But if the postmaster didn't dump core, then we're still at a loss as to why it didn't respond. regards, tom lane
On 06/23/2017 12:11 AM, Tom Lane wrote: > Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes: >> On 06/22/2017 10:24 AM, Tom Lane wrote: >>> That earlier request is still valid. Also, if you can reproduce the >>> symptom that lorikeet just showed and get a stack trace from the >>> (hypothetical) postmaster core dump, that would be hugely valuable. >> See attached log and stacktrace > The stacktrace seems to be from the parallel-query-leader backend. > Was there another core file? > > The lack of any indication in the postmaster log that the postmaster saw > the parallel worker's crash sure looks the same as lorikeet's failure. > But if the postmaster didn't dump core, then we're still at a loss as to > why it didn't respond. > > Rerunning with some different settings to see if I can get separate cores. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 06/23/2017 07:47 AM, Andrew Dunstan wrote: > > On 06/23/2017 12:11 AM, Tom Lane wrote: >> Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes: >>> On 06/22/2017 10:24 AM, Tom Lane wrote: >>>> That earlier request is still valid. Also, if you can reproduce the >>>> symptom that lorikeet just showed and get a stack trace from the >>>> (hypothetical) postmaster core dump, that would be hugely valuable. >>> See attached log and stacktrace >> The stacktrace seems to be from the parallel-query-leader backend. >> Was there another core file? >> >> The lack of any indication in the postmaster log that the postmaster saw >> the parallel worker's crash sure looks the same as lorikeet's failure. >> But if the postmaster didn't dump core, then we're still at a loss as to >> why it didn't respond. >> >> > > > Rerunning with some different settings to see if I can get separate cores. > Numerous attempts to get core dumps following methods suggested in Google searches have failed. The latest one is just hanging. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Jun 23, 2017 at 9:12 AM, Andrew Dunstan <andrew.dunstan@2ndquadrant.com> wrote: > > > On 06/22/2017 10:24 AM, Tom Lane wrote: >> Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes: >>> Please let me know if there are tests I can run. I missed your earlier >>> request in this thread, sorry about that. >> That earlier request is still valid. Also, if you can reproduce the >> symptom that lorikeet just showed and get a stack trace from the >> (hypothetical) postmaster core dump, that would be hugely valuable. >> >> > > > See attached log and stacktrace > Is this all the log contents or is there something else? The attached log looks strange to me in the sense that the first worker that gets invoked is Parallel worker number 2 and it finds that somebody else has already set the sender handle. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On 06/26/2017 10:36 AM, Amit Kapila wrote: > On Fri, Jun 23, 2017 at 9:12 AM, Andrew Dunstan > <andrew.dunstan@2ndquadrant.com> wrote: >> >> On 06/22/2017 10:24 AM, Tom Lane wrote: >>> Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes: >>>> Please let me know if there are tests I can run. I missed your earlier >>>> request in this thread, sorry about that. >>> That earlier request is still valid. Also, if you can reproduce the >>> symptom that lorikeet just showed and get a stack trace from the >>> (hypothetical) postmaster core dump, that would be hugely valuable. >>> >>> >> >> See attached log and stacktrace >> > Is this all the log contents or is there something else? The attached > log looks strange to me in the sense that the first worker that gets > invoked is Parallel worker number 2 and it finds that somebody else > has already set the sender handle. > No, it's the end of the log file. I can rerun and get the whole log file if you like. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes: > On 06/23/2017 07:47 AM, Andrew Dunstan wrote: >> Rerunning with some different settings to see if I can get separate cores. > Numerous attempts to get core dumps following methods suggested in > Google searches have failed. The latest one is just hanging. Well, if it's hung, maybe you could attach to the processes with gdb and get stack traces manually? regards, tom lane
On 06/26/2017 10:45 AM, Tom Lane wrote: > Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes: >> On 06/23/2017 07:47 AM, Andrew Dunstan wrote: >>> Rerunning with some different settings to see if I can get separate cores. >> Numerous attempts to get core dumps following methods suggested in >> Google searches have failed. The latest one is just hanging. > Well, if it's hung, maybe you could attach to the processes with gdb and > get stack traces manually? > > In theory what I have set up is supposed to be doing that, but I'll try. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Jun 26, 2017 at 8:09 PM, Andrew Dunstan <andrew.dunstan@2ndquadrant.com> wrote: > > > On 06/26/2017 10:36 AM, Amit Kapila wrote: >> On Fri, Jun 23, 2017 at 9:12 AM, Andrew Dunstan >> <andrew.dunstan@2ndquadrant.com> wrote: >>> >>> On 06/22/2017 10:24 AM, Tom Lane wrote: >>>> Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes: >>>>> Please let me know if there are tests I can run. I missed your earlier >>>>> request in this thread, sorry about that. >>>> That earlier request is still valid. Also, if you can reproduce the >>>> symptom that lorikeet just showed and get a stack trace from the >>>> (hypothetical) postmaster core dump, that would be hugely valuable. >>>> >>>> >>> >>> See attached log and stacktrace >>> >> Is this all the log contents or is there something else? The attached >> log looks strange to me in the sense that the first worker that gets >> invoked is Parallel worker number 2 and it finds that somebody else >> has already set the sender handle. >> > > > > No, it's the end of the log file. I can rerun and get the whole log file > if you like. > Okay, if possible, please share the same. Another way to get better information is if we change the code of shm_mq_set_sender such that it will hang if we hit the Assertion condition. Once it hangs we can attach a debugger and try to get some information. Basically, just change the code of shm_mq_set_sender as below or something like that: void shm_mq_set_sender(shm_mq *mq, PGPROC *proc) { volatile shm_mq *vmq = mq; PGPROC *receiver; SpinLockAcquire(&mq->mq_mutex); if (vmq->mq_sender != NULL) { while(1) { } } If we are able to hit the above code, then we can print the values of mq_sender, especially pid, and see if the pid is of the current backend. In theory, it should be of a different backend, as this is the first time we are setting the mq_sender. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
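For reference, a filled-out version of the change Amit sketches might look like the following. The tail of the function (recording the sender and waking the receiver) is reconstructed from the shm_mq.c of that era rather than copied from this thread, so treat it as a sketch and compare it against the actual source before using it:

void
shm_mq_set_sender(shm_mq *mq, PGPROC *proc)
{
    volatile shm_mq *vmq = mq;
    PGPROC     *receiver;

    SpinLockAcquire(&mq->mq_mutex);

    /*
     * Instead of Assert(vmq->mq_sender == NULL), park here when a sender
     * is already set, so a debugger can be attached and vmq->mq_sender
     * (in particular its pid) inspected while everything is still alive.
     */
    if (vmq->mq_sender != NULL)
    {
        while (1)
            ;                   /* wait for gdb to attach */
    }

    vmq->mq_sender = proc;
    receiver = vmq->mq_receiver;
    SpinLockRelease(&mq->mq_mutex);

    if (receiver != NULL)
        SetLatch(&receiver->procLatch);
}

The point of spinning rather than aborting is that the shared memory segment and the competing worker stay around until someone attaches, which is exactly the state the failed assertion would otherwise tear down.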
On Wed, Jun 21, 2017 at 06:44:09PM -0400, Tom Lane wrote: > Today, lorikeet failed with a new variant on the bgworker start crash: > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2017-06-21%2020%3A29%3A10 > > This one is even more exciting than the last one, because it sure looks > like the crashing bgworker took the postmaster down with it. That is > Not Supposed To Happen. > > Wondering if we broke something here recently, I tried to reproduce it > on a Linux machine by adding a randomized Assert failure in > shm_mq_set_sender. I don't see any such problem, even with EXEC_BACKEND; > we recover from the crash as-expected. > > So I'm starting to get a distinct feeling that there's something wrong > with the cygwin port. But I dunno what. I think signal blocking broke on Cygwin. On a system (gcc 5.4.0, CYGWIN_NT-10.0 2.7.0(0.306/5/3) 2017-02-12 13:18 x86_64) that reproduces lorikeet's symptoms, I instrumented the postmaster as attached. The patch's small_parallel.sql is a subset of select_parallel.sql sufficient to reproduce the mq_sender Assert failure and the postmaster silent exit. (It occasionally needed hundreds of iterations to do so.) The parallel query normally starts four bgworkers; when the mq_sender Assert fired, the test had started five workers in response to four registrations. The postmaster.c instrumentation regularly detects sigusr1_handler() calls while another sigusr1_handler() is already on the stack: 6328 2017-08-02 07:25:42.788 GMT LOG: forbid signals @ sigusr1_handler 6328 2017-08-02 07:25:42.788 GMT DEBUG: saw slot-0 registration, want 0 6328 2017-08-02 07:25:42.788 GMT DEBUG: saw slot-0 registration, want 1 6328 2017-08-02 07:25:42.788 GMT DEBUG: slot 1 not yet registered 6328 2017-08-02 07:25:42.789 GMT DEBUG: registering background worker "parallel worker for PID 4776" (slot 1) 6328 2017-08-02 07:25:42.789 GMT DEBUG: saw slot-1 registration, want 2 6328 2017-08-02 07:25:42.789 GMT DEBUG: saw slot-0 registration, want 2 6328 2017-08-02 07:25:42.789 GMT DEBUG: slot 2 not yet registered 6328 2017-08-02 07:25:42.789 GMT DEBUG: registering background worker "parallel worker for PID 4776" (slot 2) 6328 2017-08-02 07:25:42.789 GMT DEBUG: saw slot-2 registration, want 3 6328 2017-08-02 07:25:42.789 GMT DEBUG: saw slot-1 registration, want 3 6328 2017-08-02 07:25:42.789 GMT DEBUG: saw slot-0 registration, want 3 6328 2017-08-02 07:25:42.789 GMT DEBUG: slot 3 not yet registered 6328 2017-08-02 07:25:42.789 GMT DEBUG: registering background worker "parallel worker for PID 4776" (slot 3) 6328 2017-08-02 07:25:42.789 GMT DEBUG: saw slot-3 registration, want 4 6328 2017-08-02 07:25:42.789 GMT DEBUG: saw slot-2 registration, want 4 6328 2017-08-02 07:25:42.789 GMT DEBUG: saw slot-1 registration, want 4 6328 2017-08-02 07:25:42.789 GMT DEBUG: saw slot-0 registration, want 4 6328 2017-08-02 07:25:42.789 GMT DEBUG: slot 4 not yet registered 6328 2017-08-02 07:25:42.789 GMT DEBUG: registering background worker "parallel worker for PID 4776" (slot 4) 6328 2017-08-02 07:25:42.789 GMT DEBUG: starting background worker process "parallel worker for PID 4776" 6328 2017-08-02 07:25:42.790 GMT LOG: forbid signals @ sigusr1_handler 6328 2017-08-02 07:25:42.790 GMT WARNING: signals already forbidden @ sigusr1_handler 6328 2017-08-02 07:25:42.790 GMT LOG: permit signals @ sigusr1_handler postmaster algorithms rely on the PG_SETMASK() calls preventing that. Without such protection, duplicate bgworkers are an understandable result. 
I caught several other assertions; the PMChildFlags failure is another case of duplicate postmaster children: 6 TRAP: FailedAssertion("!(entry->trans == ((void *)0))", File: "pgstat.c", Line: 871) 3 TRAP: FailedAssertion("!(PMSignalState->PMChildFlags[slot] == 1)", File: "pmsignal.c", Line: 229) 20 TRAP: FailedAssertion("!(RefCountErrors == 0)", File: "bufmgr.c", Line: 2523) 21 TRAP: FailedAssertion("!(vmq->mq_sender == ((void *)0))", File: "shm_mq.c", Line: 221) Also, got a few "select() failed in postmaster: Bad address". I suspect a Cygwin signals bug. I'll try to distill a self-contained test case for the Cygwin hackers. The lack of failures on buildfarm member brolga argues that older Cygwin is not affected.
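The instrumentation patch itself is in the attachment rather than the message body, but the reentrancy check behind the "forbid signals" / "permit signals" log lines above can be sketched roughly as follows; the function names and the use of elog are illustrative guesses, not the contents of the actual patch:

/* Illustrative reentrancy detector for the postmaster's signal handlers. */
static volatile sig_atomic_t signals_forbidden = 0;

static void
forbid_signals(const char *where)
{
    if (signals_forbidden)
        elog(WARNING, "signals already forbidden @ %s", where);
    else
        elog(LOG, "forbid signals @ %s", where);
    signals_forbidden = 1;
}

static void
permit_signals(const char *where)
{
    elog(LOG, "permit signals @ %s", where);
    signals_forbidden = 0;
}

With forbid_signals("sigusr1_handler") called right after the handler's PG_SETMASK(&BlockSig) and permit_signals("sigusr1_handler") just before the matching PG_SETMASK(&UnBlockSig), the WARNING should never appear if the mask is honored; the log excerpt above shows it appearing, which is how the nested sigusr1_handler() invocations were caught.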
On Wed, Aug 2, 2017 at 11:47 PM, Noah Misch <noah@leadboat.com> wrote: > postmaster algorithms rely on the PG_SETMASK() calls preventing that. Without > such protection, duplicate bgworkers are an understandable result. I caught > several other assertions; the PMChildFlags failure is another case of > duplicate postmaster children: > > 6 TRAP: FailedAssertion("!(entry->trans == ((void *)0))", File: "pgstat.c", Line: 871) > 3 TRAP: FailedAssertion("!(PMSignalState->PMChildFlags[slot] == 1)", File: "pmsignal.c", Line: 229) > 20 TRAP: FailedAssertion("!(RefCountErrors == 0)", File: "bufmgr.c", Line: 2523) > 21 TRAP: FailedAssertion("!(vmq->mq_sender == ((void *)0))", File: "shm_mq.c", Line: 221) > Also, got a few "select() failed in postmaster: Bad address" > > I suspect a Cygwin signals bug. I'll try to distill a self-contained test > case for the Cygwin hackers. The lack of failures on buildfarm member brolga > argues that older Cygwin is not affected. Nice detective work. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Aug 03, 2017 at 10:45:50AM -0400, Robert Haas wrote: > On Wed, Aug 2, 2017 at 11:47 PM, Noah Misch <noah@leadboat.com> wrote: > > postmaster algorithms rely on the PG_SETMASK() calls preventing that. Without > > such protection, duplicate bgworkers are an understandable result. I caught > > several other assertions; the PMChildFlags failure is another case of > > duplicate postmaster children: > > > > 6 TRAP: FailedAssertion("!(entry->trans == ((void *)0))", File: "pgstat.c", Line: 871) > > 3 TRAP: FailedAssertion("!(PMSignalState->PMChildFlags[slot] == 1)", File: "pmsignal.c", Line: 229) > > 20 TRAP: FailedAssertion("!(RefCountErrors == 0)", File: "bufmgr.c", Line: 2523) > > 21 TRAP: FailedAssertion("!(vmq->mq_sender == ((void *)0))", File: "shm_mq.c", Line: 221) > > Also, got a few "select() failed in postmaster: Bad address" > > > > I suspect a Cygwin signals bug. I'll try to distill a self-contained test > > case for the Cygwin hackers. The lack of failures on buildfarm member brolga > > argues that older Cygwin is not affected. > > Nice detective work. Thanks. http://marc.info/?t=150183296400001 has my upstream report. The Cygwin project lead reproduced this, but a fix remained elusive. I guess we'll ignore weird postmaster-associated lorikeet failures for the foreseeable future.
On Tue, Jun 22, 2021 at 5:21 PM Noah Misch <noah@leadboat.com> wrote: > On Thu, Aug 03, 2017 at 10:45:50AM -0400, Robert Haas wrote: > > On Wed, Aug 2, 2017 at 11:47 PM, Noah Misch <noah@leadboat.com> wrote: > > > postmaster algorithms rely on the PG_SETMASK() calls preventing that. Without > > > such protection, duplicate bgworkers are an understandable result. I caught > > > several other assertions; the PMChildFlags failure is another case of > > > duplicate postmaster children: > > > > > > 6 TRAP: FailedAssertion("!(entry->trans == ((void *)0))", File: "pgstat.c", Line: 871) > > > 3 TRAP: FailedAssertion("!(PMSignalState->PMChildFlags[slot] == 1)", File: "pmsignal.c", Line: 229) > > > 20 TRAP: FailedAssertion("!(RefCountErrors == 0)", File: "bufmgr.c", Line: 2523) > > > 21 TRAP: FailedAssertion("!(vmq->mq_sender == ((void *)0))", File: "shm_mq.c", Line: 221) > > > Also, got a few "select() failed in postmaster: Bad address" > > > > > > I suspect a Cygwin signals bug. I'll try to distill a self-contained test > > > case for the Cygwin hackers. The lack of failures on buildfarm member brolga > > > argues that older Cygwin is not affected. > > > > Nice detective work. > > Thanks. http://marc.info/?t=150183296400001 has my upstream report. The > Cygwin project lead reproduced this, but a fix remained elusive. > > I guess we'll ignore weird postmaster-associated lorikeet failures for the > foreseeable future. While reading a list of recent build farm assertion failures I learned that this is still broken in Cygwin 3.2, and eventually found my way back to this thread. I was wondering about suggesting some kind of official warning, but I guess the manual already covers it with this 10 year old notice. I don't know much about Windows or Cygwin so I'm not sure if it needs updating or not, but I would guess that there are no longer any such systems? <productname>Cygwin</productname> is not recommended for running a production server, and it should only be used for running on older versions of <productname>Windows</productname> where the native build does not work.
On Tue, Jun 22, 2021 at 05:52:03PM +1200, Thomas Munro wrote: > On Tue, Jun 22, 2021 at 5:21 PM Noah Misch <noah@leadboat.com> wrote: > > On Thu, Aug 03, 2017 at 10:45:50AM -0400, Robert Haas wrote: > > > On Wed, Aug 2, 2017 at 11:47 PM, Noah Misch <noah@leadboat.com> wrote: > > > > postmaster algorithms rely on the PG_SETMASK() calls preventing that. Without > > > > such protection, duplicate bgworkers are an understandable result. I caught > > > > several other assertions; the PMChildFlags failure is another case of > > > > duplicate postmaster children: > > > > > > > > 6 TRAP: FailedAssertion("!(entry->trans == ((void *)0))", File: "pgstat.c", Line: 871) > > > > 3 TRAP: FailedAssertion("!(PMSignalState->PMChildFlags[slot] == 1)", File: "pmsignal.c", Line: 229) > > > > 20 TRAP: FailedAssertion("!(RefCountErrors == 0)", File: "bufmgr.c", Line: 2523) > > > > 21 TRAP: FailedAssertion("!(vmq->mq_sender == ((void *)0))", File: "shm_mq.c", Line: 221) > > > > Also, got a few "select() failed in postmaster: Bad address" > > > > > > > > I suspect a Cygwin signals bug. I'll try to distill a self-contained test > > > > case for the Cygwin hackers. The lack of failures on buildfarm member brolga > > > > argues that older Cygwin is not affected. > > > > > > Nice detective work. > > > > Thanks. http://marc.info/?t=150183296400001 has my upstream report. The > > Cygwin project lead reproduced this, but a fix remained elusive. > > > > I guess we'll ignore weird postmaster-associated lorikeet failures for the > > foreseeable future. > > While reading a list of recent build farm assertion failures I learned that > this is still broken in Cygwin 3.2, and eventually found my way back > to this thread. Interesting. Which branch(es) showed you failures? I had wondered if the move to sa_mask (commit 9abb2bfc) would effectively end the problem in v13+. Perhaps the Cygwin bug pokes through even that. Perhaps the sa_mask conditionals need to be "#if defined(WIN32) && !defined(__CYGWIN__)" to help current buildfarm members. > I was wondering about suggesting some kind of > official warning, but I guess the manual already covers it with this > 10 year old notice. I don't know much about Windows or Cygwin so I'm > not sure if it needs updating or not, but I would guess that there are > no longer any such systems? > > <productname>Cygwin</productname> is not recommended for running a > production server, and it should only be used for running on > older versions of <productname>Windows</productname> where > the native build does not work. I expect native builds work on all Microsoft-supported Windows versions, so +1 for removing everything after the comma.
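For readers not steeped in the Windows ports, the preprocessor test Noah mentions has this shape; where exactly it would wrap the sa_mask setup is not spelled out here, so the comments are placeholders rather than a patch:

#if defined(WIN32) && !defined(__CYGWIN__)
    /* native Windows build */
#else
    /* Unix-like ports; Cygwin defines __CYGWIN__, so it falls through
     * to this branch even if WIN32 happens to be defined */
#endif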
On Tue, Jun 22, 2021 at 6:42 PM Noah Misch <noah@leadboat.com> wrote: > On Tue, Jun 22, 2021 at 05:52:03PM +1200, Thomas Munro wrote: > > While reading a list of recent build farm assertion failures I learned that > > this is still broken in Cygwin 3.2, and eventually found my way back > > to this thread. > > Interesting. Which branch(es) showed you failures? I had wondered if the > move to sa_mask (commit 9abb2bfc) would effectively end the problem in v13+. > Perhaps the Cygwin bug pokes through even that. Perhaps the sa_mask > conditionals need to be "#if defined(WIN32) && !defined(__CYGWIN__)" to help > current buildfarm members. I only have data going back about a month, but: REL_11_STABLE https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2021-05-31%2009:14:18 TRAP: FailedAssertion("!(mq->mq_sender == ((void *)0))", File: "/home/andrew/bf/root/REL_11_STABLE/pgsql.build/../pgsql/src/backend/storage/ipc/shm_mq.c", Line: 222) REL_13_STABLE https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2021-06-01%2010:12:10 TRAP: FailedAssertion("PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED", File: "/home/andrew/bf/root/REL_13_STABLE/pgsql.build/../pgsql/src/back
On 6/22/21 2:42 AM, Noah Misch wrote: > I was wondering about suggesting some kind of >> official warning, but I guess the manual already covers it with this >> 10 year old notice. I don't know much about Windows or Cygwin so I'm >> not sure if it needs updating or not, but I would guess that there are >> no longer any such systems? >> >> <productname>Cygwin</productname> is not recommended for running a >> production server, and it should only be used for running on >> older versions of <productname>Windows</productname> where >> the native build does not work. > I expect native builds work on all Microsoft-supported Windows versions, so +1 > for removing everything after the comma. > > If we just want to declare Cygwin unsupported I can halt lorikeet and we can be done with it. Brolga is on its last legs anyway, as we can't build on NT past release 10. It's worth pointing out that Cygwin psql is built with readline, which makes it rather nicer to use than the Native builds. We still have this in configure.ac: # readline on MinGW has problems with backslashes in psql and other bugs. # This is particularly a problem with non-US code pages. # Therefore disable its use until we understand the cause. 2004-07-20 if test "$PORTNAME" = "win32"; then if test "$with_readline" = yes; then AC_MSG_WARN([*** Readline does not work on MinGW --- disabling]) with_readline=no fi fi I have no idea if it's still true. My msys2 system (fairywren) has readline installed. Assuming I disable this piece of code and build with readline, how would I test for the errors reported in 2004? cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> writes: > It's worth pointing out that Cygwin psql is built with readline, which > makes it rather nicer to use than the Native builds. We still have this > in configure.ac: > > # readline on MinGW has problems with backslashes in psql and other bugs. > # This is particularly a problem with non-US code pages. > # Therefore disable its use until we understand the cause. 2004-07-20 > > I have no idea if it's still true. My msys2 system (fairywren) has > readline installed. Assuming I disable this piece of code and build with > readline, how would I test for the errors reported in 2004? A bit of archive excavation turned up various contemporaneous reports: https://www.postgresql.org/message-id/flat/6BCB9D8A16AC4241919521715F4D8BCE34BE7A%40algol.sollentuna.se https://www.postgresql.org/message-id/flat/ca3pfg%2427tt%241%40news.hub.org https://www.postgresql.org/message-id/flat/OFD583EA49.2F089C03-ONC1256E6A.002706B8-C1256E6A.002784FC%40flender.com https://www.postgresql.org/message-id/flat/OF585C42B8.14897FF3-ONC1256EBC.0042A26E-C1256EBC.0042E23F%40flender.com It's certainly possible that some or all of those problems are gone. regards, tom lane
On 6/22/21 10:41 AM, Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: >> It's worth pointing out that Cygwin psql is built with readline, which >> makes it rather nicer to use than the Native builds. We still have this >> in configure.ac: >> >> # readline on MinGW has problems with backslashes in psql and other bugs. >> # This is particularly a problem with non-US code pages. >> # Therefore disable its use until we understand the cause. 2004-07-20 >> >> I have no idea if it's still true. My msys2 system (fairywren) has >> readline installed. Assuming I disable this piece of code and build with >> readline, how would I test for the errors reported in 2004? > A bit of archive excavation turned up various contemporaneous reports: > > https://www.postgresql.org/message-id/flat/6BCB9D8A16AC4241919521715F4D8BCE34BE7A%40algol.sollentuna.se > https://www.postgresql.org/message-id/flat/ca3pfg%2427tt%241%40news.hub.org > https://www.postgresql.org/message-id/flat/OFD583EA49.2F089C03-ONC1256E6A.002706B8-C1256E6A.002784FC%40flender.com > https://www.postgresql.org/message-id/flat/OF585C42B8.14897FF3-ONC1256EBC.0042A26E-C1256EBC.0042E23F%40flender.com > > It's certainly possible that some or all of those problems are gone. > > Thanks, I will need more info than seems to be there though. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
On Tue, Jun 22, 2021 at 08:18:42AM -0400, Andrew Dunstan wrote: > On 6/22/21 2:42 AM, Noah Misch wrote: > > I was wondering about suggesting some kind of > >> official warning, but I guess the manual already covers it with this > >> 10 year old notice. I don't know much about Windows or Cygwin so I'm > >> not sure if it needs updating or not, but I would guess that there are > >> no longer any such systems? > >> > >> <productname>Cygwin</productname> is not recommended for running a > >> production server, and it should only be used for running on > >> older versions of <productname>Windows</productname> where > >> the native build does not work. > > I expect native builds work on all Microsoft-supported Windows versions, so +1 > > for removing everything after the comma. > > If we just want to declare Cygwin unsupported I can halt lorikeet and we > can be done with it. No way. lorikeet is valuable.
On 6/22/21 11:02 PM, Noah Misch wrote: > On Tue, Jun 22, 2021 at 08:18:42AM -0400, Andrew Dunstan wrote: >> On 6/22/21 2:42 AM, Noah Misch wrote: >>> I was wondering about suggesting some kind of >>>> official warning, but I guess the manual already covers it with this >>>> 10 year old notice. I don't know much about Windows or Cygwin so I'm >>>> not sure if it needs updating or not, but I would guess that there are >>>> no longer any such systems? >>>> >>>> <productname>Cygwin</productname> is not recommended for running a >>>> production server, and it should only be used for running on >>>> older versions of <productname>Windows</productname> where >>>> the native build does not work. >>> I expect native builds work on all Microsoft-supported Windows versions, so +1 >>> for removing everything after the comma. >> If we just want to declare Cygwin unsupported I can halt lorikeet and we >> can be done with it. > No way. lorikeet is valuable. OK, no problem. Leaving it running is not a big deal. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com