Thread: thread safety on clients
Hi, I compiled current HEAD and trying to use pgbench, i initialized a test database this way: bin/pgbench -i -F80 -s100 test and then run with this options: bin/pgbench -c 50 -j 5 -l -t 20 test and get this crash: """ starting vacuum...end. TRAP: FailedAssertion("!((data - start) == data_size)", File: "heaptuple.c", Line: 255) Client 0 aborted in state 8. Probably the backend died while processing. LOG: server process (PID 30713) was terminated by signal 6: Aborted TRAP: FailedAssertion("!((data - start) == data_size)", File: "heaptuple.c", Line: 255) Client 8 aborted in state 8. Probably the backend died while processing. LOG: terminating any other active server processes WARNING: terminating connection because of crash of another server process DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. """ if i remove the -j option then it runs without a problem -- Atentamente, Jaime Casanova Soporte y capacitación de PostgreSQL Asesoría y desarrollo de sistemas Guayaquil - Ecuador Cel. +59387171157
On ons, 2009-12-09 at 14:02 -0500, Jaime Casanova wrote: > Hi, > > I compiled current HEAD and trying to use pgbench, i initialized a > test database this way: > bin/pgbench -i -F80 -s100 test > > and then run with this options: > bin/pgbench -c 50 -j 5 -l -t 20 test > > and get this crash: > """ > starting vacuum...end. > TRAP: FailedAssertion("!((data - start) == data_size)", File: > "heaptuple.c", Line: 255) > Client 0 aborted in state 8. Probably the backend died while processing. > LOG: server process (PID 30713) was terminated by signal 6: Aborted > TRAP: FailedAssertion("!((data - start) == data_size)", File: > "heaptuple.c", Line: 255) > Client 8 aborted in state 8. Probably the backend died while processing. > LOG: terminating any other active server processes > WARNING: terminating connection because of crash of another server process > DETAIL: The postmaster has commanded this server process to roll back > the current transaction and exit, because another server process > exited abnormally and possibly corrupted shared memory. > """ > > if i remove the -j option then it runs without a problem Possibly related to the incomplete removal of the enable-thread-safety option that I just posted about.
Peter Eisentraut wrote: > > starting vacuum...end. > > TRAP: FailedAssertion("!((data - start) == data_size)", File: > > "heaptuple.c", Line: 255) > > Client 0 aborted in state 8. Probably the backend died while processing. > > LOG: server process (PID 30713) was terminated by signal 6: Aborted > > TRAP: FailedAssertion("!((data - start) == data_size)", File: > > "heaptuple.c", Line: 255) > > Client 8 aborted in state 8. Probably the backend died while processing. > > LOG: terminating any other active server processes > > WARNING: terminating connection because of crash of another server process > > DETAIL: The postmaster has commanded this server process to roll back > > the current transaction and exit, because another server process > > exited abnormally and possibly corrupted shared memory. > > """ > > > > if i remove the -j option then it runs without a problem > > Possibly related to the incomplete removal of the enable-thread-safety > option that I just posted about. I thought about that but I can't figure out how that would affect pgbench. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Bruce Momjian wrote: <blockquote cite="mid:200912110055.nBB0tFE29298@momjian.us" type="cite"><pre wrap="">Peter Eisentrautwrote: </pre><blockquote type="cite"><blockquote type="cite"><pre wrap="">if i remove the -j option then it runswithout a problem </pre></blockquote><pre wrap="">Possibly related to the incomplete removal of the enable-thread-safety option that I just posted about. </pre></blockquote><pre wrap=""> I thought about that but I can't figure out how that would affect pgbench. </pre></blockquote> The "-j" option is the recent addition to pgbench that causes it to launch multiple client threadswhen enabled, each handling a subset of the transactions. There's blocks of codes in pgbench.c now that depend onhaving sane values for thread safety in libpq. That it may be detecting the wrong thing and operating in an unsafe wayafter the recent change is what Peter's suggesting. This is good, actually, because I don't think we had many client-sidethread-safety tests floating around to catch problems in this area before.<br /><br /><pre class="moz-signature"cols="72">-- Greg Smith 2ndQuadrant Baltimore, MD PostgreSQL Training, Services and Support <a class="moz-txt-link-abbreviated" href="mailto:greg@2ndQuadrant.com">greg@2ndQuadrant.com</a> <a class="moz-txt-link-abbreviated"href="http://www.2ndQuadrant.com">www.2ndQuadrant.com</a> </pre>
Greg Smith wrote: > Bruce Momjian wrote: > > Peter Eisentraut wrote: > > > >>> if i remove the -j option then it runs without a problem > >>> > >> Possibly related to the incomplete removal of the enable-thread-safety > >> option that I just posted about. > >> > > > > I thought about that but I can't figure out how that would affect > > pgbench. > > > The "-j" option is the recent addition to pgbench that causes it to > launch multiple client threads when enabled, each handling a subset of > the transactions. There's blocks of codes in pgbench.c now that depend > on having sane values for thread safety in libpq. That it may be > detecting the wrong thing and operating in an unsafe way after the > recent change is what Peter's suggesting. This is good, actually, > because I don't think we had many client-side thread-safety tests > floating around to catch problems in this area before. I can reproduce the crash here so I can see if I can find the cause. However, the failure is happening in the _server_. Threading is unrelated to the server itself, only the client. I suppose the first test for me will be to test CVS before the thread change was made. The failure is in heap_fill_tuple(), and I am unclear how that assert could be getting triggered: CONTEXT: automatic analyze of table "test.public.pgbench_accounts"TRAP: FailedAssertion("!((data - start) == data_size)",File: "heaptuple.c", Line: 255)TRAP: FailedAssertion("!((data - start) == data_size)", File: "heaptuple.c", Line:255)TRAP: FailedAssertion("!((data - start) == data_size)", File: "heaptuple.c", Line: 255)LOG: server process (PID6076) was terminated by signal 6: Abort trapLOG: terminating any other active server processes -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Bruce Momjian wrote: > Greg Smith wrote: > > Bruce Momjian wrote: > > > Peter Eisentraut wrote: > > > > > >>> if i remove the -j option then it runs without a problem > > >>> > > >> Possibly related to the incomplete removal of the enable-thread-safety > > >> option that I just posted about. > > >> > > > > > > I thought about that but I can't figure out how that would affect > > > pgbench. > > > > > The "-j" option is the recent addition to pgbench that causes it to > > launch multiple client threads when enabled, each handling a subset of > > the transactions. There's blocks of codes in pgbench.c now that depend > > on having sane values for thread safety in libpq. That it may be > > detecting the wrong thing and operating in an unsafe way after the > > recent change is what Peter's suggesting. This is good, actually, > > because I don't think we had many client-side thread-safety tests > > floating around to catch problems in this area before. > > I can reproduce the crash here so I can see if I can find the cause. > > However, the failure is happening in the _server_. Threading is > unrelated to the server itself, only the client. I suppose the first > test for me will be to test CVS before the thread change was made. > > The failure is in heap_fill_tuple(), and I am unclear how that assert > could be getting triggered: > > CONTEXT: automatic analyze of table "test.public.pgbench_accounts" > TRAP: FailedAssertion("!((data - start) == data_size)", File: "heaptuple.c", Line: 255) > TRAP: FailedAssertion("!((data - start) == data_size)", File: "heaptuple.c", Line: 255) > TRAP: FailedAssertion("!((data - start) == data_size)", File: "heaptuple.c", Line: 255) > LOG: server process (PID 6076) was terminated by signal 6: Abort trap > LOG: terminating any other active server processes OK, turns out I did break threading by not defining ENABLE_THREAD_SAFETY in configure. I have applied a patch to CVS to fix this. However, the larger question is how did I crash a backend by not defining this? I guess it is possible for a client to crash the server by having a misconfigured client --- I wasn't aware of that. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Greg Smith <greg@2ndquadrant.com> writes: > The "-j" option is the recent addition to pgbench that causes it to > launch multiple client threads when enabled, each handling a subset of > the transactions. There's blocks of codes in pgbench.c now that depend > on having sane values for thread safety in libpq. That it may be > detecting the wrong thing and operating in an unsafe way after the > recent change is what Peter's suggesting. This is good, actually, > because I don't think we had many client-side thread-safety tests > floating around to catch problems in this area before. The report showed an assert inside the backend. It really doesn't matter *how* broken pgbench might be, it should not be able to cause that. My bet is that the real problem was a build inconsistency in the backend. Does "make distclean" and rebuild make it go away? regards, tom lane
On Thu, Dec 10, 2009 at 11:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > My bet is that the real problem was a build inconsistency in > the backend. Does "make distclean" and rebuild make it go away? > actually it was a clean build just after a cvs co (not an updated tree), i build the binaries and installed it in just created directory... i will try again now with the patch Bruce just committed -- Atentamente, Jaime Casanova Soporte y capacitación de PostgreSQL Asesoría y desarrollo de sistemas Guayaquil - Ecuador Cel. +59387171157
On Thu, Dec 10, 2009 at 11:33 PM, Jaime Casanova <jcasanov@systemguards.com.ec> wrote: > On Thu, Dec 10, 2009 at 11:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> >> My bet is that the real problem was a build inconsistency in >> the backend. Does "make distclean" and rebuild make it go away? >> > > actually it was a clean build just after a cvs co (not an updated > tree), i build the binaries and installed it in just created > directory... > i will try again now with the patch Bruce just committed > the problem has gone -- Atentamente, Jaime Casanova Soporte y capacitación de PostgreSQL Asesoría y desarrollo de sistemas Guayaquil - Ecuador Cel. +59387171157
Jaime Casanova wrote: > On Thu, Dec 10, 2009 at 11:33 PM, Jaime Casanova > <jcasanov@systemguards.com.ec> wrote: > > On Thu, Dec 10, 2009 at 11:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> > >> My bet is that the real problem was a build inconsistency in > >> the backend. Does "make distclean" and rebuild make it go away? > >> > > > > actually it was a clean build just after a cvs co (not an updated > > tree), i build the binaries and installed it in just created > > directory... > > i will try again now with the patch Bruce just committed > > the problem has gone Yes, but what if you test with the broken pgbench? As Tom says, it should not be able to crash the backend no matter what it does. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Alvaro Herrera <alvherre@commandprompt.com> writes: > Yes, but what if you test with the broken pgbench? As Tom says, it > should not be able to crash the backend no matter what it does. The crash is real --- I've replicated it here. Still trying to figure out what is the real cause. regards, tom lane
On 12/11/09, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Alvaro Herrera <alvherre@commandprompt.com> writes: > > Yes, but what if you test with the broken pgbench? As Tom says, it > > should not be able to crash the backend no matter what it does. > > > The crash is real --- I've replicated it here. Still trying to figure > out what is the real cause. Several threads writing to single connection? -- marko
I wrote: > The crash is real --- I've replicated it here. Still trying to figure > out what is the real cause. Okay, I've sussed it. The crash itself is due to a memory management mistake in the recently-rewritten EvalPlanQual code (pfree'ing a tuple that we still have live Datum references to). The reason the "broken" pgbench is able to trigger it is that it generates concurrent updates for the same row in pgbench_accounts with much higher frequency than normal. This forces the backends through the EvalPlanQual code, which is broken and crashes. QED. I'll commit a fix for that shortly, but this reminds me once again that the EvalPlanQual logic is desperately under-tested in our normal regression testing. We really need some kind of testing infrastructure that would let us exercise concurrent-update cases a bit more than "not at all". Also, the reason why Bruce's mistake exposed this is interesting. Omitting the #define for ENABLE_THREAD_SAFETY does not actually break "pgbench -j" at all -- it has a fallback strategy that uses multiple subprocesses instead of multiple threads. However, it has only one srandom() call, which occurs in the parent process before forking. This means that the subprocesses all start with the same random number state, which means they generate the same sequence of "random" account IDs to update, which is why the probability of an update collision requiring EvalPlanQual resolution is so high, especially at the start of the run (and the crash does happen pretty much immediately after starting pgbench). I'm inclined to think this is bad, and we should fix pgbench to re-randomize after forking. If we don't, we'll have synchronized behavior on some platforms and not others, which doesn't seem like a good idea. On the other hand, if you did want that type of behavior, it's hard to see how you'd get it. Is it worth trying to provide that as a (probably non-default) mode in pgbench? If so, how would you do it on threaded platforms? regards, tom lane
Tom Lane wrote: > Also, the reason why Bruce's mistake exposed this is interesting. > Omitting the #define for ENABLE_THREAD_SAFETY does not actually break > "pgbench -j" at all -- it has a fallback strategy that uses multiple > subprocesses instead of multiple threads. However, it has only one > srandom() call, which occurs in the parent process before forking. > This means that the subprocesses all start with the same random number > state, which means they generate the same sequence of "random" account > IDs to update We just got that as a bug report the other day too, with suggested fixes: http://archives.postgresql.org/pgsql-hackers/2009-12/msg00841.php > I'm inclined to think this is bad, and we should fix pgbench to > re-randomize after forking. If we don't, we'll have synchronized > behavior on some platforms and not others, which doesn't seem like a > good idea. On the other hand, if you did want that type of behavior, > it's hard to see how you'd get it. Is it worth trying to provide that > as a (probably non-default) mode in pgbench? If so, how would you > do it on threaded platforms? > It sounds like random pgbench run for a while would certainly expose the same thing you're concerned about eventually. I doubt it's worth the trouble to codify a bug just because it found another bug (it's hard enough to maintain code that only has to work right). If we want to stress this behavior, it would be easier to just test with a a bunch of clients going through a scale=1 database and use enough transactions to make update collisions likely. I'm working on a guide to stress testing new alpha builds with pgbench that will be ready in time for alpha3. I could easily add that as one of the suggested tests to run if you're concerned about getting more test coverage of that code path. -- Greg Smith 2ndQuadrant Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.com
Greg Smith <greg@2ndquadrant.com> writes: > It sounds like random pgbench run for a while would certainly expose the > same thing you're concerned about eventually. Yeah. Actually the odd thing about it is that the crash seemed to invariably be on conflicting pgbench_accounts updates, which is a fairly low-contention table in this test design (but the bug turned it into high-contention). What I would have expected is crashes on the very similar updates to pgbench_branches, which is designed to be high-contention. It might be that there is some other effect going on here that explains why that wasn't happening. Need to go back and look more closely. regards, tom lane
I wrote: > ... What I would have expected is crashes on the very > similar updates to pgbench_branches, which is designed to be > high-contention. It might be that there is some other effect going on > here that explains why that wasn't happening. Need to go back and look > more closely. ... and the answer to that is that pgbench_branches isn't subject to the bug, because its only pass-by-reference column happens to be filled with all NULLs by the initialization step, unlike the accounts filler column which happens to be filled with non-null strings. Null values mean no dangling pointers and no chance for a memory management issue. So you could have run this all day and never seen a crash on pgbench_branches updates. (If you manually set the filler column non-null before starting a run, the unpatched code crashes instantly, even with a non-bollixed pgbench.) So, nothing to see here except lack of test coverage, move along. regards, tom lane
On Fri, Dec 11, 2009 at 6:08 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I wrote: > I'll commit a fix for that shortly, but this reminds me once again that > the EvalPlanQual logic is desperately under-tested in our normal > regression testing. We really need some kind of testing infrastructure > that would let us exercise concurrent-update cases a bit more than "not > at all". If i went back and cleaned up the parallel psql patch we would be able to write tests which tested basic concurrent functionality such as transactions blocking when they're supposed to. But that wouldn't catch something like this I don't think, not unless it got triggered pretty reliably by a simple evalplanqual recheck. > I'm inclined to think this is bad, and we should fix pgbench to > re-randomize after forking. If we don't, we'll have synchronized > behavior on some platforms and not others, which doesn't seem like a > good idea. On the other hand, if you did want that type of behavior, > it's hard to see how you'd get it. Is it worth trying to provide that > as a (probably non-default) mode in pgbench? If so, how would you > do it on threaded platforms? Well it's pretty useless for benchmarking unless someone comes up with a different plan for resolving these kinds of conflicts and we wanted to compare the costs. But it's clearly something we should do for testing for precisely this kind of bug. EvalPlanQual could trigger all kinds of bugs throughout the executor and it would be worth having something like this to check for them. I believe it's possible to give at least one of the random number generating apis a preallocated buffer for it to use to store the generator state, that could easily be per-thread state along with the connection and other state. I'm not sure which api that was and whether it's as portable or as good a generator as what we're using now though. -- greg
Greg Stark <gsstark@mit.edu> writes: > On Fri, Dec 11, 2009 at 6:08 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I'll commit a fix for that shortly, but this reminds me once again that >> the EvalPlanQual logic is desperately under-tested in our normal >> regression testing. �We really need some kind of testing infrastructure >> that would let us exercise concurrent-update cases a bit more than "not >> at all". > If i went back and cleaned up the parallel psql patch we would be able > to write tests which tested basic concurrent functionality such as > transactions blocking when they're supposed to. But that wouldn't > catch something like this I don't think, not unless it got triggered > pretty reliably by a simple evalplanqual recheck. It would have been caught if anyone had tried an EvalPlanQual'd update on a table with one of the unchanged columns being a non-null pass-by-ref value. Now it's certainly possible that a set of regression tests would by chance fail to exercise that case --- but IMO the real problem right now is we aren't even trying to exercise EvalPlanQual; we're actively avoiding it because the current test infrastructure can't ensure deterministic results if any such situation arises. But my recollection of the parallel psql patch discussion is that it was rejected because nobody felt comfortable with the API design. Do we have any better ideas in that department yet? regards, tom lane
Tom Lane wrote: > But my recollection of the parallel psql patch discussion is that it was > rejected because nobody felt comfortable with the API design. Do we > have any better ideas in that department yet? It wasn't rejected AFAICT. A finalized API with which there was (almost?) no dissent was posted by you, after a design/path from Greg Stark. The problem is that nobody stepped up to implementing that spec. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.