Thread: thread safety on clients

thread safety on clients

From

Jaime Casanova

Date:

09 December 2009, 15:03:20

Hi,

I compiled current HEAD and trying to use pgbench, i initialized a
test database this way:
bin/pgbench -i -F80 -s100 test

and then run with this options:
bin/pgbench -c 50 -j 5 -l -t 20 test

and get this crash:
"""
starting vacuum...end.
TRAP: FailedAssertion("!((data - start) == data_size)", File:
"heaptuple.c", Line: 255)
Client 0 aborted in state 8. Probably the backend died while processing.
LOG:  server process (PID 30713) was terminated by signal 6: Aborted
TRAP: FailedAssertion("!((data - start) == data_size)", File:
"heaptuple.c", Line: 255)
Client 8 aborted in state 8. Probably the backend died while processing.
LOG:  terminating any other active server processes
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back
the current transaction and exit, because another server process
exited abnormally and possibly corrupted shared memory.
"""

if i remove the -j option then it runs without a problem

--
Atentamente,
Jaime Casanova
Soporte y capacitación de PostgreSQL
Asesoría y desarrollo de sistemas
Guayaquil - Ecuador
Cel. +59387171157

Re: thread safety on clients

From

Peter Eisentraut

Date:

10 December 2009, 18:59:21

On ons, 2009-12-09 at 14:02 -0500, Jaime Casanova wrote:
> Hi,
> 
> I compiled current HEAD and trying to use pgbench, i initialized a
> test database this way:
> bin/pgbench -i -F80 -s100 test
> 
> and then run with this options:
> bin/pgbench -c 50 -j 5 -l -t 20 test
> 
> and get this crash:
> """
> starting vacuum...end.
> TRAP: FailedAssertion("!((data - start) == data_size)", File:
> "heaptuple.c", Line: 255)
> Client 0 aborted in state 8. Probably the backend died while processing.
> LOG:  server process (PID 30713) was terminated by signal 6: Aborted
> TRAP: FailedAssertion("!((data - start) == data_size)", File:
> "heaptuple.c", Line: 255)
> Client 8 aborted in state 8. Probably the backend died while processing.
> LOG:  terminating any other active server processes
> WARNING:  terminating connection because of crash of another server process
> DETAIL:  The postmaster has commanded this server process to roll back
> the current transaction and exit, because another server process
> exited abnormally and possibly corrupted shared memory.
> """
> 
> if i remove the -j option then it runs without a problem

Possibly related to the incomplete removal of the enable-thread-safety
option that I just posted about.

Re: thread safety on clients

From

Bruce Momjian

Date:

10 December 2009, 20:55:25

Peter Eisentraut wrote:
> > starting vacuum...end.
> > TRAP: FailedAssertion("!((data - start) == data_size)", File:
> > "heaptuple.c", Line: 255)
> > Client 0 aborted in state 8. Probably the backend died while processing.
> > LOG:  server process (PID 30713) was terminated by signal 6: Aborted
> > TRAP: FailedAssertion("!((data - start) == data_size)", File:
> > "heaptuple.c", Line: 255)
> > Client 8 aborted in state 8. Probably the backend died while processing.
> > LOG:  terminating any other active server processes
> > WARNING:  terminating connection because of crash of another server process
> > DETAIL:  The postmaster has commanded this server process to roll back
> > the current transaction and exit, because another server process
> > exited abnormally and possibly corrupted shared memory.
> > """
> > 
> > if i remove the -j option then it runs without a problem
> 
> Possibly related to the incomplete removal of the enable-thread-safety
> option that I just posted about.

I thought about that but I can't figure out how that would affect
pgbench.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +

Re: thread safety on clients

From

Greg Smith

Date:

10 December 2009, 21:13:54

Bruce Momjian wrote: <blockquote cite="mid:200912110055.nBB0tFE29298@momjian.us" type="cite"><pre wrap="">Peter
Eisentrautwrote: </pre><blockquote type="cite"><blockquote type="cite"><pre wrap="">if i remove the -j option then it
runswithout a problem     </pre></blockquote><pre wrap="">Possibly related to the incomplete removal of the
enable-thread-safety
option that I just posted about.   </pre></blockquote><pre wrap="">
I thought about that but I can't figure out how that would affect
pgbench. </pre></blockquote> The "-j" option is the recent addition to pgbench that causes it to launch multiple client
threadswhen enabled, each handling a subset of the transactions.  There's blocks of codes in pgbench.c now that depend
onhaving sane values for thread safety in libpq.  That it may be detecting the wrong thing and operating in an unsafe
wayafter the recent change is what Peter's suggesting.  This is good, actually, because I don't think we had many
client-sidethread-safety tests floating around to catch problems in this area before.<br /><br /><pre
class="moz-signature"cols="72">-- 
 
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
<a class="moz-txt-link-abbreviated" href="mailto:greg@2ndQuadrant.com">greg@2ndQuadrant.com</a>  <a
class="moz-txt-link-abbreviated"href="http://www.2ndQuadrant.com">www.2ndQuadrant.com</a>
 
</pre>

Re: thread safety on clients

From

Bruce Momjian

Date:

10 December 2009, 21:31:40

Greg Smith wrote:
> Bruce Momjian wrote:
> > Peter Eisentraut wrote:
> >   
> >>> if i remove the -j option then it runs without a problem
> >>>       
> >> Possibly related to the incomplete removal of the enable-thread-safety
> >> option that I just posted about.
> >>     
> >
> > I thought about that but I can't figure out how that would affect
> > pgbench.
> >   
> The "-j" option is the recent addition to pgbench that causes it to 
> launch multiple client threads when enabled, each handling a subset of 
> the transactions.  There's blocks of codes in pgbench.c now that depend 
> on having sane values for thread safety in libpq.  That it may be 
> detecting the wrong thing and operating in an unsafe way after the 
> recent change is what Peter's suggesting.  This is good, actually, 
> because I don't think we had many client-side thread-safety tests 
> floating around to catch problems in this area before.

I can reproduce the crash here so I can see if I can find the cause.

However, the failure is happening in the _server_.  Threading is
unrelated to the server itself, only the client.  I suppose the first
test for me will be to test CVS before the thread change was made.

The failure is in heap_fill_tuple(), and I am unclear how that assert
could be getting triggered:
CONTEXT:  automatic analyze of table "test.public.pgbench_accounts"TRAP: FailedAssertion("!((data - start) ==
data_size)",File: "heaptuple.c", Line: 255)TRAP: FailedAssertion("!((data - start) == data_size)", File: "heaptuple.c",
Line:255)TRAP: FailedAssertion("!((data - start) == data_size)", File: "heaptuple.c", Line: 255)LOG:  server process
(PID6076) was terminated by signal 6: Abort trapLOG:  terminating any other active server processes
 

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +

Re: thread safety on clients

From

Bruce Momjian

Date:

10 December 2009, 22:23:51

Bruce Momjian wrote:
> Greg Smith wrote:
> > Bruce Momjian wrote:
> > > Peter Eisentraut wrote:
> > >   
> > >>> if i remove the -j option then it runs without a problem
> > >>>       
> > >> Possibly related to the incomplete removal of the enable-thread-safety
> > >> option that I just posted about.
> > >>     
> > >
> > > I thought about that but I can't figure out how that would affect
> > > pgbench.
> > >   
> > The "-j" option is the recent addition to pgbench that causes it to 
> > launch multiple client threads when enabled, each handling a subset of 
> > the transactions.  There's blocks of codes in pgbench.c now that depend 
> > on having sane values for thread safety in libpq.  That it may be 
> > detecting the wrong thing and operating in an unsafe way after the 
> > recent change is what Peter's suggesting.  This is good, actually, 
> > because I don't think we had many client-side thread-safety tests 
> > floating around to catch problems in this area before.
> 
> I can reproduce the crash here so I can see if I can find the cause.
> 
> However, the failure is happening in the _server_.  Threading is
> unrelated to the server itself, only the client.  I suppose the first
> test for me will be to test CVS before the thread change was made.
> 
> The failure is in heap_fill_tuple(), and I am unclear how that assert
> could be getting triggered:
> 
>     CONTEXT:  automatic analyze of table "test.public.pgbench_accounts"
>     TRAP: FailedAssertion("!((data - start) == data_size)", File: "heaptuple.c", Line: 255)
>     TRAP: FailedAssertion("!((data - start) == data_size)", File: "heaptuple.c", Line: 255)
>     TRAP: FailedAssertion("!((data - start) == data_size)", File: "heaptuple.c", Line: 255)
>     LOG:  server process (PID 6076) was terminated by signal 6: Abort trap
>     LOG:  terminating any other active server processes

OK, turns out I did break threading by not defining ENABLE_THREAD_SAFETY
in configure.  I have applied a patch to CVS to fix this.

However, the larger question is how did I crash a backend by not
defining this?  I guess it is possible for a client to crash the server
by having a misconfigured client --- I wasn't aware of that.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +

Re: thread safety on clients

From

Tom Lane

Date:

11 December 2009, 00:22:44

Greg Smith <greg@2ndquadrant.com> writes:
> The "-j" option is the recent addition to pgbench that causes it to 
> launch multiple client threads when enabled, each handling a subset of 
> the transactions.  There's blocks of codes in pgbench.c now that depend 
> on having sane values for thread safety in libpq.  That it may be 
> detecting the wrong thing and operating in an unsafe way after the 
> recent change is what Peter's suggesting.  This is good, actually, 
> because I don't think we had many client-side thread-safety tests 
> floating around to catch problems in this area before.

The report showed an assert inside the backend.  It really doesn't
matter *how* broken pgbench might be, it should not be able to cause
that.  My bet is that the real problem was a build inconsistency in
the backend.  Does "make distclean" and rebuild make it go away?
        regards, tom lane

Re: thread safety on clients

From

Jaime Casanova

Date:

11 December 2009, 00:34:01

On Thu, Dec 10, 2009 at 11:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> My bet is that the real problem was a build inconsistency in
> the backend.  Does "make distclean" and rebuild make it go away?
>

actually it was a clean build just after a cvs co (not an updated
tree), i build the binaries and installed it in just created
directory...
i will try again now with the patch Bruce just committed

--
Atentamente,
Jaime Casanova
Soporte y capacitación de PostgreSQL
Asesoría y desarrollo de sistemas
Guayaquil - Ecuador
Cel. +59387171157

Re: thread safety on clients

From

Jaime Casanova

Date:

11 December 2009, 01:11:38

On Thu, Dec 10, 2009 at 11:33 PM, Jaime Casanova
<jcasanov@systemguards.com.ec> wrote:
> On Thu, Dec 10, 2009 at 11:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>
>> My bet is that the real problem was a build inconsistency in
>> the backend.  Does "make distclean" and rebuild make it go away?
>>
>
> actually it was a clean build just after a cvs co (not an updated
> tree), i build the binaries and installed it in just created
> directory...
> i will try again now with the patch Bruce just committed
>

the problem has gone

--
Atentamente,
Jaime Casanova
Soporte y capacitación de PostgreSQL
Asesoría y desarrollo de sistemas
Guayaquil - Ecuador
Cel. +59387171157

Re: thread safety on clients

From

Alvaro Herrera

Date:

11 December 2009, 10:38:10

Jaime Casanova wrote:
> On Thu, Dec 10, 2009 at 11:33 PM, Jaime Casanova
> <jcasanov@systemguards.com.ec> wrote:
> > On Thu, Dec 10, 2009 at 11:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >>
> >> My bet is that the real problem was a build inconsistency in
> >> the backend.  Does "make distclean" and rebuild make it go away?
> >>
> >
> > actually it was a clean build just after a cvs co (not an updated
> > tree), i build the binaries and installed it in just created
> > directory...
> > i will try again now with the patch Bruce just committed
> 
> the problem has gone

Yes, but what if you test with the broken pgbench?  As Tom says, it
should not be able to crash the backend no matter what it does.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: thread safety on clients

From

Tom Lane

Date:

11 December 2009, 11:06:07

Alvaro Herrera <alvherre@commandprompt.com> writes:
> Yes, but what if you test with the broken pgbench?  As Tom says, it
> should not be able to crash the backend no matter what it does.

The crash is real --- I've replicated it here.  Still trying to figure
out what is the real cause.
        regards, tom lane

Re: thread safety on clients

From

Marko Kreen

Date:

11 December 2009, 11:28:37

On 12/11/09, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
>  > Yes, but what if you test with the broken pgbench?  As Tom says, it
>  > should not be able to crash the backend no matter what it does.
>
>
> The crash is real --- I've replicated it here.  Still trying to figure
>  out what is the real cause.

Several threads writing to single connection?

-- 
marko

Re: thread safety on clients

From

Tom Lane

Date:

11 December 2009, 14:08:44

I wrote:
> The crash is real --- I've replicated it here.  Still trying to figure
> out what is the real cause.

Okay, I've sussed it.  The crash itself is due to a memory management
mistake in the recently-rewritten EvalPlanQual code (pfree'ing a tuple
that we still have live Datum references to).  The reason the "broken"
pgbench is able to trigger it is that it generates concurrent updates
for the same row in pgbench_accounts with much higher frequency than
normal.  This forces the backends through the EvalPlanQual code, which
is broken and crashes.  QED.

I'll commit a fix for that shortly, but this reminds me once again that
the EvalPlanQual logic is desperately under-tested in our normal
regression testing.  We really need some kind of testing infrastructure
that would let us exercise concurrent-update cases a bit more than "not
at all".

Also, the reason why Bruce's mistake exposed this is interesting.
Omitting the #define for ENABLE_THREAD_SAFETY does not actually break
"pgbench -j" at all -- it has a fallback strategy that uses multiple
subprocesses instead of multiple threads.  However, it has only one
srandom() call, which occurs in the parent process before forking.
This means that the subprocesses all start with the same random number
state, which means they generate the same sequence of "random" account
IDs to update, which is why the probability of an update collision
requiring EvalPlanQual resolution is so high, especially at the start
of the run (and the crash does happen pretty much immediately after
starting pgbench).

I'm inclined to think this is bad, and we should fix pgbench to
re-randomize after forking.  If we don't, we'll have synchronized
behavior on some platforms and not others, which doesn't seem like a
good idea.  On the other hand, if you did want that type of behavior,
it's hard to see how you'd get it.  Is it worth trying to provide that
as a (probably non-default) mode in pgbench?  If so, how would you
do it on threaded platforms?
        regards, tom lane

Re: thread safety on clients

From

Greg Smith

Date:

11 December 2009, 15:49:00

Tom Lane wrote:
> Also, the reason why Bruce's mistake exposed this is interesting.
> Omitting the #define for ENABLE_THREAD_SAFETY does not actually break
> "pgbench -j" at all -- it has a fallback strategy that uses multiple
> subprocesses instead of multiple threads.  However, it has only one
> srandom() call, which occurs in the parent process before forking.
> This means that the subprocesses all start with the same random number
> state, which means they generate the same sequence of "random" account
> IDs to update

We just got that as a bug report the other day too, with suggested 
fixes:  http://archives.postgresql.org/pgsql-hackers/2009-12/msg00841.php

> I'm inclined to think this is bad, and we should fix pgbench to
> re-randomize after forking.  If we don't, we'll have synchronized
> behavior on some platforms and not others, which doesn't seem like a
> good idea.  On the other hand, if you did want that type of behavior,
> it's hard to see how you'd get it.  Is it worth trying to provide that
> as a (probably non-default) mode in pgbench?  If so, how would you
> do it on threaded platforms?
>   
It sounds like random pgbench run for a while would certainly expose the 
same thing you're concerned about eventually.  I doubt it's worth the 
trouble to codify a bug just because it found another bug (it's hard 
enough to maintain code that only has to work right).  If we want to 
stress this behavior, it would be easier to just test with a a bunch of 
clients going through a scale=1 database and use enough transactions to 
make update collisions likely.  I'm working on a guide to stress testing 
new alpha builds with pgbench that will be ready in time for alpha3.  I 
could easily add that as one of the suggested tests to run if you're 
concerned about getting more test coverage of that code path.

-- 
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com  www.2ndQuadrant.com

Re: thread safety on clients

From

Tom Lane

Date:

11 December 2009, 15:57:02

Greg Smith <greg@2ndquadrant.com> writes:
> It sounds like random pgbench run for a while would certainly expose the 
> same thing you're concerned about eventually.

Yeah.  Actually the odd thing about it is that the crash seemed to
invariably be on conflicting pgbench_accounts updates, which is a fairly
low-contention table in this test design (but the bug turned it into
high-contention).  What I would have expected is crashes on the very
similar updates to pgbench_branches, which is designed to be
high-contention.  It might be that there is some other effect going on
here that explains why that wasn't happening.  Need to go back and look
more closely.
        regards, tom lane

Re: thread safety on clients

From

Tom Lane

Date:

11 December 2009, 16:21:16

I wrote:
> ... What I would have expected is crashes on the very
> similar updates to pgbench_branches, which is designed to be
> high-contention.  It might be that there is some other effect going on
> here that explains why that wasn't happening.  Need to go back and look
> more closely.

... and the answer to that is that pgbench_branches isn't subject to the
bug, because its only pass-by-reference column happens to be filled with
all NULLs by the initialization step, unlike the accounts filler column
which happens to be filled with non-null strings.  Null values mean no
dangling pointers and no chance for a memory management issue.  So you
could have run this all day and never seen a crash on pgbench_branches
updates.  (If you manually set the filler column non-null before
starting a run, the unpatched code crashes instantly, even with a
non-bollixed pgbench.)

So, nothing to see here except lack of test coverage, move along.
        regards, tom lane

Re: thread safety on clients

From

Greg Stark

Date:

14 December 2009, 00:41:16

On Fri, Dec 11, 2009 at 6:08 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
> I'll commit a fix for that shortly, but this reminds me once again that
> the EvalPlanQual logic is desperately under-tested in our normal
> regression testing.  We really need some kind of testing infrastructure
> that would let us exercise concurrent-update cases a bit more than "not
> at all".

If i went back and cleaned up the parallel psql patch we would be able
to write tests which tested basic concurrent functionality such as
transactions blocking when they're supposed to. But that wouldn't
catch something like this I don't think, not unless it got triggered
pretty reliably by a simple evalplanqual recheck.

> I'm inclined to think this is bad, and we should fix pgbench to
> re-randomize after forking.  If we don't, we'll have synchronized
> behavior on some platforms and not others, which doesn't seem like a
> good idea.  On the other hand, if you did want that type of behavior,
> it's hard to see how you'd get it.  Is it worth trying to provide that
> as a (probably non-default) mode in pgbench?  If so, how would you
> do it on threaded platforms?

Well it's pretty useless for benchmarking unless someone comes up with
a different plan for resolving these kinds of conflicts and we wanted
to compare the costs. But it's clearly something we should do for
testing for precisely this kind of bug. EvalPlanQual could trigger all
kinds of bugs throughout the executor and it would be worth having
something like this to check for them.

I believe it's possible to give at least one of the random number
generating apis a preallocated buffer for it to use to store the
generator state, that could easily be per-thread state along with the
connection and other state. I'm not sure which api that was and
whether it's as portable or as good a generator as what we're using
now though.

--
greg

Re: thread safety on clients

From

Tom Lane

Date:

14 December 2009, 10:47:50

Greg Stark <gsstark@mit.edu> writes:
> On Fri, Dec 11, 2009 at 6:08 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I'll commit a fix for that shortly, but this reminds me once again that
>> the EvalPlanQual logic is desperately under-tested in our normal
>> regression testing. �We really need some kind of testing infrastructure
>> that would let us exercise concurrent-update cases a bit more than "not
>> at all".

> If i went back and cleaned up the parallel psql patch we would be able
> to write tests which tested basic concurrent functionality such as
> transactions blocking when they're supposed to. But that wouldn't
> catch something like this I don't think, not unless it got triggered
> pretty reliably by a simple evalplanqual recheck.

It would have been caught if anyone had tried an EvalPlanQual'd update on
a table with one of the unchanged columns being a non-null pass-by-ref
value.  Now it's certainly possible that a set of regression tests would
by chance fail to exercise that case --- but IMO the real problem right
now is we aren't even trying to exercise EvalPlanQual; we're actively
avoiding it because the current test infrastructure can't ensure
deterministic results if any such situation arises.

But my recollection of the parallel psql patch discussion is that it was
rejected because nobody felt comfortable with the API design.  Do we
have any better ideas in that department yet?
        regards, tom lane

Re: thread safety on clients

From

Alvaro Herrera

Date:

14 December 2009, 10:55:32

Tom Lane wrote:

> But my recollection of the parallel psql patch discussion is that it was
> rejected because nobody felt comfortable with the API design.  Do we
> have any better ideas in that department yet?

It wasn't rejected AFAICT.  A finalized API with which there was
(almost?) no dissent was posted by you, after a design/path from Greg
Stark.  The problem is that nobody stepped up to implementing that spec.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.